4.13.2

Changes from 4.12.67

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Currently the "Get started with on-premise host inventory" quickstart is delivered in the core console. If we are going to keep it there, we need to add the MCE or ACM operator as a prerequisite; otherwise it is very confusing.

Epic Goal

  • Make it possible to disable the console operator at install time, while still having a supported+upgradeable cluster.

Why is this important?

  • It's possible to disable console itself using spec.managementState in the console operator config. There is no way to remove the console operator, though. For clusters where an admin wants to completely remove console, we should give the option to disable the console operator as well.

Scenarios

  1. I'm an administrator who wants to minimize my OpenShift cluster footprint and who does not want the console installed on my cluster

Acceptance Criteria

  • It is possible at install time to opt-out of having the console operator installed. Once the cluster comes up, the console operator is not running.

Dependencies (internal and external)

  1. Composable cluster installation

Previous Work (Optional):

  1. https://docs.google.com/document/d/1srswUYYHIbKT5PAC5ZuVos9T2rBnf7k0F1WV2zKUTrA/edit#heading=h.mduog8qznwz
  2. https://docs.google.com/presentation/d/1U2zYAyrNGBooGBuyQME8Xn905RvOPbVv3XFw3stddZw/edit#slide=id.g10555cc0639_0_7

Open questions:

  1. The console operator manages the downloads deployment as well. Do we disable the downloads deployment? Long term we want to move to CLI manager: https://github.com/openshift/enhancements/blob/6ae78842d4a87593c63274e02ac7a33cc7f296c3/enhancements/oc/cli-manager.md

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Update the cluster-authentication-operator to not go degraded when it can’t determine the console URL. This risks masking certain cases where we would want to raise an error to the admin, but the expectation is that this failure mode is rare.

Risk could be avoided by looking at ClusterVersion's enabledCapabilities to decide if missing Console was expected or not (unclear if the risk is high enough to be worth this amount of effort).

AC: Update the cluster-authentication-operator to not go degraded when console config CRD is missing and ClusterVersion config has Console in enabledCapabilities.

We need to continue to maintain specific areas within storage. This epic captures that effort and tracks it across releases.

Goals

  • To allow OCP users and cluster admins to detect problems early and with as little interaction with Red Hat as possible.
  • When Red Hat is involved, make sure we have all the information we need from the customer, i.e. in metrics / telemetry / must-gather.
  • Reduce storage test flakiness so we can spot real bugs in our CI.

Requirements

Requirement    Notes  isMvp?
Telemetry             No
Certification         No
API metrics           No

Out of Scope

n/a

Background, and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets / BZs low.

Assumptions

Customer Considerations

Documentation Considerations

  • Target audience: internal
  • Updated content: none at this time.

Notes

In progress:

  • CI flakes:
    • Configurable timeouts for e2e tests
      • Azure is slow and times out often
      • Cinder times out formatting volumes
      • AWS resize test times out

 

High prio:

  • Env. check tool for VMware - users often mis-configure permissions there and blame OpenShift. If we had a tool they could run, it might report better errors.
    • Should it be part of the installer?
    • Spike exists
  • Add / use cloud API call metrics
    • Helps customers to understand why things are slow
    • Helps build cop to understand a flake
      • With a post-install step that filters data from Prometheus that’s still running in the CI job.
    • Ideas:
      • Cloud is throttling X% of API calls longer than Y seconds
      • Attach / detach / provisioning / deletion / mount / unmount / resize takes longer than X seconds?
    • Capture metrics of operations that are stuck and won’t finish.
      • Sweep operation map from executioner???
      • Report operation metric into the highest bucket after the bucket threshold (i.e. if 10minutes is the last bucket, report an operation into this bucket after 10 minutes and don’t wait for its completion)?
      • Ask the monitoring team?
    • Include in CSI drivers too.
      • With alerts too
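
The idea of reporting stuck operations into the highest bucket without waiting for them to finish could be sketched as follows. This is a minimal sketch under stated assumptions: the tracker class, its method names and the bucket bounds are hypothetical and are not taken from any existing operator or metrics library.

```python
import bisect

# Hypothetical bucket upper bounds in seconds. The last one (600 s = 10 minutes)
# plays the role of the "last bucket" threshold mentioned in the notes above.
BUCKETS = [1, 10, 60, 300, 600]


class OperationTracker:
    """Tracks in-flight storage operations and counts them into duration buckets."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = list(buckets)
        # One counter per bucket plus a final slot for anything beyond the last bound.
        self.counts = [0] * (len(self.buckets) + 1)
        self.in_flight = {}    # operation id -> start time
        self.reported = set()  # operations already counted as stuck

    def start(self, op_id, now):
        self.in_flight[op_id] = now

    def finish(self, op_id, now):
        started = self.in_flight.pop(op_id)
        if op_id in self.reported:
            # Already counted by sweep(); do not count it a second time.
            self.reported.discard(op_id)
            return
        self.counts[bisect.bisect_left(self.buckets, now - started)] += 1

    def sweep(self, now):
        """Count operations that exceeded the last bound without waiting for completion."""
        for op_id, started in self.in_flight.items():
            if op_id not in self.reported and now - started > self.buckets[-1]:
                self.counts[-1] += 1
                self.reported.add(op_id)
```

A periodic call to `sweep()` makes an operation that never completes visible in the metric once it passes the last threshold, while `finish()` avoids double counting if it eventually completes.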

Unsorted

  • As the number of storage operators grows, it would be good to have a Grafana board for storage operators
    • CSI driver metrics (from CSI sidecars + the driver itself  + its operator?)
    • CSI migration?
  • Get aggregated logs in cluster
    • They're rotated too soon
    • No logs from dead / restarted pods
    • No tools to combine logs from multiple pods (e.g. 3 controller managers)
  • What storage issues do customers have? It was 22% of all issues.
    • Insufficient docs?
    • Probably garbage
  • Document basic storage troubleshooting for our support engineers
    • What logs are useful when, what log level to use
    • This has been discussed during the GSS weekly team meeting; however, it would be beneficial to have this documented.
  • Common vSphere errors, their debugging and fixing. 
  • Document sig-storage flake handling - not all failed [sig-storage] tests are ours

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. New CSI driver releases in particular may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to the deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

There is a new driver release 5.0.0 since the last rebase that includes snapshot support:

https://github.com/kubernetes-sigs/ibm-vpc-block-csi-driver/releases/tag/v5.0.0

Rebase the driver on v5.0.0 and update the deployments in ibm-vpc-block-csi-driver-operator.
There are no corresponding changes in ibm-vpc-node-label-updater since the last rebase.

Background and Goal

Currently in OpenShift we do not support distributing hotfix packages to cluster nodes. In time-sensitive situations, a RHEL hotfix package can be the quickest route to resolving an issue. 

Acceptance Criteria

  1. Under guidance from Red Hat CEE, customers can deploy RHEL hotfix packages to MachineConfigPools.
  2. Customers can easily remove the hotfix when the underlying RHCOS image incorporates the fix.

Before we ship OCP CoreOS layering in https://issues.redhat.com/browse/MCO-165 we need to switch the format of what is currently `machine-os-content` to be the new base image.

The overall plan is:

  • Publish the new base image as `rhel-coreos-8` in the release image
  • Also publish the new extensions container (https://github.com/openshift/os/pull/763) as `rhel-coreos-8-extensions`
  • Teach the MCO to use this without also involving layering/build controller
  • Delete old `machine-os-content`

We need something in our repo /docs that we can point people to that briefly explains how to use "layering features" via the MCO in OCP (well, and with the understanding that OKD also uses the MCO).

Maybe this ends up in its own repo like https://github.com/coreos/coreos-layering-examples eventually, maybe it doesn't.

I'm thinking something like https://github.com/openshift/machine-config-operator/blob/layering/docs/DemoLayering.md back from when we did the layering branch, but actually matching what we have in our main branch

This is separate but probably related to what Colin started in the Docs Tracker. 

Feature Overview

  • Follow-up work for the new provider, Nutanix, to extend existing capabilities with new ones

Goals

  • Make Nutanix CSI Driver part of the CVO once the driver and the Operator have been open sourced by the vendor
  • Enable IPI for disconnected environments
  • Enable the UPI workflow
  • Nutanix CCM for the Node Controller
  • Enable Egress IP for the provider

Requirements

  • This section: a list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement                                             Notes                                                        isMvp?
CI - MUST be running successfully with test automation  This is a requirement for ALL features.                      YES
Release Technical Enablement                            Provide necessary release enablement details and documents.  YES

 

 

Epic Goal

  • Allow users to choose Nutanix platform integration (similar to vSphere) from AI SaaS

Why is this important?

  • Expand the RH offering beyond IPI

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:

In BE 2.13.0, in the Nutanix UMN flow, if machine_network = [], bootstrap validation fails.

How reproducible:

Trying to reproduce

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

 

Why?

  • Decouple control and data plane. 
    • Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
  • Improve security
    • Shift credentials out of cluster that support the operation of core platform vs workload
  • Improve cost
    • Allow a user to toggle what they don’t need.
    • Ensure a smooth path to scale to 0 workers and upgrade with 0 workers.

 

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure , and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

 

 

Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

Epic Goal

  • To improve debuggability of ovn-k in HyperShift
  • To verify the stability of ovn-k in HyperShift
  • To introduce an EgressIP reachability check that will work in HyperShift

Why is this important?

  • ovn-k is supposed to be GA in 4.12. We need to make sure it is stable, that we know the limitations, and that we are able to debug it as we can on a self-hosted cluster.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. This will need consultation with the people working on HyperShift

Previous Work (Optional):

  1. https://issues.redhat.com/browse/SDN-2589

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

CNCC was moved to the management cluster and it should use proxy settings defined for the management cluster.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

In testing dual stack on vSphere we discovered that kubelet will not allow us to specify two IPs on any platform except baremetal. We have a couple of options to deal with that:

  • Wait for https://github.com/kubernetes/enhancements/pull/3706 to merge and be implemented upstream. This almost certainly means we miss 4.13.
  • Wait for https://github.com/kubernetes/enhancements/pull/3706 to merge and then implement the design downstream. This involves risk of divergence from the eventual upstream design. We would probably only ship this way as tech preview and provide support exceptions for major customers.
  • Remove the setting of nodeip for kubelet. This should get around the limitation on providing dual IPs, but it means we're reliant on the default kubelet IP selection logic, which is...not good. We'd probably only be able to support this on single-NIC network configurations.

GA CgroupV2 in 4.13 

Default with RHEL 9 

  1. Day 0 support for 4.13 where the customer is able to change v1 (default) to v2
  2. Day 1 where the customer is able to change v1 (default) to v2
  3. Documentation on migration
  4. Pinning existing clusters to v1 before upgrade to 4.13

Starting with OCP 4.13, RHCOS nodes come up with the "CGroupsV2" configuration by default.

Command to verify on any OCP cluster node:

stat -c %T -f /sys/fs/cgroup/

So, to avoid unexpected complications, if the `cgroupMode` is found to be empty in the `nodes.config` resource, `CGroupsv1` configuration needs to be explicitly set using the `machine-config-operator`
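
The output of the `stat` check above can be interpreted mechanically: `cgroup2fs` indicates cgroup v2, while `tmpfs` indicates the cgroup v1 hierarchy. The following is a small sketch of that interpretation; the function names are illustrative and not part of any OpenShift tooling.

```python
import subprocess


def cgroup_version(fs_type: str) -> str:
    """Map the filesystem type printed by `stat -c %T -f /sys/fs/cgroup/` to a cgroup version."""
    fs_type = fs_type.strip()
    if fs_type == "cgroup2fs":
        return "v2"
    if fs_type == "tmpfs":
        return "v1"
    return "unknown"


def detect_cgroup_version() -> str:
    """Run the stat command on the local node and interpret its output."""
    out = subprocess.run(
        ["stat", "-c", "%T", "-f", "/sys/fs/cgroup/"],
        check=True, capture_output=True, text=True,
    ).stdout
    return cgroup_version(out)
```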

This user story tracks the changes required to remove the TechPreview related checks in the MCO code to graduate the CGroupsV2 feature to GA.

Feature

As an Infrastructure Administrator, I want to deploy OpenShift on vSphere with supervisor (aka Masters) and worker nodes (from a MachineSet) across multiple vSphere data centers and multiple vSphere clusters using full stack automation (IPI) and user provided infrastructure (UPI).

 

MVP

Install OpenShift on vSphere using IPI / UPI in multiple vSphere data centers (regions) and multiple vSphere clusters in 1 vCenter, all in the same IPv4 subnet (in the same physical location).

  • Kubernetes Region contains vSphere datacenter and (single) vCenter name
  • Kubernetes Zone contains vSphere cluster, resource pool, datastore, network (port group)
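
The region and zone mapping above can be pictured as a small data model. This is an illustrative sketch only: the class and field names are hypothetical and do not reproduce the installer's or the Infrastructure API's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A Kubernetes region: one vSphere datacenter within a single vCenter."""
    name: str
    vcenter: str
    datacenter: str


@dataclass(frozen=True)
class Zone:
    """A Kubernetes zone: a vSphere cluster with its resource pool, datastore and network."""
    name: str
    region: Region
    cluster: str
    resource_pool: str
    datastore: str
    network: str  # port group


def zones_by_region(zones):
    """Group zone names under the name of the region they belong to."""
    grouped = defaultdict(list)
    for zone in zones:
        grouped[zone.region.name].append(zone.name)
    return dict(grouped)
```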

Out of scope

  • Converting a non-zonal configuration (i.e. an existing OpenShift installation without 1+ zones) to a zonal configuration (1+ zones) is not supported, but zonal UPI installation by the Infrastructure Administrator is permitted.

Scenarios for consideration:

  • OpenShift in vSphere across different zones to avoid single points of failure, whereby each node is in a different ESX cluster within the same vSphere datacenter, but in different networks.
  • OpenShift in vSphere across multiple vSphere datacenters, while ensuring workers and masters are spread across 2 different datacenters in different subnets. (RFE-845, RFE-459).

Acceptance criteria:

  • Ensure vSphere IPI can successfully be deployed with ODF across the 3 zones (vSphere clusters) within the same vCenter [like we do with AWS, GCP & Azure].
  • Ensure zonal configuration in vSphere using UPI is documented and tested.

References: 

Epic Goal*

We need SPLAT-594 to be reflected in our CSI driver operator to support vSphere topology of storage GA.
 
Why is this important? (mandatory)

See SPLAT-320.
 
Scenarios (mandatory) 

As a user, I want to edit the Infrastructure object after OCP installation (or upgrade) to update cluster topology, so all newly provisioned PVs will get the new topology labels.

(With vSphere topology GA, we expect that users will be allowed to edit Infrastructure and change the cluster topology after cluster installation.)
 
Dependencies (internal and external) (mandatory)

  • SPLAT: [vsphere] Support Multiple datacenters and clusters GA.

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

It's possible that Infrastructure will remain read-only. No code on Storage side is expected then.

Done - Checklist (mandatory)

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

In a zonal deployment, it is possible that new failure-domains are added to the cluster.

In that case, we will most likely have to discover these new failure-domains and tag the datastores in them, so that topology-aware provisioning can work.

When STOR-1145 is merged, make sure that these new metrics are reported via telemetry to us.

 

Guide: https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#sending-metrics-via-telemetry-step-by-step

Exit criteria:

  • verify that metrics are reported in telemetry? I am not sure we have capabilities to test that, all code will be in monitoring repos.

I was thinking we will probably need a detailed metric for topology information about the cluster, such as how many failure-domains, how many datacenters and how many datastores.

We should create a metric and an alert if both ClusterCSIDriver and Infra object specify a topology.

Such a configuration is supported, and the Infra object takes precedence, but it indicates a user error, so the user should be alerted about it.
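
The condition behind such an alert could look roughly like the sketch below. It works on plain dictionaries; the field paths (`spec.driverConfig.vSphere.topologyCategories` and `spec.platformSpec.vsphere.failureDomains`) are assumptions made for illustration and should be checked against the actual API types before use.

```python
def _dig(obj, *keys):
    """Walk nested dictionaries, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def topology_conflict(cluster_csi_driver: dict, infrastructure: dict) -> bool:
    """Return True when both objects specify a topology (assumed field paths)."""
    categories = _dig(cluster_csi_driver, "spec", "driverConfig", "vSphere", "topologyCategories")
    failure_domains = _dig(infrastructure, "spec", "platformSpec", "vsphere", "failureDomains")
    return bool(categories) and bool(failure_domains)
```

A metric derived from this boolean (1 when both are set, 0 otherwise) would be enough to drive the alert.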

As an OpenShift engineer, I want to make changes to various OpenShift components so that vSphere zonal installation is considered GA.

As an OpenShift engineer, I need to follow the process to move the API from Tech Preview to GA so it can be used by clusters not installed with TechPreviewNoUpgrade.

more to follow...

As an OpenShift engineer, I want to deprecate existing vSphere platform spec parameters so that they can eventually be removed in favor of zonal.

Feature Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

The goal of this feature is to provide a consistent, predictable and deterministic approach on how the default storage class(es) is managed.

 
Why is this important? (mandatory)

The current default storage class implementation has corner cases which can result in PVCs staying in Pending because either no default storage class or multiple default storage classes are defined.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

 

No default storage class

In some cases there is no default SC defined. This can happen during OCP deployment, where components such as the registry request a PV before the SCs have been defined. It can also happen during a change of the default SC: there is no default between the moment the admin unsets the current one and the moment they set the new one.

 

  1. The admin marks the current default SC1 as non-default.
  2. Another user creates a PVC requesting a default SC, by leaving pvc.spec.storageClassName=nil. The default SC does not exist at this point, therefore the admission plugin leaves the PVC untouched with pvc.spec.storageClassName=nil.
  3. The admin marks SC2 as default.
  4. The PV controller, when reconciling the PVC, updates pvc.spec.storageClassName=nil to the new SC2.
  5. The PV controller uses the new SC2 when binding / provisioning the PVC.

  1. The installer creates a PVC for the image registry first, requesting the default storage class by leaving pvc.spec.storageClassName=nil.
  2. The installer creates a default SC.
  3. The PV controller, when reconciling the PVC, updates pvc.spec.storageClassName=nil to the new default SC.
  4. The PV controller uses the new default SC when binding / provisioning the PVC.
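
Both flows above rely on the same step: a PVC that was created with no storage class gets one assigned later, once a default exists. A minimal sketch of that step, using plain dictionaries instead of real API objects (the dictionary layout and function names are hypothetical; the annotation key is the standard Kubernetes default-class annotation):

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def find_default(storage_classes):
    """Return the name of the first storage class marked as default, or None."""
    for sc in storage_classes:
        if sc.get("annotations", {}).get(DEFAULT_ANNOTATION) == "true":
            return sc["name"]
    return None


def reconcile_pvc(pvc, storage_classes):
    """Fill in storageClassName on a PVC that left it unset, once a default exists."""
    if pvc.get("storageClassName") is None:
        default = find_default(storage_classes)
        if default is not None:
            pvc["storageClassName"] = default
    return pvc
```

Until a default appears, repeated reconciliation leaves the PVC untouched, which matches the period in which the claim simply waits.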

Multiple Storage Classes

In some cases there are multiple default SCs. This can be an admin mistake (forgetting to unset the old one) or can happen during the period when a new default SC has been created but the old one is still present.

New behavior:

  1. Create a default storage class A
  2. Create a default storage class B
  3. Create a PVC with pvc.spec.storageClassName = nil

-> the PVC will get the default storage class with the newest CreationTimestamp (i.e. B) and no error should show.

-> admin will get an alert that there are multiple default storage classes and they should do something about it.
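
The selection rule described above (newest CreationTimestamp wins, plus an alert when more than one default exists) can be sketched as follows. The dictionary layout and function name are illustrative assumptions, not the PV controller's actual code.

```python
from datetime import datetime, timezone

DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def pick_default(storage_classes):
    """Return (name of the newest default storage class or None, whether an alert is due)."""
    defaults = [
        sc for sc in storage_classes
        if sc.get("annotations", {}).get(DEFAULT_ANNOTATION) == "true"
    ]
    if not defaults:
        return None, False
    newest = max(defaults, key=lambda sc: sc["creationTimestamp"])
    # More than one default is tolerated, but the admin should be alerted.
    return newest["name"], len(defaults) > 1
```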

 

CSI drivers that are shipped as part of OCP

The CSI drivers we ship as part of OCP are deployed and managed by RH operators. These operators automatically create a default storage class. Some customers don't like this approach and prefer to:

 

  1. Create their own default storage class
  2. Have no default storage class in order to disable dynamic provisioning

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

No external dependencies.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

This can confuse customers, because it changes the default behavior they are used to. It needs to be carefully documented.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

oc-mirror is a GA product as of OpenShift 4.11.

The goal of this feature is to solve any future customer request for new features or capabilities in oc-mirror.

In the 4.12 release, a new feature was introduced to oc-mirror allowing it to use OCI FBC catalogs as a starting point for mirroring operators.

Overview

As an oc-mirror user, I would like the OCI FBC feature to be stable
so that I can use it in a production ready environment
and to make the new feature and all existing features of oc-mirror seamless

Current Status

This feature is ring-fenced in the oc-mirror repository. It uses the following flags to achieve this, so as not to cause any breaking changes in the current oc-mirror functionality:

  • --use-oci-feature
  • --oci-feature-action (copy or mirror)
  • --oci-registries-config

The OCI FBC (file-based catalog) format has been delivered for Tech Preview in 4.12.

Tech Enablement slides can be found here https://docs.google.com/presentation/d/1jossypQureBHGUyD-dezHM4JQoTWPYwiVCM3NlANxn0/edit#slide=id.g175a240206d_0_7

Design doc is in https://docs.google.com/document/d/1-TESqErOjxxWVPCbhQUfnT3XezG2898fEREuhGena5Q/edit#heading=h.r57m6kfc2cwt (also contains latest design discussions around the stories of this epic)

Link to previous working epic https://issues.redhat.com/browse/CFE-538

Contacts for the OCI FBC feature

 

As an IBM user, I'd like to be able to specify the destination of the OCI FBC catalog in ImageSetConfig

So that I can control where that image is pushed to on the disconnected destination registry, because the path on disk to that OCI catalog doesn't make sense to be used in the component paths of the destination catalog.

Expected Inputs and Outputs - Counter Proposal

Examples provided assume that the current working directory is set to /tmp/cwdtest.

Instead of introducing a targetNamespace which is used in combination with targetName, this counter proposal introduces a targetCatalog field which supersedes the existing targetName field (which would be marked as deprecated). Users should transition from using targetName to targetCatalog, but if both happen to be specified, the targetCatalog is preferred and targetName is ignored. Any ISC that currently uses targetName alone should continue to be used as currently defined.

The rationale for targetCatalog is that some customers will have restrictions on where images can be placed. All IBM images always use a namespace. We therefore need a way to indicate where the CATALOG image is located within the context of the target registry... it can't just be placed in the root, so we need a way to configure this.

The targetCatalog field consists of an optional namespace followed by the target image name, described in extended Backus–Naur form below:

target-catalog = [namespace '/'] target-name
target-name    = path-component
namespace      = path-component ['/' path-component]*
path-component = alpha-numeric [separator alpha-numeric]*
alpha-numeric  = /[a-z0-9]+/
separator      = /[_.]|__|[-]*/

The target-name portion of targetCatalog represents the image name in the final destination registry, and matches the definition/purpose of the targetName field. The namespace is only used for "placement" of the catalog image into the right "hierarchy" in the target registry. The target-name portion will be used in the catalog source metadata name, the file name of the catalog source, and target image reference.

Examples:

  • with namespace:

targetCatalog: foo/bar/baz/ibm-zcon-zosconnect-example

  • without namespace:

targetCatalog: ibm-zcon-zosconnect-example
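
The grammar above translates fairly directly into a regular expression. The following is a sketch of a validator that splits a targetCatalog value into its namespace and target-name parts; the function name is hypothetical and this is not oc-mirror's own implementation.

```python
import re

# path-component = alpha-numeric [separator alpha-numeric]*
# The grammar's "[-]*" separator alternative can match the empty string. "-+" is used
# here instead: it accepts the same strings, because an empty separator between two
# alpha-numeric runs is just one longer alpha-numeric run.
PATH_COMPONENT = r"[a-z0-9]+(?:(?:__|[_.]|-+)[a-z0-9]+)*"

# target-catalog = [namespace '/'] target-name
TARGET_CATALOG = re.compile(
    rf"(?:(?P<namespace>{PATH_COMPONENT}(?:/{PATH_COMPONENT})*)/)?"
    rf"(?P<name>{PATH_COMPONENT})"
)


def split_target_catalog(value: str):
    """Return (namespace, target_name); namespace is "" when none is given."""
    match = TARGET_CATALOG.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid targetCatalog: {value!r}")
    return match.group("namespace") or "", match.group("name")
```

Applied to the two examples above, the first yields the namespace `foo/bar/baz` with the target name `ibm-zcon-zosconnect-example`, and the second yields an empty namespace with the same target name.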

Simple Flow

FBC image from docker registry

Command:

oc mirror -c /Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml --dest-skip-tls --dest-use-http docker://localhost:5000

ISC

/Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig: 
  local: 
    path: /tmp/localstorage
mirror: 
  operators: 
  - catalog: icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:6f02ecef46020bcd21bdd24a01f435023d5fc3943972ef0d9769d5276e178e76

ICSP

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/imageContentSourcePolicy.yaml

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata: 
  labels: 
    operators.openshift.org/catalog: "true"
  name: operator-0
spec: 
  repositoryDigestMirrors: 
  - mirrors: 
    - localhost:5000/cpopen
    source: icr.io/cpopen
  - mirrors: 
    - localhost:5000/openshift4
    source: registry.redhat.io/openshift4

CatalogSource

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/catalogSource-ibm-zcon-zosconnect-catalog.yaml

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata: 
  name: ibm-zcon-zosconnect-catalog
  namespace: openshift-marketplace
spec: 
  image: localhost:5000/cpopen/ibm-zcon-zosconnect-catalog:6f02ec
  sourceType: grpc

Simple Flow With Target Namespace

FBC image from docker registry (putting images into a destination "namespace")

Command:

oc mirror -c /Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml --dest-skip-tls --dest-use-http docker://localhost:5000/foo

ISC

/Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig: 
  local: 
    path: /tmp/localstorage
mirror: 
  operators: 
  - catalog: icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:6f02ecef46020bcd21bdd24a01f435023d5fc3943972ef0d9769d5276e178e76

ICSP

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/imageContentSourcePolicy.yaml

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata: 
  labels: 
    operators.openshift.org/catalog: "true"
  name: operator-0
spec: 
  repositoryDigestMirrors: 
  - mirrors: 
    - localhost:5000/foo/cpopen
    source: icr.io/cpopen
  - mirrors: 
    - localhost:5000/foo/openshift4
    source: registry.redhat.io/openshift4

CatalogSource

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/catalogSource-ibm-zcon-zosconnect-catalog.yaml

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata: 
  name: ibm-zcon-zosconnect-catalog
  namespace: openshift-marketplace
spec: 
  image: localhost:5000/foo/cpopen/ibm-zcon-zosconnect-catalog:6f02ec
  sourceType: grpc

Simple Flow With TargetCatalog / TargetTag

FBC image from docker registry (overriding the catalog name and tag)

Command:

oc mirror -c /Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml --dest-skip-tls --dest-use-http docker://localhost:5000

ISC

/Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig: 
  local: 
    path: /tmp/localstorage
mirror: 
  operators: 
  - catalog: icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:6f02ecef46020bcd21bdd24a01f435023d5fc3943972ef0d9769d5276e178e76
    targetCatalog: cpopen/ibm-zcon-zosconnect-example # NOTE: namespace now has to be provided along with the 
                                                      # target catalog name to preserve the namespace in the resulting image
    targetTag: v123

ICSP

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/imageContentSourcePolicy.yaml

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata: 
  labels: 
    operators.openshift.org/catalog: "true"
  name: operator-0
spec: 
  repositoryDigestMirrors: 
  - mirrors: 
    - localhost:5000/cpopen
    source: icr.io/cpopen
  - mirrors: 
    - localhost:5000/openshift4
    source: registry.redhat.io/openshift4

CatalogSource

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/catalogSource-ibm-zcon-zosconnect-example.yaml

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata: 
  name: ibm-zcon-zosconnect-example
  namespace: openshift-marketplace
spec: 
  image: localhost:5000/cpopen/ibm-zcon-zosconnect-example:v123
  sourceType: grpc

OCI Flow

FBC image from OCI path

In this example we're suggesting the use of a targetCatalog field.

Command:

oc mirror -c /Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml --dest-skip-tls --dest-use-http --use-oci-feature docker://localhost:5000

ISC

/Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig: 
  local: 
    path: /tmp/localstorage
mirror: 
  operators: 
  - catalog: oci:///foo/bar/baz/ibm-zcon-zosconnect-catalog/amd64 # This is just a path to the catalog and has no special meaning
    targetCatalog: foo/bar/baz/ibm-zcon-zosconnect-example # <--- REQUIRED when using OCI and optional for docker images 
                                                           #               value is used within the context of the target registry
    # targetTag: v123                                      # <--- OPTIONAL

ICSP

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/imageContentSourcePolicy.yaml

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata: 
  labels: 
    operators.openshift.org/catalog: "true"
  name: operator-0
spec: 
  repositoryDigestMirrors: 
  - mirrors: 
    - localhost:5000/cpopen
    source: icr.io/cpopen
  - mirrors: 
    - localhost:5000/openshift4
    source: registry.redhat.io/openshift4

CatalogSource

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/catalogSource-ibm-zcon-zosconnect-example.yaml

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata: 
  name: ibm-zcon-zosconnect-example
  namespace: openshift-marketplace
spec: 
  image: localhost:5000/foo/bar/baz/ibm-zcon-zosconnect-example:6f02ec # Example uses "targetCatalog" set to 
                                                                       # "foo/bar/baz/ibm-zcon-zosconnect-example" at the 
                                                                       # destination registry localhost:5000
  sourceType: grpc

OCI Flow With Namespace

FBC image from OCI path (putting images into a destination "namespace" named "abc")

Command:

oc mirror -c /Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml --dest-skip-tls --dest-use-http --use-oci-feature docker://localhost:5000/abc

ISC

/Users/jhunkins/go/src/github.com/jchunkins/oc-mirror/ImageSetConfiguration.yml

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig: 
  local: 
    path: /tmp/localstorage
mirror: 
  operators: 
  - catalog: oci:///foo/bar/baz/ibm-zcon-zosconnect-catalog/amd64 # This is just a path to the catalog and has no special meaning
    targetCatalog: foo/bar/baz/ibm-zcon-zosconnect-example # <--- REQUIRED when using OCI and optional for docker images 
                                                           #               value is used within the context of the target registry
    # targetTag: v123                                      # <--- OPTIONAL

ICSP

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/imageContentSourcePolicy.yaml

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata: 
  labels: 
    operators.openshift.org/catalog: "true"
  name: operator-0
spec: 
  repositoryDigestMirrors: 
  - mirrors: 
    - localhost:5000/abc/cpopen
    source: icr.io/cpopen
  - mirrors: 
    - localhost:5000/abc/openshift4
    source: registry.redhat.io/openshift4

CatalogSource

/tmp/cwdtest/oc-mirror-workspace/results-1675716807/catalogSource-ibm-zcon-zosconnect-example-catalog.yaml

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata: 
  name: ibm-zcon-zosconnect-example
  namespace: openshift-marketplace
spec: 
  image: localhost:5000/abc/foo/bar/baz/ibm-zcon-zosconnect-example:6f02ec # Example uses "targetCatalog" set to 
                                                                           # "foo/bar/baz/ibm-zcon-zosconnect-example" at the
                                                                           # destination registry localhost:5000/abc
  sourceType: grpc
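
The CatalogSource images in the flows above appear to follow one naming pattern: destination registry (plus any namespace given on the command line), then the targetCatalog value (or the source repository path when none is set), then targetTag (or a short digest prefix when none is set). The sketch below is only an illustration inferred from those examples, not oc-mirror's actual code; the function name `catalog_destination` and its parameters are hypothetical, and the six-character default tag is an assumption read off the `6f02ec` tags shown above.

```python
from typing import Optional


def catalog_destination(dest: str, source_repo: str, digest: str,
                        target_catalog: Optional[str] = None,
                        target_tag: Optional[str] = None) -> str:
    """Build the mirrored catalog reference (hypothetical helper).

    dest           -- destination registry plus optional namespace,
                      e.g. "localhost:5000/abc"
    source_repo    -- repository path of the source catalog,
                      e.g. "cpopen/ibm-zcon-zosconnect-catalog"
    digest         -- digest of the source catalog, e.g. "sha256:6f02ec..."
    target_catalog -- optional override of the repository path
    target_tag     -- optional override of the tag
    """
    repo = target_catalog if target_catalog else source_repo
    # Assumption: the default tag is the first six hex characters of the digest.
    tag = target_tag if target_tag else digest.split(":", 1)[-1][:6]
    return f"{dest.rstrip('/')}/{repo.strip('/')}:{tag}"


DIGEST = "sha256:6f02ecef46020bcd21bdd24a01f435023d5fc3943972ef0d9769d5276e178e76"

# OCI flow with namespace "abc": targetCatalog set, targetTag omitted.
print(catalog_destination("localhost:5000/abc",
                          "cpopen/ibm-zcon-zosconnect-catalog", DIGEST,
                          target_catalog="foo/bar/baz/ibm-zcon-zosconnect-example"))
```

With `target_tag="v123"` and `target_catalog="cpopen/ibm-zcon-zosconnect-example"` against `localhost:5000`, the same helper yields the image used in the TargetCatalog / TargetTag flow.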

WHAT

Refer to the engineering notes document https://docs.google.com/document/d/1zZ6FVtgmruAeBoUwt4t_FoZH2KEm46fPitUB23ifboY/edit#heading=h.6pw5r5w2r82, steps 2-7.

 

Acceptance Criteria

  • Code cleanup and formatting into functions
  • Ensure good commenting
  • Implement correct code functionality
  • Ensure the OCI mirrorToMirror functionality works correctly
  • Update unit tests

As IBM, I would like to use oc-mirror with the --use-oci-feature flag and ImageSetConfigs containing OCI-FBC operator catalogs to mirror these catalogs to a connected registry
so that, regarding OCI FBC catalogs:

  • all bundles specified in the ImageSetConfig and their related images are mirrored from their source registry to the destination registry
  • and the catalogs are mirrored from the local disk to the destination registry
  • and the ImageContentSourcePolicy and CatalogSource files are generated correctly

and that regarding releases, additional images, helm charts:

  • The images that are selected for mirroring are mirrored to the destination registry using the MirrorToMirror workflow

As an oc-mirror user I want a well-documented and intuitive process
so that I can effectively and efficiently deliver image artifacts in both connected and disconnected installs with no impact on my current workflow

Glossary:

  • OCI-FBC operator catalog: catalog image in oci format saved to disk, referenced with oci://path-to-image
  • registry based operator catalog: catalog image hosted on a container registry.

References:

 

Acceptance criteria:

  • No regression on oc-mirror use cases that are not using OCI-FBC feature
  • mirrorToMirror use case with oci feature flag should be successful when all operator catalogs in ImageSetConfig are OCI-FBC:
    • oc-mirror -c config.yaml docker://remote-registry --use-oci-feature succeeds
    • All release images, helm charts, additional images are mirrored to the remote-registry in an incremental manner (only new images are mirrored based on contents of the storageConfig)
    • All OCI-FBC catalogs, selected bundles and their related images are mirrored to the remote-registry, and the corresponding CatalogSource and ImageContentSourcePolicy are generated
    • All registry based catalogs, selected bundles and their related images are mirrored to the remote-registry, and the corresponding CatalogSource and ImageContentSourcePolicy are generated
  • mirrorToDisk use case with the oci feature flag is forbidden. The following command should fail:
    • oc-mirror --from=seq_xx_tar docker://remote-registry --use-oci-feature
  • diskToMirror use case with oci feature flag is forbidden. The following command should fail:

Feature Overview

Goals

  • Support OpenShift to be deployed from day-0 on AWS Local Zones
  • Support an existing OpenShift cluster to deploy compute Nodes on AWS Local Zones (day-2)

AWS Local Zones support - feature delivered in phases:

  • Phase 0 (OCPPLAN-9630): Document how to create compute nodes on AWS Local Zones in day-0 (SPLAT-635)
  • Phase 1 (OCPBU-2): Create edge compute pool to generate MachineSets for nodes with NoSchedule taints when installing a cluster in an existing VPC with AWS Local Zone subnets (SPLAT-636)
  • Phase 2 (OCPBU-351): Installer automates network resources creation on Local Zone based on the edge compute pool (SPLAT-657)

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

Epic Goal

  • Admins can create a compute pool named `edge` on the AWS platform to set up Local Zone MachineSets.
  • Admins can select and configure subnets on Local Zones before cluster creation.
  • Ensure the installer allows creating a new machine pool for `edge` workloads
  • Ensure the installer can create the MachineSet with `NoSchedule` taints on edge machine pools.
  • Ensure Local Zone subnets will not be used on `worker` compute pools or control planes.
  • Ensure the Wavelength zone will not be used in any compute pool
  • Ensure the Cluster Network MTU manifest is created when Local Zone subnets are added when installing a cluster in an existing VPC
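
The pool-assignment rules in this list can be illustrated with a small sketch. It is not the installer's implementation (the installer is written in Go); the function name `assign_zones_to_pools` and the `excluded` bucket are hypothetical. The zone-type strings are the values AWS reports for a zone (`availability-zone`, `local-zone`, `wavelength-zone`).

```python
from typing import Dict, List


def assign_zones_to_pools(zones: Dict[str, str]) -> Dict[str, List[str]]:
    """Split zones into compute pools by zone type (hypothetical helper).

    zones maps a zone name to the zone type reported by AWS.
    """
    pools: Dict[str, List[str]] = {"worker": [], "edge": [], "excluded": []}
    for name, zone_type in sorted(zones.items()):
        if zone_type == "availability-zone":
            # Regular zones back the worker pool (and control planes).
            pools["worker"].append(name)
        elif zone_type == "local-zone":
            # Local Zones are only ever placed in the edge pool.
            pools["edge"].append(name)
        elif zone_type == "wavelength-zone":
            # Wavelength zones are not used in any compute pool.
            pools["excluded"].append(name)
        else:
            raise ValueError(f"unknown zone type {zone_type!r} for {name}")
    return pools


example = {
    "us-east-1a": "availability-zone",
    "us-east-1b": "availability-zone",
    "us-east-1-nyc-1a": "local-zone",
    "us-east-1-wl1-bos-wlz-1": "wavelength-zone",
}
print(assign_zones_to_pools(example))
```

Keeping the excluded zones in the result, rather than dropping them silently, makes it easier to report why a subnet was ignored.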

Why is this important?

Scenarios

User Stories

  • As a cluster admin, I want the ability to specify a set of subnets on the AWS
    Local Zone locations to deploy worker nodes, so I can further create custom
    applications to deliver low latency to my end users.
  • As a cluster admin, I would like to create a cluster extending worker nodes to
    the edge of the AWS cloud provider with Local Zones, so I can further create
    custom applications to deliver low latency to my end users.
  • As a cluster admin, I would like to select existing subnets from the local and
    the parent region zones, to install a cluster, so I can manage networks with
    my automation.
  • As a cluster admin, I would like to install OpenShift clusters, extending the
    compute nodes to the Local Zones in my day zero operations without needing to
    set up the network and compute dependencies, so I can speed up the edge adoption
    in my organization using OKD/OCP.

Acceptance Criteria

  • The enhancement must be accepted and merged
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The installer implementation must be merged

Dependencies (internal and external)

  1.  

Previous Work (Optional):

  1. https://issues.redhat.com/browse/SPLAT-635 

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

  • Support deploying OCP in “GCP Service Project” while networks are defined in “GCP Host Project”. 
  • Enable OpenShift IPI Installer to deploy OCP in “GCP Service Project” while networks are defined in “GCP Host Project”
  • “GCP Service Project” is the project from which the OpenShift installer is launched.
  • “GCP Host Project” is the project where the shared networks used by the OCP machines are defined.
  • Customers use a shared VPC and have a distributed network spanning the projects.

Goals

  • As a user, I want to be able to deploy OpenShift on Google Cloud using XPN, where networks and other resources are deployed in a shared "Host Project" while the user bootstraps the installation from a "Service Project", so that I can follow Google's architecture best practices

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • Enable OpenShift IPI Installer to deploy OCP to a shared VPC in GCP.
  • The host project is where the VPC and subnets are defined. Those networks are shared to one or more service projects.
  • Objects created by the installer are created in the service project where possible. Firewall rules may be the only exception.
  • Documentation outlines the needed minimal IAM for both the host and service project.

Why is this important?

  • Shared VPCs are a feature of GCP to enable granular separation of duties for organizations that centrally manage networking but delegate other functions and separation of billing. This is used more often in larger organizations where separate teams manage subsets of the cloud infrastructure. Enterprises that use this model would also like to create IPI clusters so that they can leverage the features of IPI. Currently, organizations that use Shared VPCs must use UPI and implement the features of IPI themselves. This is repetitive engineering of little value to the customer and an increased risk of drift from upstream IPI over time. As new features are built into IPI, organizations must become aware of those changes and implement them themselves instead of getting them "for free" during upgrades.

Scenarios

  1. Deploy cluster(s) into service project(s) on network(s) shared from a host project.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a developer, I want to be able to:

  • specify a project for the public and private DNS managedZones

so that I can achieve

  • enable DNS zones in alternate projects, such as the GCP XPN Host Project

Acceptance Criteria:

Description of criteria:

  • cluster-ingress-operator can parse the project and zone name from the following format
    • projects/project-id/managedZones/zoneid
  • cluster-ingress-operator continues to accept names that are not relative resource names
    • zoneid
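
The two accepted formats can be told apart with a single pattern match. The sketch below is only an illustration of that parsing rule, not the cluster-ingress-operator's code (the operator is written in Go); the function name `parse_managed_zone` and the `default_project` parameter are hypothetical.

```python
import re
from typing import Tuple

# Relative resource name: projects/<project-id>/managedZones/<zone-id>
_RELATIVE_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/managedZones/(?P<zone>[^/]+)$"
)


def parse_managed_zone(zone_id: str, default_project: str) -> Tuple[str, str]:
    """Return (project, zone) for either accepted format (hypothetical helper).

    A plain zone name keeps working and is resolved against default_project,
    which is assumed here to be the cluster's own project.
    """
    match = _RELATIVE_NAME.match(zone_id)
    if match:
        return match.group("project"), match.group("zone")
    return default_project, zone_id


print(parse_managed_zone("projects/project-id/managedZones/zoneid", "cluster-project"))
print(parse_managed_zone("zoneid", "cluster-project"))
```

Anything that does not match the relative resource name exactly is treated as a plain zone name, which preserves the existing behavior for current clusters.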

(optional) Out of Scope:

All modifications to the openshift-installer are handled in other cards in the epic.

Engineering Details:

Feature Overview

Allow users to interactively adjust the network configuration for a host after booting the agent ISO.

Goals

Configure network after host boots

The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.

Epic Goal

  • Allow users to interactively adjust the network configuration for a host after booting the agent ISO, before starting processes that pull container images.

Why is this important?

  • Configuring the network prior to booting a host is difficult and error-prone. Not only is the nmstate syntax fairly arcane, but the advent of 'predictable' interface names means that interfaces retain the same name across reboots but it is nearly impossible to predict what they will be. Applying configuration to the correct hosts requires correct knowledge and input of MAC addresses. All of these present opportunities for things to go wrong, and when they do the user is forced to return to the beginning of the process and generate a new ISO, then boot all of the hosts in the cluster with it again.

Scenarios

  1. The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.
  2. The user has Static IPs, VLANs, and/or bonds to configure, but makes an error entering the configuration in agent-config.yaml so that (at least) one host will not be able to pull container images from the release payload. They correct the configuration for that host via the text console before proceeding with the installation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

In the console service from AGENT-453, check whether we are able to pull the release image, and display this information to the user before prompting to run nmtui.

If we can access the image, then exit the service if there is no user input after some timeout, to allow the installation to proceed in the automation flow.

Enhance the openshift-install agent create image command so that the agent-nmtui executable will be embedded in the agent ISO

After the agent ISO has been created, the agent-nmtui must be added to the ISO using the following approach:
1. Unpack the agent ISO in a temporary folder
2. Unpack the /images/ignition.img compressed cpio archive in a temporary folder
3. Create a new ignition.img compressed cpio archive by appending the agent-nmtui
4. Create a new agent ISO with the updated ignition.img

Implementation note
Portions of code from a PoC located at https://github.com/andfasano/gasoline could be re-used

When running the openshift-install agent create image command, it first needs to extract the agent-tui executable from the release payload into a temporary folder

When the agent-tui is shown during the initial host boot, if the pull release image check fails then an additional checks box is shown along with a details text view.
The content of the details view is continuously updated with the details of the failed check, but the user cannot move the focus to the details box (using the arrow/tab keys), and thus cannot scroll its content (using the up/down arrow keys)

Create a systemd service that runs at startup prior to the login prompt and takes over the console. This should start after the network-online target, and block the login prompt from appearing until it exits.

This should also block, at least temporarily, any services that require pulling an image from the registry (i.e. agent + assisted-service).

Right now all the connectivity checks are executed simultaneously, which does not seem necessary, especially in the positive scenario, i.e. when the release image can be pulled without any issue.

So, the connectivity-related checks should be performed only when the release image is not accessible, to provide further information to the user.

The initial condition for allowing the installation to continue (or not) should then depend only on the result of the primary check (right now, just the pull image check) and not on the secondary ones (http/dns/ping), which are purely informative.

Note: this approach will also help to manage those cases where, currently, the release image can be pulled but the host doesn't answer the ping
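
The proposed ordering can be sketched as follows. This is a minimal illustration of the control flow only, not the agent-tui code; the function `evaluate_checks` and the stand-in check callables are hypothetical.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def evaluate_checks(primary: Check,
                    secondary: Dict[str, Check]) -> Tuple[bool, Dict[str, bool]]:
    """Return (can_continue, diagnostics) (hypothetical helper).

    The secondary checks run only when the primary check fails, and their
    results never influence can_continue; they are informational only.
    """
    if primary():
        return True, {}
    diagnostics = {name: check() for name, check in secondary.items()}
    return False, diagnostics


# Stand-in checks for illustration; real ones would pull the release image,
# resolve DNS, ping the registry host, and issue an HTTP request.
calls = []


def make_check(name: str, result: bool) -> Check:
    def check() -> bool:
        calls.append(name)
        return result
    return check


secondary = {
    "dns": make_check("dns", True),
    "ping": make_check("ping", False),
    "http": make_check("http", True),
}

print(evaluate_checks(make_check("pull", True), secondary))
print(evaluate_checks(make_check("pull", False), secondary))
```

In the first call only the pull check runs, so a host that does not answer pings no longer affects the outcome when the image is reachable.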

The openshift-install agent create image command will need to fetch the agent-tui executable so that it can be embedded within the agent ISO. For this reason the agent-tui must be available in the release payload, so that it can be retrieved even when the command is invoked in a disconnected environment.

Currently the agent-tui always displays the additional checks (nslookup/ping/http get), even when the primary check (pull image) passes. This may confuse the user, because the additional checks do not prevent the agent-tui from completing successfully; they are purely informative and exist to make troubleshooting easier (so they are not needed in the positive case).

The additional checks should then be shown only when the primary check fails for any reason.

As a user, I need information about common misconfigurations that may be preventing the automated installation from proceeding.

If we are unable to access the release image from the registry, provide sufficient debugging information to the user to pinpoint the problem. Check for:

  • DNS
  • ping
  • HTTP
  • Registry login
  • Release image

The node zero IP is currently hard-coded inside set-node-zero.sh.template and in the ServiceBaseURL template string.

ServiceBaseURL is also hard-coded inside:

  • apply-host-config.service.template
  • create-cluster-and-infraenv-service.template
  • common.sh.template
  • start-agent.sh.template
  • start-cluster-installation.sh.template
  • assisted-service.env.template

We need to remove this hard-coding and allow a user to set the node zero IP through the TUI, with the new value reflected by the agent services and scripts.
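
One way to picture the change is to derive every dependent value from a single node zero IP at render time. The sketch below is only an illustration under stated assumptions: the placeholder names `NodeZeroIP` and `ServiceBaseURL` inside the template text, the port `8090`, and the helper names are all hypothetical and are not taken from the installer's templates.

```python
import ipaddress
from string import Template


def service_base_url(node_zero_ip: str, port: int = 8090) -> str:
    """Build the service URL from the node zero IP (hypothetical helper).

    The port default is a placeholder. Invalid addresses raise ValueError.
    """
    ip = ipaddress.ip_address(node_zero_ip)
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"http://{host}:{port}/"


def render_template(text: str, node_zero_ip: str) -> str:
    """Fill in both values so that no template carries a hard-coded IP."""
    values = {
        "NodeZeroIP": node_zero_ip,
        "ServiceBaseURL": service_base_url(node_zero_ip),
    }
    # safe_substitute leaves unrelated "$VAR" shell variables untouched.
    return Template(text).safe_substitute(values)


template_text = (
    "NODE_ZERO_IP=${NodeZeroIP}\n"
    "SERVICE_BASE_URL=${ServiceBaseURL}\n"
    "echo $HOSTNAME\n"
)
print(render_template(template_text, "192.168.111.80"))
```

Re-rendering with the IP chosen in the TUI would then update every service and script from one source of truth.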


Epic Goal

  • The goal of this epic is to begin the process of expanding support of OpenShift on ppc64le hardware to include IPI deployments against the IBM Power Virtual Server (PowerVS) APIs.

Why is this important?

The goal of this initiative is to help boost adoption of OpenShift on ppc64le. This can be further broken down into several key objectives.

  • For IBM, furthering adoption of OpenShift will continue to drive adoption of their Power hardware. In parallel, this can be used by existing customers to migrate their old Power on-prem workloads to a cloud environment.
  • For the Multi-Arch team, this represents our first opportunity to develop an IPI offering on one of the IBM platforms. Right now, we depend on IPI on libvirt to cover our CI needs; however, this is not a supported platform for customers. PowerVS would address this caveat for ppc64le.
  • By bringing in PowerVS, we can provide customers with the easiest possible experience to deploy and test workloads on IBM architectures.
  • Customers already have UPI methods to solve their OpenShift on-prem needs for ppc64le. This gives them an opportunity for a cloud-based option, furthering our hybrid-cloud story.

Scenarios

  • As a user with a valid PowerVS account, I would like to provide those credentials to the OpenShift installer and get a full cluster up on IPI.

Technical Specifications

Some of the technical specifications have been laid out in MULTIARCH-75.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. Images are built in the RHCOS pipeline and pushed in the OVA format to the IBM Cloud.
  2. Installer is extended to support PowerVS as a new platform.
  3. Machine and cluster APIs are updated to support PowerVS.
  4. A terraform provider is developed against the PowerVS APIs.
  5. A load balancing strategy is determined and made available.
  6. Networking details are sorted out.

Open questions::

  1. Load balancing implementation?
  2. Networking strategy given the lack of virtual network APIs in PowerVS.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Recently, the image registry team decided with this change[1] that major cloud platforms cannot have `emptyDir` as the storage backend. IBM Cloud uses ibmcos, which we would ideally use as well. A few issues have been identified with using ibmcos as is in the cluster image registry operator, and some solutions are identified here[2]. Basically, we would need the PowerVS platform to be supported for ibmcos, plus an API-related change to add resourceGroup to the infra API. This only affects 4.13 and is not an issue for 4.12.

 

[1] https://github.com/openshift/cluster-image-registry-operator/pull/820

[2] https://coreos.slack.com/archives/CFFJUNP6C/p1672910113386879?thread_ts=1672762737.174679&cid=CFFJUNP6C

BU Priority Overview

As our customers create more and more clusters, it will become vital for us to help them support their fleet of clusters. Currently, our users have to use a different interface (ACM UI) in order to manage their fleet of clusters. Our goal is to provide our users with a single interface for everything from managing a fleet of clusters to deep diving into a single cluster. This means going to a single URL, your Hub, to interact with your OCP fleet.

Goals

The goal of this tech preview update is to improve the experience from the last round of tech preview. The following items will be improved:

  1. Improved Cluster Picker: Moved to Masthead for better usability, filter/search
  2. Support for Metrics: Metrics are now visualized from Spoke Clusters
  3. Avoid UI Mismatch: Dynamic Plugins from Spoke Clusters are disabled 
  4. Console URLs Enhanced: Cluster Name Added to URL for Quick Links
  5. Security Improvements: Backend Proxy and Auth updates

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

As a developer I want a GitHub PR template that allows me to provide:

  1. functionality explanation
  2. assignee
  3. screenshots or demo
  4. draft test cases

Key Objective
Providing our customers with a single simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of everything from managing the fleet to deep diving into a single cluster.
Why do customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why do we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 2 Goal: Productization of the united Console 

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Cluster” Option. “All Cluster” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM is installed, the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

Installed operators, operator details, operand details, and operand create pages should work as expected in a multicluster environment when copied CSVs are disabled on any cluster in the fleet.

AC:

  • Console backend consumes "copiedCSVsDisabled" flags for each cluster in the fleet
  • Frontend handles copiedCSVsDisabled behavior "per-cluster" and OLM pages work as expected no matter which cluster is selected

In order for hub cluster console OLM screens to behave as expected in a multicluster environment, we need to gather "copiedCSVsDisabled" flags from managed clusters so that the console backend/frontend can consume this information.

AC:

  • The console operator syncs "copiedCSVsDisabled" flags from managed clusters into the hub cluster managed cluster config.

Mock a multicluster environment in our CI using Cypress, without provisioning multiple clusters, by combining cy.intercept with updates to window.SERVER flags in the before section of the test scenarios.

Acceptance Criteria:
Without provisioning additional clusters:

  1. mock server flags to render a cluster dropdown
  2. mock sample pod data for a fictional cluster

Description of problem:

When viewing a resource that exists for multiple clusters, the data may be from the wrong cluster for a short time after switching clusters using the multicluster switcher.

Version-Release number of selected component (if applicable):

4.10.6

How reproducible:

Always

Steps to Reproduce:

1. Install RHACM 2.5 on OCP 4.10 and enable the FeatureGate to get multicluster switching
2. From the local-cluster perspective, view a resource that would exist on all clusters, like /k8s/cluster/config.openshift.io~v1~Infrastructure/cluster/yaml
3. Switch to a different cluster in the cluster switcher 

Actual results:

Content for resource may start out correct, but then switch back to the local-cluster version before switching to the correct cluster several moments later.

Expected results:

Content should always be shown from the selected cluster.

Additional info:

Migrated from bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2075657

Description of problem:

When multi-cluster is enabled and the console can display data from other clusters, we should either change or disable how we filter the OperatorHub catalog by arch / OS. We assume that the arch and OS of the pod running the console is the same as the cluster, but for managed clusters, it could be something else, which would cause us to incorrectly filter operators.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Migrated from https://bugzilla.redhat.com/show_bug.cgi?id=2089939

 

Description of problem:

There is a possible race condition in the console operator where the managed cluster config gets updated after the console deployment and doesn't trigger a rollout. 

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Rarely

Steps to Reproduce:

1. Enable multicluster tech preview by adding TechPreviewNoUpgrade featureSet to FeatureGate config. (NOTE THIS ACTION IS IRREVERSIBLE AND WILL MAKE THE CLUSTER UNUPGRADEABLE AND UNSUPPORTED) 
2. Install ACM 2.5+
3. Import a managed cluster using either the ACM console or the CLI
4. Once that managed cluster is showing in the cluster dropdown, import a second managed cluster 

Actual results:

Sometimes the second managed cluster will never show up in the cluster dropdown

Expected results:

The second managed cluster eventually shows up in the cluster dropdown after a page refresh

Additional info:

Migrated from bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2055415

As a Dynamic Plugin developer I would like to render the version of my dynamic plugin in the About modal. For that we need to check the `LoadedDynamicPluginInfo` instances and read the `metadata.name` and `metadata.version` values that we need to surface in the About modal.
 
AC: Render name and version for each Dynamic Plugin into the About modal.
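
The acceptance criterion amounts to collecting a name and version pair per loaded plugin. The sketch below is language-neutral pseudologic written in Python; the console itself is not written in Python, and the dictionary shape, the function name `about_modal_rows` and the sample plugin entries are hypothetical stand-ins for `LoadedDynamicPluginInfo` instances.

```python
def about_modal_rows(plugin_infos):
    """Return sorted (name, version) pairs for plugins that expose both fields."""
    rows = []
    for info in plugin_infos:
        metadata = info.get("metadata", {})
        name = metadata.get("name")
        version = metadata.get("version")
        # Skip entries that lack either field rather than rendering blanks.
        if name and version:
            rows.append((name, version))
    return sorted(rows)


# Hypothetical sample data shaped like loaded plugin info entries.
plugins = [
    {"metadata": {"name": "plugin-b", "version": "2.2.0"}},
    {"metadata": {"name": "plugin-a", "version": "2.7.0"}},
    {"metadata": {"name": "no-version"}},
]
for name, version in about_modal_rows(plugins):
    print(f"{name}: {version}")
```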

 

Original description: When ACM moved to the unified console experience, we lost the ability in our standalone console to display our version information in our own About modal.  We would like to be able to add our product and version information into the OCP About modal.

Feature Overview

Allow configuring compute and control plane nodes across multiple subnets for on-premise IPI deployments. With nodes separated into subnets, also allow using an external load balancer instead of the built-in one (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.

Goals

I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and compute nodes across multiple subnets.

I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that comes with the on-prem platforms.

Background, and strategic fit

Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.

Customers want the benefits of IPI and automated installations (avoiding UPI). At the same time, when they expect high traffic in their workloads, they design their clusters with external load balancers that hold the VIPs of the OpenShift clusters.

Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.

While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.

Functionalities per Epic

 

Epic Control Plane with Multiple Subnets  Compute with Multiple Subnets Doesn't need external LB Built-in LB
NE-1069 (all-platforms)
NE-905 (all-platforms)
NE-1086 (vSphere)
NE-1087 (Bare Metal)
OSASINFRA-2999 (OSP)  
SPLAT-860 (vSphere)
NE-905 (all platforms)
OPNET-133 (vSphere/Bare Metal for AI/ZTP)
OSASINFRA-2087 (OSP)
KNIDEPLOY-4421 (Bare Metal workaround)
SPLAT-409 (vSphere)

Previous Work

Workers on separate subnets with IPI documentation

We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow

This procedure works on vSphere too, although it has no QE CI coverage and is not documented.

External load balancer with IPI documentation

  1. Bare Metal: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-post-installation-configuration.html#nw-osp-configuring-external-load-balancer_ipi-install-post-installation-configuration
  2. vSphere: https://docs.openshift.com/container-platform/4.11/installing/installing_vsphere/installing-vsphere-installer-provisioned.html#nw-osp-configuring-external-load-balancer_installing-vsphere-installer-provisioned

Scenarios

  1. vSphere: I can define 3 or more networks in vSphere and distribute my masters and workers across them. I can configure an external load balancer for the VIPs.
  2. Bare metal: I can configure the IPI installer and the agent-based installer to place my control plane nodes and compute nodes on 3 or more subnets at installation time. I can configure an external load balancer for the VIPs.
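
To make the "3 or more subnets" scenarios concrete, the sketch below checks how a set of node addresses is spread across subnets. It is only an illustration: the node names, addresses and subnets are hypothetical, and the helper is not part of the installer.

```python
import ipaddress


def group_by_subnet(node_ips, subnets):
    """Map each subnet (CIDR string) to the names of the nodes inside it."""
    networks = [ipaddress.ip_network(cidr) for cidr in subnets]
    placement = {str(net): [] for net in networks}
    for name, ip in node_ips.items():
        address = ipaddress.ip_address(ip)
        for net in networks:
            if address in net:
                placement[str(net)].append(name)
                break
        else:
            # No listed subnet contains this node address.
            raise ValueError(f"{name} ({ip}) is not in any listed subnet")
    return placement


# Hypothetical control plane and compute node addresses.
nodes = {
    "master-0": "192.168.10.5",
    "master-1": "192.168.20.5",
    "worker-0": "192.168.30.7",
}
subnets = ["192.168.10.0/24", "192.168.20.0/24", "192.168.30.0/24"]
print(group_by_subnet(nodes, subnets))
```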

Acceptance Criteria

  • Can place compute nodes on multiple subnets with IPI installations
  • Can place control plane nodes on multiple subnets with IPI installations
  • Can configure external load balancers for clusters deployed with IPI with control plane and compute nodes on multiple subnets
  • Can configure VIPs in an external load balancer routed to nodes on separate subnets and VLANs
  • Documentation exists for all the above cases

 

Epic Goal

As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers

Why is this important?

Customers want to use their own load balancers, and IPI comes with built-in LBs based on keepalived and haproxy.

Scenarios

  1. A large deployment routed across multiple failure domains without stretched L2 networks would require dynamically routing the control plane VIP traffic through load balancers capable of living in multiple L2 networks.
  2. Customers who want to use their existing LB appliances for the control plane.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • QE - must be testing a scenario where we disable the internal LB and set up an external LB, and the OCP deployment is running fine.
  • Documentation - we need to document all the gotchas regarding this type of deployment, even the specifics about the load-balancer itself (routing policy, dynamic routing, etc)
  • For Tech Preview, we won't require Fixed IPs. This is something targeted for 4.14.

Dependencies (internal and external)

  1. For GA, we'll need Fixed IPs, already WIP by vsphere: https://issues.redhat.com/browse/OCPBU-179

Previous Work:

vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

As an OpenShift installation admin I want to use the Assisted Installer, ZTP and IPI installation workflows to deploy a cluster that has remote worker nodes in subnets different from the local subnet, while keeping my VIPs with the built-in load balancing services (haproxy/keepalived).

While this request is most common with OpenShift on bare metal, any platform using the ingress operator will benefit from this enhancement.

Customers using platform none run external load balancers and won't need this; it is specific to platforms deployed via AI, ZTP and IPI.

Why is this important?

Customers and partners want to install remote worker nodes on day1. Due to the built-in network services we provide with Assisted Installer, ZTP and IPI that manage the VIP for ingress, we need to ensure that they remain in the local subnet where the VIPs are configured.

Previous Work

The bare metal IPI team added a workflow that allows placing the VIPs on the masters. While this isn't an ideal solution, this is the only option documented:

Configuring network components to run on the control plane

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Review the OVN Interconnect proposal, figure out the work that needs to be done in ovn-kubernetes to be able to move to this new OVN architecture. 

Why is this important?

OVN IC will be the model used in Hypershift. 

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.26
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository

Description of problem:

The Azure cloud controller manager is currently on kubernetes 1.25 dependencies and should be updated to 1.26

To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.13 release, dependencies need to be updated to 1.26. This should be done by rebasing/updating as appropriate for the repository

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that WMCO uses to v1.26, which will keep it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • Uncover any possible issues with the openshift/kubernetes rebase before it merges.
  • WMCO continues using the latest kubernetes/OpenShift libraries and the kubelet, kube-proxy components.
  • WMCO e2e CI jobs pass on each of the supported platforms with the updated components.

Acceptance Criteria

  • All stories in this epic must be completed.
  • Go version is upgraded for WMCO and WMCB components.
  • sdn-4.13 branch is created by the SDN team for using latest kube-proxy component.
  • CI is running successfully with the upgraded components against the 4.13/master branch.
  • Windows nodes must use the same kubelet version as the linux nodes in the cluster.

Dependencies (internal and external)

  1. ART team creating the go 1.20 image for upgrade to go 1.20.
  2. OpenShift/kubernetes repository downstream rebase PR merge.
  3. SDN team for creating the new sdn-4.13 branch.

Open questions::

  1. Do we need a checklist for future upgrades as an outcome of this epic?-> yes, updated below.

Done Checklist

  • Step 1 - Upgrade go version to match rest of the OpenShift and Kubernetes upgraded components.
  • Step 2 - Upgrade Kubernetes client and controller-runtime dependencies (can be done in parallel with step 3)
  • Step 3 - Upgrade OpenShift client and API dependencies
  • Step 4 - Update kubelet and kube-proxy submodules in WMCO repository
  • Step 5 - Engage SDN team to create new branch for kube-proxy submodule (can be done in parallel with above steps)
  • Step 6 - CI is running successfully with the upgraded components and libraries against the master branch.

User or Developer story

As a WMCO developer, I want the kube-proxy submodule to be pointing to sdn-4.13-kubernetes-1.26.0 on the openshift/kubernetes repository so we can pick up the latest kube rebase updates.

Engineering Details

  • Update the submodules using hack/submodule.sh script

Acceptance Criteria

  • WMCO submodule for kube-proxy should pick up the latest updates for the 1.26 rebase.
  • Replace deprecated klog flags like --log-dir and --logtostderr in the kube-proxy service command with kube-log-runner options.
  • Update must-gather log collection script
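
The flag replacement can be sketched as a rewrite of the service command's argument list. This is an illustration only, not the WMCO implementation: it assumes that kube-log-runner's `-log-file` option takes over the role of klog's `--log-dir`, and the kube-proxy arguments and log path shown are hypothetical.

```python
# Deprecated klog flags named in the acceptance criteria.
DEPRECATED_FLAGS = {"--log-dir", "--logtostderr"}


def wrap_with_log_runner(command, log_file, runner="kube-log-runner"):
    """Drop deprecated klog flags and prefix the command with kube-log-runner."""
    kept = []
    skip_value = False
    for arg in command:
        if skip_value:
            # Value belonging to a preceding "--log-dir <path>" argument.
            skip_value = False
            continue
        name, has_equals, _ = arg.partition("=")
        if name in DEPRECATED_FLAGS:
            # "--log-dir <path>" carries its value in the next argument.
            skip_value = name == "--log-dir" and not has_equals
            continue
        kept.append(arg)
    # Assumption: -log-file is the kube-log-runner option replacing --log-dir.
    return [runner, f"-log-file={log_file}"] + kept


# Hypothetical kube-proxy service command and log path.
old = ["kube-proxy", "--log-dir=/var/log/kube-proxy", "--logtostderr=false", "--v=4"]
print(wrap_with_log_runner(old, "/var/log/kube-proxy/kube-proxy.log"))
```

Because the log location moves from a directory to a single file, a log collection script that globs the old directory would need a matching update.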

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

This includes ibm-vpc-node-label-updater!

(Using separate cards for each driver because these updates can be more complicated)

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • aws-ebs-csi-driver-operator 
  • aws-efs-csi-driver-operator
  • azure-disk-csi-driver-operator
  • azure-file-csi-driver-operator
  • cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • manila-csi-driver-operator
  • ovirt-csi-driver-operator
  • vmware-vsphere-csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator

 

  • cluster-storage-operator
  • csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

 

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

  • external-attacher
  • external-provisioner
  • external-resizer
  • external-snapshotter
  • node-driver-registrar
  • livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets and run `go get -u github.com/kubernetes-csi/external-snapshotter/client/v6` in the operator repo.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Feature Overview

Run must-gather in its own namespace.

Goals

Expose --run-namespace option

Requirements

A user can explicitly specify a namespace where a must-gather pod can run. E.g. to avoid adding security constraints to a temporarily created namespace.
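
As a rough illustration of how the exposed option would be used, the sketch below assembles an `oc adm must-gather` invocation with `--run-namespace`. The helper name, the namespace and the destination directory are hypothetical; the command is only built and printed here, not executed.

```python
import shlex


def must_gather_command(namespace=None, dest_dir=None):
    """Build the argument list for an `oc adm must-gather` run."""
    argv = ["oc", "adm", "must-gather"]
    if namespace:
        # Run the must-gather pod in an existing namespace chosen by the user.
        argv.append(f"--run-namespace={namespace}")
    if dest_dir:
        argv.append(f"--dest-dir={dest_dir}")
    return argv


# Hypothetical namespace and output directory.
print(shlex.join(must_gather_command("my-must-gather", "./must-gather-out")))
```

The namespace is assumed to exist already, which matches the requirement of avoiding a temporarily created one.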

We have an RFE where a customer is asking for an option to run must-gather in its own namespace.

It looks like we already have the option, but it is hidden. The customer's request is to unhide the option:

https://github.com/openshift/oc/blob/master/pkg/cli/admin/mustgather/mustgather.go#L158-L160 

 

https://github.com/openshift/oc/pull/1080 

Feature Overview

  • Azure is sunsetting the Azure Active Directory Graph API in June 2022. The OpenShift installer and the in-cluster cloud-credential-operator (CCO) make use of this API. The replacement API is the Microsoft Graph API. Microsoft has not committed to providing a production-ready Golang SDK for the new Microsoft Graph API before June 2022.

Goals

  • Replace the existing AD Graph API for Azure we use for the Installer and Cluster components with the new Microsoft Authentication Library and Microsoft Graph API

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

This description is based on the Google Doc by Rafael Fonseca dos Santos : https://docs.google.com/document/d/1yQt8sbknSmF_hriHyMAKPiztSoRIvntSX9i1wtObSYs

 

Microsoft is deprecating two APIs. The AD Graph API is used by the Installer destroy code and by the CCO to mint credentials. ADAL is also going EOL; it is used by the installer and all cluster components that authenticate to Azure:

Azure Active Directory Authentication Library (ADAL) Retirement

ADAL end-of-life is December 31, 2022. While ADAL apps may continue to work, no support or security fixes will be provided past end-of-life. In addition, there are no planned ADAL releases planned prior to end-of-life for features or planned support for new platform versions. We recommend prioritizing migration to Microsoft Authentication Library (MSAL). 

Azure AD Graph API  

Azure AD Graph will continue to function until June 30, 2023. This will be three years after the initial deprecation announcement (https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/update-your-applications-to-use-microsoft-authentication-library/ba-p/1257363). Based on Azure deprecation guidelines (https://docs.microsoft.com/en-us/lifecycle/), we reserve the right to retire Azure AD Graph at any time after June 30, 2023, without advance notice. Though we reserve the right to turn it off after June 30, 2023, we want to ensure all customers migrate off and discourage applications from taking production dependencies on Azure AD Graph. Investments in new features and functionalities will only be made in Microsoft Graph (https://docs.microsoft.com/en-us/graph/overview). Going forward, we will continue to support Azure AD Graph with security-related fixes. We recommend prioritizing migration to Microsoft Graph.

https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/microsoft-entra-change-announcements-september-2022-train/ba-p/2967454

https://learn.microsoft.com/en-us/answers/questions/768833/when-is-adal-and-azure-ad-graph-reaching-end-of-li.html

Takeaways / considerations

  • The new Microsoft Authentication Library (MSAL) that we will migrate to requires a new API permission: Graph API ReadWrite.OwnedBy (relevant slack thread: https://coreos.slack.com/archives/C68TNFWA2/p1644009342019649?thread_ts=1644008944.461989&cid=C68TNFWA2). The old ReadWrite.OwnedBy API permissions could be removed to test as well.
  • Mint mode was discontinued in Azure, but clusters may exist that have cluster-created service principals from before the retirement. In that case, the service principals will either need to be deleted manually or with a newer version of the installer that has support for MSAL.
  • Migration to the new API (see Migration Guide below) entails using the azidentity package. The azidentity package is intended for use with V2 versions of the Azure SDK for Go; an adapter is required if the SDK packages have not been upgraded to V2, which is the case for our codebase. Only recently have V2 packages become stable. See references below.
  • Furthermore, azidentity is tied to Go 1.18, which affects our ability to backport prior to 4.11 or earlier versions.
  • Another consideration for backporting is that ADAL is used by the in-tree Azure cloud provider. These legacy cloud providers are generally closed for development, so an upstream patch seems unlikely, as does carrying a patch.
  • A path forward for the Azure cloud provider must be determined. Due to the legacy cloud providers freeze mentioned prior to this, it seems that the best path forward is for the out-of-tree provider and CCM, scheduled for 4.14: OCPCLOUD-1128, but even the upstream out-of-tree provider has not migrated yet: https://github.com/kubernetes-sigs/cloud-provider-azure/issues/430
  • AD FS (Active Directory Federation Services) are not yet supported in the Azure SDK for Go: https://github.com/AzureAD/microsoft-authentication-library-for-go/issues/31. There is a very limited user base for AD FS, but exactly how many users is unknown at this moment. Switching to the new API would break these users, so the best approach known at this moment would be to advise this extremely limited number of users to maintain the last supported version of OpenShift that uses ADAL until Microsoft introduces AD FS support. We do not document support for AD FS.

 

References:

Feature Overview
Upstream Kubernetes is following other SIGs by moving its in-tree cloud providers to an out-of-tree plugin format at some point in a future Kubernetes release. OpenShift needs to be ready to act on this change.

Goals

  • Common plugin framework to aid development of out of tree cloud providers
  • Out of tree providers for AWS, Azure, GCP, vSphere, etc
  • Possible certification process for 3rd Party out of tree cloud providers

Requirements

Requirement | Notes | isMvp?
Plugin framework | | Yes
AWS out of tree provider | | Yes
Other Cloud provider plugins | | No

Out of Scope

n/a

Background, and strategic fit

Assumptions

Customer Considerations

Documentation Considerations

  • Target audience: cluster admins
  • Updated content: update docs to clearly show how to install and use the new providers.

Epic Goal

  • Implement an out of tree cloud provider for VMware

Why is this important?

  • The community is moving to out of tree cloud providers; we need to get ahead of this trend so we are ready when the switchover occurs for this functionality

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

To make the CCM GA, we need to update the switch case in library go to make sure the vSphere CCM is always considered external.

We then need to update the vendor in KCMO, CCMO, KASO and MCO.

Steps

  • Create a PR for updating library go
  • Create PRs for updating the vendor in dependent repos
  • Leverage an engineer with merge rights (e.g. David Eads) to merge the library go, KCMO and CCMO changes simultaneously
  • Merge KASO and MCO changes

Stakeholders

  • Cluster Infra
  • SPLAT

Definition of Done

  • vSphere CCM is enabled by default
  • Docs
    • N/A
  • Testing
    • <Explain testing that will be added>

Background

The vSphere CCM has a new YAML based cloud config format. We should build a config transformer into CCMO to load the old ini file, drop any storage/unrelated entries, and convert the existing schema to the new YAML schema, before storing it within the CCM namespace.

This will allow us to use the new features from the YAML config and avoid the old, deprecated ini format.

Steps

  • Sync up with SPLAT and make sure this is the right way to go.
  • Make sure not to introduce a dependency on the vSphere provider itself
  • Evaluate existing configuration and new configuration and plan transformation.
  • Implement transformer to transform ini to yaml
  • Ensure old storage configuration is dropped
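
The transformation steps above can be sketched as follows. This is a minimal illustration rather than the CCMO transformer: the section names to drop, the ini keys read and the output field names (`secretName`, `secretNamespace`, `insecureFlag`, `datacenters`) are assumptions about the two schemas, and the sample config is hypothetical. The result is printed as JSON, which is also valid YAML, to keep the sketch free of third-party dependencies.

```python
import configparser
import json

# Assumption: these sections only carry storage settings and are dropped.
STORAGE_SECTIONS = {"Workspace", "Disk"}


def unquote(value):
    """Remove surrounding whitespace and double quotes from an ini value."""
    return value.strip().strip('"')


def transform(ini_text):
    """Convert an ini cloud config into a dict shaped like the YAML schema."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    out = {"global": {}, "vcenter": {}}
    for section in parser.sections():
        if section in STORAGE_SECTIONS:
            continue
        values = {key: unquote(val) for key, val in parser[section].items()}
        if section == "Global":
            out["global"] = {
                "secretName": values.get("secret-name", ""),
                "secretNamespace": values.get("secret-namespace", ""),
                "insecureFlag": values.get("insecure-flag", "0") in ("1", "true"),
            }
        elif section.startswith("VirtualCenter"):
            # Section header looks like: VirtualCenter "vcenter.example.com"
            server = unquote(section[len("VirtualCenter"):])
            datacenters = values.get("datacenters", "")
            out["vcenter"][server] = {
                "server": server,
                "datacenters": [d.strip() for d in datacenters.split(",") if d.strip()],
            }
    return out


# Hypothetical legacy config.
SAMPLE = """
[Global]
secret-name = "vsphere-creds"
secret-namespace = "kube-system"
insecure-flag = "1"

[Workspace]
server = "vcenter.example.com"
default-datastore = "datastore1"

[VirtualCenter "vcenter.example.com"]
datacenters = "DC1, DC2"
"""
print(json.dumps(transform(SAMPLE), indent=2))
```

In this sketch, input that is already YAML does not parse as ini and `configparser` raises an error; how the real transformer should treat that case is the open testing question listed under Definition of Done.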

Stakeholders

  • Cluster Infra
  • SPLAT

Definition of Done

  • Configuration for the vSphere CCM in the cloud controller manager namespace is in the new YAML format
  • Docs
    • N/A
  • Testing
    • Make sure to test the conversion
    • What happens if the existing config is YAML not ini

Feature Overview

  • As an infrastructure owner, I want a repeatable method to quickly deploy the initial OpenShift cluster.
  • As an infrastructure owner, I want to install the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters.

Goals

  • Enable customers and partners to successfully deploy a single “first” cluster in disconnected, on-premises settings

Requirements

4.11 MVP Requirements

  • Customers and partners need to be able to download the installer
  • Enable customers and partners to deploy a single “first” cluster (cluster 0) using single node, compact, or highly available topologies in disconnected, on-premises settings
  • Installer must support advanced network settings such as static IP assignments, VLANs and NIC bonding for on-premises metal use cases, as well as DHCP and PXE provisioning environments.
  • Installer needs to support automation, including integration with third-party deployment tools, as well as user-driven deployments.
  • In the MVP automation has higher priority than interactive, user-driven deployments.
  • For bare metal deployments, we cannot assume that users will provide us the credentials to manage hosts via their BMCs.
  • Installer should prioritize support for platforms None, baremetal, and VMware.
  • The installer will focus on a single version of OpenShift, and a different build artifact will be produced for each different version.
  • The installer must not depend on a connected registry; however, the installer can optionally use a previously mirrored registry within the disconnected environment.

Use Cases

  • As a Telco partner engineer (Site Engineer, Specialist, Field Engineer), I want to deploy an OpenShift cluster in production with limited or no additional hardware and don’t intend to deploy more OpenShift clusters [Isolated edge experience].
  • As an Enterprise infrastructure owner, I want to manage the lifecycle of multiple clusters in 1 or more sites by first installing the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters [Cluster before your cluster].
  • As a Partner, I want to package OpenShift for large scale and/or distributed topology with my own software and/or hardware solution.
  • As a large enterprise customer or Service Provider, I want to install a “HyperShift Tugboat” OpenShift cluster in order to offer a hosted OpenShift control plane at scale to my consumers (DevOps Engineers, tenants) that allows for fleet-level provisioning for low CAPEX and OPEX, much like AKS or GKE [Hypershift].
  • As a new, novice to intermediate user (Enterprise Admin/Consumer, Telco Partner integrator, RH Solution Architect), I want to quickly deploy a small OpenShift cluster for PoC/Demo/Research purposes.

Questions to answer…

  •  

Out of Scope

Out of scope use cases (that are part of the Kubeframe/factory project):

  • As a Partner (OEMs, ISVs), I want to install and pre-configure OpenShift with my hardware/software in my disconnected factory, while allowing further (minimal) reconfiguration of a subset of capabilities later at a different site by different set of users (end customer) [Embedded OpenShift].
  • As an Infrastructure Admin at an Enterprise customer with multiple remote sites, I want to pre-provision OpenShift centrally prior to shipping and activating the clusters in remote sites.

Background, and strategic fit

  • This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  1. The user only has access to the target nodes that will form the cluster and will boot them with the image presented locally via a USB stick. This scenario is common in sites with restricted access, such as government infrastructure, where only users with security clearance can interact with the installation and where software is allowed to enter the premises (on a USB stick, DVD, SD card, etc.) but is never allowed to come back out. Users can't bring in supporting devices such as laptops or phones.
  2. The user has remote access to the BMCs of the target nodes (e.g. iDRAC, iLO) and can map an image as virtual media from their computer. This scenario is common in data centers where the customer provides network access to the BMCs of the target nodes.
  3. We cannot assume that we will have access to a computer to run an installer or installer helper software.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

References

 

 

Epic Goal

Why is this important?

  • The Agent Based Installer is a new install path targeting fully disconnected installs. We should be looking at adding support for ARM in all install paths to ensure our customers can deploy to disconnected environments.
  • We want to start having new projects/products launch with support for ARM by default.

Scenarios
1. …

Acceptance Criteria

  • The Agent Installer launches with aarch64 support
  • The Agent installer has QE completed & CI for aarch64

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. https://issues.redhat.com/browse/ARMOCP-346

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

As an OCP administrator, I would like to deploy OCP on arm64 BM with the agent installer

Acceptance Criteria

Dev:

  • Ensure openshift-installer creates an arm64 agent.iso
  • Ensure openshift-installer creates the correct ignition config and supporting files for assisted-api
  • Ensure assisted-api can install 

Jira Admin

  • Additional Jira tickets created (if needed)

QE

  • Understand if QE is needed for agent installer (as this Epic is currently a TP)

Docs:

  • Understand if ARM documentation needs to be updated (as there is currently no x86 documentation)

Agent Installer

  • Investigate if Heterogeneous clusters are feasible for Agent Installer

1. Proposed title of this feature request:

Update ETCD datastore encryption to use AES-GCM instead of AES-CBC

2. What is the nature and description of the request?

The current ETCD datastore encryption solution uses the aes-cbc cipher. This cipher is now considered "weak" and is susceptible to padding oracle attacks. Upstream recommends using the AES-GCM cipher. AES-GCM will require automation to rotate secrets every 200k writes.

The cipher used is hard coded. 

3. Why is this needed? (List the business requirements here).

Security conscious customers will not accept the presence and use of weak ciphers in an OpenShift cluster. Continuing to use the AES-CBC cipher will create friction in sales and, for existing customers, may result in OpenShift being blocked from being deployed in production. 

4. List any affected packages or components.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

The Kube APIserver is used to set the encryption of data stored in etcd. See https://docs.openshift.com/container-platform/4.11/security/encrypting-etcd.html

 

Today with OpenShift 4.11 or earlier, only aescbc is allowed as the encryption field type. 

 

RFE-3095 is asking that aesgcm (which is an updated and more recent type) be supported. Furthermore, RFE-3338 is asking for more customizability, which brings us to how we have implemented cipher customization with tlsSecurityProfile. See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html
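
As a rough illustration of the request: the encryption type is set under `spec.encryption.type` of the cluster `APIServer` config described in the encrypting-etcd documentation linked above. The sketch below only builds the merge-patch body for that field. The helper name `encryption_patch` and the `ALLOWED_TYPES` set are assumptions made for illustration, not code from any operator.

```python
import json

# Assumption: the types accepted once aesgcm support is available.
# Per the text above, 4.11 and earlier allow only "aescbc" as the encryption type.
ALLOWED_TYPES = {"identity", "aescbc", "aesgcm"}


def encryption_patch(encryption_type: str) -> str:
    """Return a JSON merge patch that sets spec.encryption.type (hypothetical helper)."""
    if encryption_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported encryption type: {encryption_type!r}")
    return json.dumps({"spec": {"encryption": {"type": encryption_type}}})


print(encryption_patch("aesgcm"))
```

A body like this could be applied with something along the lines of `oc patch apiserver cluster --type=merge -p '<body>'`; whether the operators accept `aesgcm` depends on the release.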

 

 
Why is this important? (mandatory)

AES-CBC is considered a weak cipher.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

AES-GCM encryption was enabled in cluster-openshift-apiserver-operator and cluster-openshift-authentication-operator, but not in the cluster-kube-apiserver-operator. When trying to enable aesgcm encryption in the apiserver config, the kas-operator will produce an error saying that the aesgcm provider is not supported.

Feature Overview

  • Extend OpenShift on IBM Cloud integration with additional features to pair the capabilities offered for this provider integration to the ones available in other cloud platforms

Goals

  • Extend the existing features while deploying OpenShift on IBM Cloud

Background, and strategic fit

This top level feature is going to be used as a placeholder for the IBM team who is working on new features for this integration in an effort to keep in sync their existing internal backlog with the corresponding Features/Epics in Red Hat's Jira.

 

Epic Goal

With this BYON support:

  • shared resources (VPC, subnets) can be placed in the resource group specified by the `networkResourceGroupName` install config parameter.
  • installer provisioned cluster resources will be placed in the resource group specified by the `resourceGroupName` install config parameter.

 

  • `networkResourceGroupName` is a required parameter for the BYON scenario
  • `resourceGroupName` is an optional parameter

Why is this important?

  • This will allow customers (using IBM Cloud VPC BYON support) to organize pre-created / shared resources (VPC, subnets) in a resource group separate from installer provisioned cluster resources.

Scenarios

`networkResourceGroupName` NOT specified ==> non-BYON install scenario

  • if `resourceGroupName` is specified, then ALL installer provisioned resources (VPC, subnets, cluster) will be placed in specified resource group (resource group must exist)
  • if `resourceGroupName` is NOT specified, then ALL installer provisioned resources (VPC, subnets, cluster) will be placed in a resource group created during the install process

`networkResourceGroupName` specified ==> BYON install scenario (required for BYON scenario)

  • `networkResourceGroupName` must contain pre-created/shared resources (VPC, subnets)
  • if `resourceGroupName` is specified, then all installer provisioned cluster resources will be placed in specified resource group (resource group must exist)
  • if `resourceGroupName` is NOT specified, then all installer provisioned cluster resources will be placed in a resource group created during the install process
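
The four scenarios above reduce to a small decision table. The following is a minimal sketch of that logic, not the installer's code: `resolve_placement`, `Placement`, and the `CREATED_BY_INSTALLER` placeholder are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder standing in for "a resource group created during the install process".
CREATED_BY_INSTALLER = "<created-by-installer>"


@dataclass(frozen=True)
class Placement:
    byon: bool           # True when pre-created VPC/subnets are used
    network_group: str   # resource group holding the VPC and subnets
    cluster_group: str   # resource group for installer-provisioned cluster resources


def resolve_placement(network_rg: Optional[str], rg: Optional[str]) -> Placement:
    """Map the two install-config parameters to resource group placement."""
    cluster_group = rg if rg else CREATED_BY_INSTALLER
    if network_rg:
        # BYON: shared network resources stay in their own, pre-existing group.
        return Placement(True, network_rg, cluster_group)
    # Non-BYON: VPC, subnets and cluster resources all land in the same group.
    return Placement(False, cluster_group, cluster_group)


print(resolve_placement("shared-net-rg", None))
```

Note that the sketch does not verify that a named resource group exists, which the scenarios above require.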

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add support for a NetworkResourceGroup in the MachineProviderSpec, and the logic for performing lookups during machine creation for IBM Cloud.

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Epic Goal

  • Make machine phases public so they can be used by controller packages.

Why is this important?

  • A recent IBM Cloud VPC provisioning fix in the MAPI code manipulated the machine phases. The MAO phase constants were duplicated in the MAPI code since we wanted to minimize the blast radius of the provisioning fix. Making the phases public in MAO would be a cleaner approach and allow MAPI to use them directly (and not duplicate them).

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Open questions::

  1. ?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.

 

See slack working group: #wg-ctrl-plane-resize

Epic Goal

  • To add an E2E suite of presubmit and periodic tests for the ControlPlaneMachineSet project
  • To improve the integration tests within the ControlPlaneMachineSet repository to cover cases we aren't testing in E2E

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

Even if a machine does not need an update, the CPMS should replace it when it's been marked for deletion.

Steps

  • Create an integration test that:
    • Creates a CPMS and 3 control plane machines
    • Add finalizers to the machines
    • Ensure the CPMS status is as expected
    • Delete one of the machines
    • Check the CPMS creates a replacement

Stakeholders

  • Cluster Infra

Definition of Done

  • Integration test runs in the CPMS controllers package
  • Docs
  • N/A
  • Testing
  • N/A

Background

We test that we can replace worker machines behind a proxy configuration, but we do not test control plane machines.

It would be good to check that the control plane replacement is not going to be disrupted by the proxy

Motivation

We want to make sure that the latency added by having a proxy between resources does not affect replacing control plane machines

Steps

  • Create a test that:
    • Checks the cluster operators are all stable/waits for them to stabilise
    • Creates a cluster wide proxy
    • Checks the cluster operators are all stable/waits for them to stabilise
    • Modifies master-0's spec to cause it to be updated
    • Checks that the CPMS creates a new instance
    • Checks naming of the new machine
    • Checks the old machine isn't marked for deletion while the new Machine's phase is not Running
    • Waits until the replacement is complete, ie CPMS status reports replicas == updatedReplicas
    • Waits until cluster operators stabilise again
    • Remove cluster wide proxy
    • Wait until cluster operators stabilise again
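
Two of the checks in the steps above can be written as small predicates over the objects' JSON form. This is only a sketch on plain dictionaries: the function names are hypothetical, and a real E2E test would read these objects from the cluster.

```python
def old_machine_safe(old_machine: dict, new_machine: dict) -> bool:
    """The old machine must not be marked for deletion until the new one is Running."""
    new_running = new_machine.get("status", {}).get("phase") == "Running"
    old_deleting = old_machine.get("metadata", {}).get("deletionTimestamp") is not None
    return new_running or not old_deleting


def rollout_complete(cpms: dict) -> bool:
    """The replacement is complete when status.replicas == status.updatedReplicas."""
    status = cpms.get("status", {})
    replicas = status.get("replicas")
    return replicas is not None and replicas == status.get("updatedReplicas")


old = {"metadata": {"name": "master-0"}}
new = {"metadata": {"name": "master-replacement-0"}, "status": {"phase": "Provisioning"}}
print(old_machine_safe(old, new))
```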

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

For clusters where this is appropriate, check that a new CPMS is generated and that it is as expected, i.e. replicas == updatedReplicas and no errors reported

This will need to be tied to OCPCLOUD-1741 which removes the CPMS, we can run them in an ordered container together

Steps

  • Create a test that:
    • Checks the CPMS status of the newly created CPMS is as expected

Stakeholders

  • Cluster Infra

Definition of Done

  • Docs
  • N/A
  • Testing
  • N/A

Background

Remove e2e common test suite from the control plane machine set E2Es, and adapt presubmit and periodic test setups and teardowns to account for this change.

For more context see this [GitHub conversation](https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/147#discussion_r1035912969)

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To validate the deletion process of the CPMS, we need to create a test that deletes the CPMS and checks that the CPMS eventually goes away (it may come back with a different UID), and that when it goes away, there are no owner references on the control plane machines, there are still 3 control plane machines, and the cluster operators are all stable

Motivation

This is already tested with an integration test, but we should also check this in E2E as it is cheap (assuming it works, no machine changes) and may pick up weird interactions with other components.

E.g. in integration tests we have no GC running.

Steps

  • Create a test that:
    • Checks that/waits until all clusteroperators are stable
    • Checks the ControlPlaneMachineSet is as expected
    • Deletes the ControlPlaneMachineSet
    • Waits for the CPMS to be removed/ for the UID to change
    • Checks that (if present) the new CPMS is inactive
    • Checks that all control plane machines are still running
    • Checks that all control plane machines have no owner references
    • Checks that all control plane machines do not have a deletion timestamp
    • Checks that all clusteroperators are stable
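
The machine-level checks in these steps could be collected into one helper that reports every violation at once. A minimal sketch, assuming machines are available as dictionaries in their JSON form; `check_machines_after_cpms_delete` is a hypothetical name.

```python
from typing import Dict, List


def check_machines_after_cpms_delete(machines: List[Dict], expected: int = 3) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if len(machines) != expected:
        problems.append(f"expected {expected} control plane machines, found {len(machines)}")
    for machine in machines:
        meta = machine.get("metadata", {})
        name = meta.get("name", "<unnamed>")
        if machine.get("status", {}).get("phase") != "Running":
            problems.append(f"{name}: not running")
        if meta.get("ownerReferences"):
            problems.append(f"{name}: still has owner references")
        if meta.get("deletionTimestamp"):
            problems.append(f"{name}: has a deletion timestamp")
    return problems


healthy = [{"metadata": {"name": f"master-{i}"}, "status": {"phase": "Running"}} for i in range(3)]
print(check_machines_after_cpms_delete(healthy))
```

Returning all problems rather than failing on the first one tends to make a flaky E2E run easier to diagnose.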

Stakeholders

  • Cluster Infra

Definition of Done

  • Test is merged and running as a presubmit and periodic test
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

When Inactive, the CPMS should be updated by the generator controller where applicable.

We should test that we can update the spec of the newest machine, observe the CPMS get updated, and then set it back, and observe the update again.

Steps

  • Create a test that:
    • Checks the CPMS is inactive
    • Modifies the newest, alphabetically last instance to trigger the CPMS to be regenerated
    • Check that the CPMS is regenerated
    • Check that the CPMS reports 1 updated machine (the other two will need update)
    • Reset the Machine's spec to original
    • Check that the CPMS is regenerated
    • Check that the CPMS reports replicas == updatedReplicas

Stakeholders

  • Cluster Infra

Definition of Done

  • Docs
  • N/A
  • Testing
  • N/A

Background

We expect that a generated CPMS should be able to be activated without causing a rollout, that is, the replicas should be equal to the updated replicas.

This should run after OCPCLOUD-1742 in an ordered container

Steps

  • Create a test that:
    • Checks the CPMS status is as expected
    • Activates the CPMS
    • Checks that no new machines are created
    • Checks that all cluster operators are stable

Stakeholders

  • Cluster Infra

Definition of Done

  • This is running in an ordered container with the OCPCLOUD-1742 test
  • Docs
  • N/A
  • Testing
  • N/A

Background

We expect the CPMS, once active, to own the master machines.

We can check this in tandem with the other activation test OCPCLOUD-1746

Steps

  • Create a test that:
    • Checks, when activated, that the Control Plane Machines get owner references
    • Check that the Machines do not get garbage collected
    • Check all cluster operators are stable

Stakeholders

  • Cluster Infra

Definition of Done

  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

We want to make sure that if new failure domains are added, the CPMS rebalances the machines.

This should be automatic with the RollingUpdate strategy.

Steps

  • Create an integration test that:
    • Creates an Inactive CPMS and 3 machines across 2 failure domains
    • Check the CPMS thinks one machine needs an update
    • Activate the CPMS
    • Check the CPMS creates a new Machine and deletes an old one
    • The CPMS should now report all machines are up to date.
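
To see why exactly one machine needs an update in this scenario, it helps to count how many machines would have to move for an even spread. The function below is a toy model of that arithmetic and is not the operator's balancing algorithm; `machines_needing_move` and the domain names are hypothetical.

```python
from collections import Counter
from typing import List


def machines_needing_move(placements: List[str], failure_domains: List[str]) -> int:
    """Minimum number of machines to relocate for an even spread across domains."""
    counts = Counter(placements)
    # Machines in domains that are no longer configured always have to move.
    moves = sum(c for domain, c in counts.items() if domain not in failure_domains)
    base, extra = divmod(len(placements), len(failure_domains))
    # Give the "+1" slots to the currently fullest domains to minimise moves.
    ordered = sorted(failure_domains, key=lambda d: counts[d], reverse=True)
    for index, domain in enumerate(ordered):
        capacity = base + (1 if index < extra else 0)
        moves += max(0, counts[domain] - capacity)
    return moves


# 3 machines across 2 failure domains, then a third domain is added.
print(machines_needing_move(["zone-a", "zone-a", "zone-b"], ["zone-a", "zone-b", "zone-c"]))
```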

Stakeholders

  • Cluster Infra

Definition of Done

  • Integration test is running in CPMS controllers package
  • Docs
  • N/A
  • Testing
  • N/A

Epic Goal

  • Add tech-preview support to install OpenShift on OpenStack with multiple failure domains.

Why is this important?

  • Multiple (large) customers are requesting this configuration and installation type to provide a higher level of high availability.
  • This could be a blocker for https://issues.redhat.com/browse/OSASINFRA-2999 even though technically that feature should be possible without this

Scenarios

  1. Spread the control plane across 3 domains (each domain has a defined storage / network / compute configuration)
  2. Indirectly we'll inherit from the features proposed by https://issues.redhat.com/browse/OCPCLOUD-1372
    1. Automatically add extra node(s) to the control plane
    2. Remove node(s) from the control plane
    3. Recover from a lost node incident

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. https://issues.redhat.com/browse/OCPCLOUD-1372 to be finished and delivered in 4.12

Previous Work (Optional):

  1. https://issues.redhat.com/browse/OSASINFRA-2997

Open questions::

none for now.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Enhancement - https://github.com/openshift/enhancements/pull/1167
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

This is for backporting the feature to 4.13 past Feature freeze, with the approval of Program management.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

1. Proposed title of this feature request
BYOK encrypts root vols AND default storageclass

2. What is the nature and description of the request?
User story
As a customer spinning up managed OpenShift clusters, if I pass a custom AWS KMS key to the installer, I expect it (installer and cluster-storage-operator) to not only encrypt the root volumes for the nodes in the cluster, but also be applied to encrypt the first/default (gp2 in current case) StorageClass, so that my assumptions around passing a custom key are met.
In current state, if I pass a KMS key to the installer, only root volumes are encrypted with it, and the default AWS managed key is used for the default StorageClass.
Perhaps this could be offered as a flag to set in the installer to further pass the key to the storage class, or not.

3. Why does the customer need this? (List the business requirements here)
Customers who pass their own key expect their volumes to be encrypted with it; today those volumes can end up encrypted with the AWS default account key by accident.

4. List any affected packages or components.

  • uncertain.

Note: this implementation should take effect on AWS, GCP and Azure (any cloud provider) equally.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story:

As a cluster admin, I want OCP to provision new volumes with my custom encryption key that I specified during cluster installation in install-config.yaml so all OCP assets (PVs, VMs & their root disks) use the same encryption key.

Acceptance Criteria:

Description of criteria:

  • Check that dynamically provisioned PVs use the key specified in install-config.yaml
  • Check that the key can be changed in TBD API and all volumes newly provisioned after the key change use the new key. (Exact API is not defined yet, probably a new field in `Infrastructure`, calling it TBD API now).

(optional) Out of Scope:

Re-encryption of existing PVs with a new key. Only newly provisioned PVs will use the new key.

Engineering Details:

Enhancement (incl. TBD API with encryption key reference) will be provided as part of https://issues.redhat.com/browse/CORS-2080.

"Raw meat" of this story is translation of the key reference in TBD API to StorageClass.Parameters. Azure Disk CSI driver operator should update the StorageClass it manages (managed-csi) with:

Parameters:
    diskEncryptionSetID: /subscriptions/<subs-id>/resourceGroups/<rg-name>/providers/Microsoft.Compute/diskEncryptionSets/<diskEncryptionSet-name>

Upstream docs: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/docs/driver-parameters.md (CreateVolume parameters == StorageClass.Parameters)
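
A minimal sketch of the translation for Azure, assuming the key reference arrives as three separate fields. Since the API is still "TBD" in this story, the function names and argument shape here are purely hypothetical; only the `diskEncryptionSetID` parameter and its path layout come from the text above.

```python
from typing import Dict


def disk_encryption_set_id(subscription_id: str, resource_group: str, name: str) -> str:
    """Build the Azure resource ID of a disk encryption set."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/diskEncryptionSets/{name}"
    )


def azure_storage_class_parameters(subscription_id: str, resource_group: str, name: str) -> Dict[str, str]:
    """Parameters to merge into the managed-csi StorageClass (hypothetical helper)."""
    return {"diskEncryptionSetID": disk_encryption_set_id(subscription_id, resource_group, name)}


print(azure_storage_class_parameters("<subs-id>", "<rg-name>", "<diskEncryptionSet-name>"))
```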

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story:

As a cluster admin, I want OCP to provision new volumes with my custom encryption key that I specified during cluster installation in install-config.yaml so all OCP assets (PVs, VMs & their root disks) use the same encryption key.

Acceptance Criteria:

Description of criteria:

  • Check that dynamically provisioned PVs use the key specified in install-config.yaml
  • Check that the key can be changed in TBD API and all volumes newly provisioned after the key change use the new key. (Exact API is not defined yet, probably a new field in `Infrastructure`, calling it TBD API now).

(optional) Out of Scope:

Re-encryption of existing PVs with a new key. Only newly provisioned PVs will use the new key.

Engineering Details:

Enhancement (incl. TBD API with encryption key reference) will be provided as part of https://issues.redhat.com/browse/CORS-2080.

"Raw meat" of this story is translation of the key reference in TBD API to StorageClass.Parameters. AWS EBS CSI driver operator should update both the StorageClass it manages (managed-csi) with:

Parameters:
    encrypted: "true"
    kmsKeyId: "arn:aws:kms:us-east-1:012345678910:key/abcd1234-a123-456a-a12b-a123b4cd56ef"

Upstream docs: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/parameters.md 
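
For AWS the translation is a direct copy of the key ARN plus the `encrypted` flag. The sketch below adds a light sanity check on the ARN shape; `aws_storage_class_parameters` is a hypothetical helper, and the check is an assumption of this example rather than something the story specifies.

```python
from typing import Dict


def aws_storage_class_parameters(kms_key_arn: str) -> Dict[str, str]:
    """Return StorageClass parameters for a customer-managed KMS key."""
    parts = kms_key_arn.split(":")
    # An ARN looks like arn:<partition>:<service>:<region>:<account>:<resource>.
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "kms":
        raise ValueError(f"not a KMS key ARN: {kms_key_arn!r}")
    # StorageClass parameters are strings, hence "true" rather than a boolean.
    return {"encrypted": "true", "kmsKeyId": kms_key_arn}


arn = "arn:aws:kms:us-east-1:012345678910:key/abcd1234-a123-456a-a12b-a123b4cd56ef"
print(aws_storage_class_parameters(arn))
```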

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story:

As a cluster admin, I want OCP to provision new volumes with my custom encryption key that I specified during cluster installation in install-config.yaml so all OCP assets (PVs, VMs & their root disks) use the same encryption key.

Acceptance Criteria:

Description of criteria:

  • Check that dynamically provisioned PVs use the key specified in install-config.yaml
  • Check that the key can be changed in TBD API and all volumes newly provisioned after the key change use the new key. (Exact API is not defined yet, probably a new field in `Infrastructure`, calling it TBD API now).

(optional) Out of Scope:

Re-encryption of existing PVs with a new key. Only newly provisioned PVs will use the new key.

Engineering Details:

Enhancement (incl. TBD API with encryption key reference) will be provided as part of https://issues.redhat.com/browse/CORS-2080.

"Raw meat" of this story is translation of the key reference in TBD API to StorageClass.Parameters. GCP PD CSI driver operator should update both StorageClasses that it manages (standard-csi, standard-ssd) with:

Parameters:
    disk-encryption-kms-key: projects/<KEY_PROJECT_ID>/locations/<LOCATION>/keyRings/<RING_NAME>/cryptoKeys/<KEY_NAME>

Upstream docs: https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#createvolume-parameters (CreateVolume parameters == StorageClass.Parameters)
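
On GCP the same key has to reach both managed StorageClasses without disturbing their other parameters. A minimal sketch under that reading: `gcp_kms_key_path` and `apply_key` are hypothetical names, and the existing parameters are modelled as plain dictionaries.

```python
from typing import Dict

# The two StorageClasses named in the story above.
MANAGED_CLASSES = ("standard-csi", "standard-ssd")


def gcp_kms_key_path(project: str, location: str, ring: str, key: str) -> str:
    """Build the Cloud KMS key resource path used by disk-encryption-kms-key."""
    return f"projects/{project}/locations/{location}/keyRings/{ring}/cryptoKeys/{key}"


def apply_key(existing: Dict[str, Dict[str, str]], key_path: str) -> Dict[str, Dict[str, str]]:
    """Return updated parameters for every managed class, keeping other entries."""
    updated = {}
    for name in MANAGED_CLASSES:
        params = dict(existing.get(name, {}))  # copy, so the input is not mutated
        params["disk-encryption-kms-key"] = key_path
        updated[name] = params
    return updated


path = gcp_kms_key_path("<KEY_PROJECT_ID>", "<LOCATION>", "<RING_NAME>", "<KEY_NAME>")
print(apply_key({"standard-ssd": {"type": "pd-ssd"}}, path))
```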

Epic Goal

  • Improve the default configuration the installer uses when the control-plane is single node

Why is this important?

  • Starting 4.13 we're going to officially support (OCPBU-95) SNO on AWS, so our installer defaults need to make sense

Scenarios

  1. User performs an AWS IPI installation with the number of control plane node replicas equal to 1. The installer will default the instance type to be bigger than it usually would, to align with the larger single-node OpenShift control plane requirements

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

  • Starting with version 4.13 OCP is going to officially support Single
    Node clusters on AWS.
  • The minimum documented OCP requirement for single-node control plane
    nodes is 8 cores and 16GiB of RAM
  • The current default instance type chosen for AWS clusters by the
    installer is `xlarge` which is 4 cores and 16GiB of RAM

Issue

The default instance type the installer currently chooses for Single
Node OpenShift clusters doesn't follow our documented minimum
requirements

Solution

When the number of replicas of the ControlPlane pool is 1, the installer
will now choose `2xlarge` instead of `xlarge`.

Caveat

`2xlarge` has 32GiB of RAM, which is twice as much as we need, but it's
the best we can do to meet the minimum single-node requirements, because
AWS doesn't offer a 16GiB RAM instance type with 8 cores.
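
The rule and its caveat can be checked with a few lines. This is a sketch of the selection logic as described above, not the installer's code. The size figures are the ones quoted in this card, with 8 cores for `2xlarge` inferred from the requirement it is meant to satisfy, and the instance family prefix is deliberately left out.

```python
from typing import NamedTuple


class Size(NamedTuple):
    cores: int
    ram_gib: int


# Figures as quoted in the Background and Caveat sections above.
SIZES = {"xlarge": Size(4, 16), "2xlarge": Size(8, 32)}
SNO_MINIMUM = Size(8, 16)


def default_size(control_plane_replicas: int) -> str:
    """Pick the default instance size for the control plane pool."""
    return "2xlarge" if control_plane_replicas == 1 else "xlarge"


def meets_sno_minimum(size_name: str) -> bool:
    """Check a size against the documented single-node minimum."""
    size = SIZES[size_name]
    return size.cores >= SNO_MINIMUM.cores and size.ram_gib >= SNO_MINIMUM.ram_gib


for replicas in (1, 3):
    print(replicas, default_size(replicas))
```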

 

Feature Overview (aka. Goal Summary)  

Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.

Goals (aka. expected user outcomes)

To enable full support for control plane machine sets on GCP

 

Requirements (aka. Acceptance Criteria):

  • Generate CPMS for upgraded clusters
  • Document support for upgraded clusters
  • Ensure E2E testing for GCP clusters

Out of Scope

Any other cloud platforms

Background

Feature created from split of overarching Control Plane Machine Set feature into single release based effort

 

Customer Considerations

n/a

 

Documentation Considerations

Nothing outside documentation that shows the Azure platform is supported as part of Control Plane Machine Sets

 

Interoperability Considerations

n/a

Goal:

Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem:

There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important:

  • Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.

Lifecycle Information:

  • Core

Previous Work:

Dependencies:

  • Etcd operator

Prioritized epics + deliverables (in scope / not in scope):

Estimate (XS, S, M, L, XL, XXL):

 

 

 

User Story:

As a developer, I want to be able to:

  • Create Azure control plane nodes using MachineSets.

so that I can achieve

  • More control over the nodes using the MachineAPI Operator.

Acceptance Criteria:

Description of criteria:

  • New CRD ControlPlaneMachineSet is used and populated.
  • New manifest is created for the ControlPlaneMachineSet.
  • Fields required for the CRD are set.

(optional) Out of Scope:

 

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

Epic Goal

  • Enable the migration from a storage intree driver to a CSI based driver with minimal impact to the end user, applications and cluster
  • These migrations would include, but are not limited to:
    • CSI driver for Azure (file and disk)
    • CSI driver for VMware vSphere

Why is this important?

  • OpenShift needs to maintain its ability to enable PVCs and PVs of the main storage types
  • CSI Migration is getting close to GA, we need to have the feature fully tested and enabled in OpenShift
  • Upstream intree drivers are being deprecated to make way for the CSI drivers prior to intree driver removal

Scenarios

  1. User-initiated move from intree to CSI driver
  2. Upgrade initiated move from intree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Kubernetes upstream has chosen to allow users to opt-out from CSI volume migration in Kubernetes 1.26 (1.27 PR, 1.26 backport). It is still GA there, but allows opt-out due to non-trivial risk with late CSI driver availability.

We want a similar capability in OCP - a cluster admin should be able to opt-in to CSI migration on vSphere in 4.13. Once they opt-in, they can't opt-out (at least in this epic).

Why is this important? (mandatory)

See an internal OCP doc if / how we should allow a similar opt-in/opt-out in OCP.

 
Scenarios (mandatory) 

Upgrade

  1. Admin upgrades 4.12 -> 4.13 as usual.
  2. Storage CR has CSI migration disabled (or nil); the in-tree volume plugin handles in-tree PVs.
  3. At the same time, external CCM runs; however, because kubelet runs with --cloud-provider=vsphere, it does not do kubelet’s job.
  4. Admin can opt in to CSI migration by editing the Storage CR. That enables the OPENSHIFT_DO_VSPHERE_MIGRATION env. var. everywhere and runs kubelet with --cloud-provider=external.
    1. If we have time, it should not be hard to opt out: just remove the env. var and update the kubelet command line. Storage / in-tree volume plugin will handle in-tree PVs again; not sure about implications on external CCM.
  5. Once opted in, it’s not possible to opt out.
  6. Both with opt-in and without it, the cluster is Upgradeable=true. Admin can upgrade to 4.14; CSI migration will be forced there.

 

New install

  1. Admin installs a new 4.13 vSphere cluster, with UPI, IPI, Assisted Installer, or Agent-based Installer.
  2. During installation, the Storage CR is created with CSI migration enabled.
  3. (We want to have it enabled for a new cluster to enable external CCM and have zonal. This prevents new clusters from having in-tree as the default and then having to go through migration later.)
  4. The resulting cluster has the OPENSHIFT_DO_VSPHERE_MIGRATION env. var set + kubelet with --cloud-provider=external + topology support.
  5. Admin cannot opt out after installation; we expect that they use CSI volumes for everything.
  6. If the admin really wants, they can opt out before installation by adding a Storage install manifest with CSI migration disabled.
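
The upgrade and new-install scenarios above reduce to one mapping from the Storage CR migration setting to the node configuration. The sketch below is only an illustration of that mapping. The class and function names are hypothetical; the environment variable and the `--cloud-provider` values are the ones named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VSphereNodeConfig:
    migration_env_set: bool      # OPENSHIFT_DO_VSPHERE_MIGRATION is set
    kubelet_cloud_provider: str  # value passed to kubelet --cloud-provider


def default_migration_enabled(new_install: bool) -> bool:
    """4.13 default: enabled on new installs, disabled (or nil) after upgrade."""
    return new_install


def node_config(migration_enabled: bool) -> VSphereNodeConfig:
    """Map the Storage CR migration setting to the resulting configuration."""
    provider = "external" if migration_enabled else "vsphere"
    return VSphereNodeConfig(migration_enabled, provider)


if __name__ == "__main__":
    print("upgrade:    ", node_config(default_migration_enabled(False)))
    print("new install:", node_config(default_migration_enabled(True)))
```

An upgraded cluster whose admin opts in ends up with the same configuration as a new install, which is why a single `node_config` function covers both paths.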

 

EUS to EUS (4.12 -> 4.14)

  • Will have CSI migration enabled once in 4.14
  • During the upgrade, a cluster will have 4.13 masters with CSI migration disabled (see regular upgrade to 4.13 above) + 4.12 kubelets.
  • Once the masters are 4.14, CSI migration is force-enabled there, still, 4.14 KCM + in-tree volume plugin in it will handle in-tree volume attachments required by kubelets that still have 4.12 (that’s what kcm --external-cloud-volume-plugin=vsphere does).
  • Once both masters + kubelets are 4.14, CSI migration is force enabled everywhere, in-tree volume plugin + cloud provider in KCM is still enabled by --external-cloud-volume-plugin, but it’s not used.
  • Keep in-tree storage class by default
  • A CSI storage class is already available since 4.10
  • Recommend to switch default to CSI
  • Can’t opt out from migration

Dependencies (internal and external) (mandatory)

  • We need a new FeatureSet in openshift/api that disables the CSIMigrationvSphere feature gate.
  • kube-apiserver-operator, kube-controller-manager-operator, kube-scheduler-operator, and MCO must reconfigure their operands to use the in-tree vSphere cloud provider when they see the CSIMigrationvSphere FeatureGate disabled.
  • We need cloud controller manager operator to disable its operand when it sees CSIMigrationvSphere FeatureGate disabled.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We need to audit/review and implement all pending feature gates that are implemented in upstream CSI driver - https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/master/manifests/vanilla/vsphere-csi-driver.yaml#L151

Some of these feature gates, although necessary, could break the driver, so we have to be careful.

On new installations, we should make the StorageClass created by the CSI operator the default one. 

However, we shouldn't do that in an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver Storage Class.

Exit criteria:

  • New clusters get the CSI Storage Class as the default one.
  • Existing clusters don't get their default Storage Classes changed.
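
The exit criteria can be sketched as a small decision function. This is an illustration under stated assumptions, not the operator's code: the function name and the sample StorageClass name are hypothetical, while `storageclass.kubernetes.io/is-default-class` is the standard Kubernetes annotation that marks a default StorageClass.

```python
import copy

DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def apply_default_policy(storage_class: dict, new_install: bool) -> dict:
    """Return a copy of the StorageClass, marked default only on new installs."""
    result = copy.deepcopy(storage_class)
    if new_install:
        annotations = result.setdefault("metadata", {}).setdefault("annotations", {})
        annotations[DEFAULT_ANNOTATION] = "true"
    # On upgrade the object is returned unchanged, so whatever default the
    # existing cluster already has stays as it is.
    return result


if __name__ == "__main__":
    sc = {"kind": "StorageClass", "metadata": {"name": "example-csi"}}
    print(apply_default_policy(sc, new_install=True))
    print(apply_default_policy(sc, new_install=False))
```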

Feature Overview

RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.

 

Requirements

  • RHEL 9.x sources for RHCOS builds starting with OCP 4.13 and RHEL 9.2.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

  • 9.2 Preview via Layering: no longer necessary, assuming we stay the course of going all in on 9.2

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • The kernel API was updated for RHEL 9, so the old approach of setting the `sched_domain` in `/sys/kernel` is no longer available. Instead, cgroups have to be worked with directly.
  • Both CRI-O and PAO need to be updated to set the cpuset of containers and other processes correctly, as well as to set the correct value for sched_load_balance.

Why is this important?

  • CPU load balancing is a vital piece of real-time execution for processes that need exclusive access to a CPU. Without this, CPU load balancing won't work on RHEL 9 with OpenShift 4.13

Scenarios

  1. As a developer on OpenShift, I expect my pods to run with exclusive CPUs if I set the PAO configuration correctly

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Part of setting CPU load balancing on RHEL 9 involves disabling sched_load_balance on cgroups that contain a cpuset that should be exclusive. The PAO may need to be responsible for this piece.
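
Disabling load balancing for one cgroup amounts to writing `0` to its control file. The sketch below assumes a cgroup v1 cpuset hierarchy, where that file is named `cpuset.sched_load_balance`. The function name is hypothetical, and a temporary directory stands in for a real cgroup so the example runs without root.

```python
import tempfile
from pathlib import Path


def disable_load_balancing(cpuset_cgroup: Path) -> Path:
    """Write 0 to cpuset.sched_load_balance in the given cgroup directory."""
    control_file = cpuset_cgroup / "cpuset.sched_load_balance"
    control_file.write_text("0\n")
    return control_file


if __name__ == "__main__":
    # Stand-in directory; on a node this would be a directory inside the
    # mounted cpuset hierarchy that holds the exclusive cpuset.
    with tempfile.TemporaryDirectory() as tmp:
        path = disable_load_balancing(Path(tmp))
        print(path.name, "=", path.read_text().strip())
```

Which component performs this write (CRI-O or the PAO) is the open point the text raises; the sketch shows only the write itself.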

This is the Epic to track the work to add RHCOS 9 in OCP 4.13 and to make OCP use it by default.

 

CURRENT STATUS: Landed in 4.14 and 4.13

 

Testing with layering

 

Another option, given an existing cluster (e.g. 4.12), is to use layering. First, get a digested pull spec for the current build:

$ skopeo inspect --format "{{.Name}}@{{.Digest}}" -n docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev:4.13-9.2
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099

Create a MachineConfig that looks like this:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  osImageURL: <digested pull spec>

If you want to also override the control plane, create a similar one for the master role.
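
Since the worker and master overrides differ only in the role, the manifest can be generated per role. This is a minimal sketch: the function name is hypothetical, the `<role>-override` naming simply follows the `worker-override` example above, and the output is JSON, which can be applied the same way as YAML.

```python
import json


def os_override_machine_config(role: str, pull_spec: str) -> dict:
    """Build a MachineConfig that overrides osImageURL for one role."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "labels": {"machineconfiguration.openshift.io/role": role},
            "name": f"{role}-override",
        },
        "spec": {"osImageURL": pull_spec},
    }


if __name__ == "__main__":
    # The placeholder is left as-is; substitute the digested pull spec
    # returned by the skopeo command.
    for role in ("worker", "master"):
        print(json.dumps(os_override_machine_config(role, "<digested pull spec>"), indent=2))
```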
 
We don't yet have auto-generated release images. However, if you want one, you can ask cluster bot to e.g. "launch https://github.com/openshift/machine-config-operator/pull/3485" with options you want (e.g. "azure" etc.) or just "build https://github.com/openshift/machine-config-operator/pull/3485" to get a release image.

Description:

Upstream OKD/FCOS are already using the latest Ignition, which supports [1] writing authorized keys in /home/core/.ssh/authorized_keys.d/ignition. With RHCOS 9, we should also start using the new default path /home/core/.ssh/authorized_keys.d/ignition instead of /home/core/.ssh/authorized_keys.

[1]https://github.com/openshift/machine-config-operator/pull/2688

Acceptance Criteria:

  • The ssh key gets written into /home/core/.ssh/authorized_keys.d/ignition on RHCOS 9 nodes, and the /home/core/.ssh/authorized_keys file doesn't exist
  • Upgrading a node from RHCOS 8 to RHCOS 9 works as expected, and all ssh keys from /home/core/.ssh/authorized_keys get migrated to /home/core/.ssh/authorized_keys.d/ignition
  • The MCO e2e test has to be adapted accordingly, as today it looks for the ssh key in /home/core/.ssh/authorized_keys
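
The migration in the second criterion can be pictured as moving the key file's contents to the new path and removing the old file. The sketch below is an illustration only, not the MCO implementation: the function name is hypothetical, and it runs against a temporary directory rather than /home/core/.ssh.

```python
import tempfile
from pathlib import Path


def migrate_authorized_keys(ssh_dir: Path) -> Path:
    """Move keys from authorized_keys to authorized_keys.d/ignition."""
    old = ssh_dir / "authorized_keys"
    new = ssh_dir / "authorized_keys.d" / "ignition"
    new.parent.mkdir(parents=True, exist_ok=True)
    if old.exists():
        # Simplification: any existing content at the new path is overwritten.
        new.write_text(old.read_text())
        old.unlink()
    return new


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ssh_dir = Path(tmp)
        (ssh_dir / "authorized_keys").write_text("ssh-ed25519 AAAA example\n")
        target = migrate_authorized_keys(ssh_dir)
        print(target.relative_to(ssh_dir), (ssh_dir / "authorized_keys").exists())
```

After the call, both acceptance conditions hold in the stand-in directory: the keys are at the new path and the old file no longer exists.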

Epic Goal

  • Users who disable ssh access in favor of `oc debug` are reliant on the OpenShift API being up between the supervisors and worker nodes. In order to troubleshoot or RCA a node problem, these users would like to be able to use password auth on /dev/console, which they can access via BMC or local keyboard.

Why is this important?

  • While setting passwords hasn't been cool in some time, it can make sense if password auth is disabled in sshd (which it is by default).
  • There is a workaround: push an /etc/shadow.

Scenarios

  1. A new node is failing to join the cluster and ssh/API access is not possible, but a local console is available (via cloud provider or bare metal BMC). The administrator would like to pull logs to triage the joining problem.
  2. sshd is not enabled and the API connection to the kubelet is down (so no `oc debug node`) and the administrator needs to triage the problem and/or collect logs.

Acceptance Criteria

  • Users can set and change a password on "core" via ignition (machineconfig).
  • Changing the core user password should not cause workload disruption
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
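
A MachineConfig that sets the "core" password through Ignition might look like the sketch below. It assumes the Ignition v3 `passwd.users[].passwordHash` field and Ignition config version 3.2.0. The function name and the object name are hypothetical, and the hash must be generated separately, so a placeholder is used here.

```python
import json


def core_password_machine_config(role: str, password_hash: str) -> dict:
    """Build a MachineConfig carrying a password hash for the core user."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "labels": {"machineconfiguration.openshift.io/role": role},
            "name": f"set-core-user-password-{role}",  # hypothetical name
        },
        "spec": {
            "config": {
                "ignition": {"version": "3.2.0"},  # assumed spec version
                "passwd": {
                    "users": [{"name": "core", "passwordHash": password_hash}]
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(core_password_machine_config("worker", "<password hash>"), indent=2))
```

Only a hash is stored in the config; the plain-text password never appears in the MachineConfig.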

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

Feature Overview

  • Kubernetes offers different ways to consume storage: one can request persistent volumes that survive pod termination, or ask for ephemeral storage space that is consumed during the lifetime of the pod.
  • This feature tracks the improvements around ephemeral storage, as some workloads rely on reliable temporary storage space, such as batch jobs, caching services, or any app that does not care whether the data is stored persistently across restarts.

Goals

 

As described in the Kubernetes "ephemeral volumes" documentation, this feature tracks GA and improvements in

OCPPLAN-9193 implemented local ephemeral capacity management as well as CSI generic ephemeral volumes. This feature tracks the remaining work to GA CSI ephemeral inline volumes, especially the admission plugin that makes the feature secure and prevents any insecure driver from using it. Ephemeral inline volumes are required by some CSI drivers as a key feature to operate (e.g. SecretStore CSI); ODF is also planning to GA ephemeral inline volumes with Ceph CSI.

Requirements

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

Use Cases

This Section:

  • As an OCP user I want to consume ephemeral storage for my workload
  • As an OCP user I would like to include my PV definition directly in my app definition
  • As an OCP admin I would like to offer ephemeral volumes to my users through CSI
  • As a partner I would like to onboard a driver that relies on CSI inline volumes

Customer Considerations

  • Make sure each ephemeral volume option is clearly identified and documented for each purpose.
  • Make sure we highlight ephemeral volume options that require a specific driver support

Goal: 

The goal is to provide inline volume support (also known as ephemeral volumes) via a CSI driver/operator. This epic also tracks the development of the new admission plugin required to make inline volumes safe.

 

Problem: 

  • The only practical way to extend pods such that node local integrations can happen is with inline volumes. So if we want to integrate with IAM for per pod credentials, we need inline csi volumes. If we want to do better build cache integration, we need inline csi. 

 

Why is this important: 

  • (from https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html) Traditionally, volumes that are backed by CSI drivers can only be used with a PersistentVolume and PersistentVolumeClaim object combination. This feature will support ephemeral storage use cases and allows CSI volumes to be specified directly in the pod specification. At runtime, nested inline volumes follow the ephemeral lifecycle of their associated pods where the driver handles all phases of volume operations as pods are created and destroyed.
  • Vault integration can be implemented via in-line volumes (see https://github.com/deislabs/secrets-store-csi-driver/blob/master/README.md).
  • Inline volumes would allow us to give out tokens for cloud integration and nuke cloud credential operator’s use of secrets.
  • In OpenShift we already have Shared Resource CSI driver, which uses in-line CSI volumes to distribute cluster-wide secrets and/or config maps.
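
To make "CSI volumes specified directly in the pod specification" concrete, the sketch below builds a Pod whose volume uses the `csi` volume source instead of a PersistentVolumeClaim. The function name, driver name, image, and attribute values are placeholders chosen for illustration; `driver`, `readOnly`, and `volumeAttributes` are fields of the Kubernetes `csi` volume source.

```python
import json


def pod_with_inline_csi_volume(name: str, image: str, driver: str, attributes: dict) -> dict:
    """Build a Pod manifest that mounts an inline (ephemeral) CSI volume."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "app",
                "image": image,
                "volumeMounts": [{"name": "inline", "mountPath": "/data"}],
            }],
            # No PV or PVC object: the volume is declared inline and lives
            # only as long as the pod does.
            "volumes": [{
                "name": "inline",
                "csi": {
                    "driver": driver,
                    "readOnly": True,
                    "volumeAttributes": attributes,
                },
            }],
        },
    }


if __name__ == "__main__":
    pod = pod_with_inline_csi_volume(
        "demo", "registry.example.com/app:latest",
        "inline.csi.example.com", {"key": "value"},
    )
    print(json.dumps(pod, indent=2))
```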

 

Dependencies (internal and external):

  • CSI API

 

Prioritized epics + deliverables (in scope / not in scope):

  • In Scope
    • A working CSI based inline volume
    • Documentation
    • Admission plugin
  • Not in Scope
    • Implementing the use cases for inline volumes (i.e. integration with IAM)

Estimate (XS, S, M, L, XL, XXL):

 

Previous Work:

Customers:

Open questions:

 

Notes:

 

This flag is currently TechPreviewNoUpgrade:
https://github.com/dobsonj/api/blob/95216a844c16019d4e3aaf396492c95d19bf22c0/config/v1/types_feature.go#L122

Once the admission plugin has had sufficient testing and e2e tests are in place, this can be promoted to GA and the feature gate can eventually be removed.

As an OCP user, I want to be able to use in-line CSI volumes in my Pods, so my apps work.

 

Since we will have an admission plugin to filter out dangerous CSI drivers from restricted namespaces, all users should be able to use CSI volumes in all SCCs.

Exit criteria:

  • An unprivileged user + namespace can use an in-line CSI volume of a "safe" CSI driver (e.g. SharedResource CSI driver) without any changes in the "restricted-v2" SCC.
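
One way to picture the admission check is as a comparison between how much trust a driver requires and how permissive the namespace is. The sketch below is a simplified model, not the plugin's actual logic: the function name is hypothetical, and the three level names are borrowed from the Kubernetes Pod Security Standards as an assumption.

```python
# Ordered from least to most permissive (assumed levels, see lead-in).
LEVELS = ["restricted", "baseline", "privileged"]


def inline_volume_allowed(driver_profile: str, namespace_level: str) -> bool:
    """Allow the volume only if the namespace is at least as permissive
    as the profile the CSI driver requires."""
    return LEVELS.index(namespace_level) >= LEVELS.index(driver_profile)


if __name__ == "__main__":
    # A "safe" driver is usable from a restricted namespace ...
    print(inline_volume_allowed("restricted", "restricted"))
    # ... while a driver that needs more trust is filtered out there.
    print(inline_volume_allowed("privileged", "restricted"))
```

Under this model the exit criterion corresponds to the first call: a driver with the lowest requirement is admitted in an unprivileged namespace with no SCC changes.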

Epic Goal

  • Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

Phase 1

  • Write platform “External” enhancement.
  • Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
  • Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".

Phase 2

  • Update OpenShift API with new platform and ensure all components have updated dependencies.
  • Update capabilities API to include coverage for all replaceable components.
  • Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

Phase 3

  • Update components based on identified changes from phase 1
    • Update Machine API operator to run core controllers in platform "External" mode.

Why is this important?

  • As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
  • The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

  1. A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
  2. A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear signal to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

  • Partners can read "External" platform enhancement and plan for their platform integrations.
  • Teams can view jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

  • Components running in cluster can detect the “External” platform through the Infrastructure config API
  • Components running in cluster react to “External” platform as if it is “None” platform
  • Partners can disable any of the platform specific components through the capabilities API

Phase 3

  • Components running in cluster react to the “External” platform based on their function.
    • for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
    • the specific component reactions are difficult to predict currently; these criteria could change based on the output of phase 1.
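
The Phase 2 criterion, reacting to "External" as if it were "None", can be sketched as a normalisation step that a component applies before its platform-specific logic. The function name is hypothetical; real components read the platform type from the Infrastructure config API.

```python
def effective_platform(platform_type: str) -> str:
    """Phase 2 behaviour: treat "External" exactly like "None";
    every other platform type passes through unchanged."""
    return "None" if platform_type == "External" else platform_type


if __name__ == "__main__":
    for platform in ("AWS", "None", "External"):
        print(platform, "->", effective_platform(platform))
```

Keeping the mapping in one place would make the Phase 3 change local: a component that later needs "External"-specific behaviour replaces this step rather than every platform check.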

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Identifying OpenShift Components for Install Flexibility

Open questions::

  1. Phase 1 requires talking with several component teams, the specific action that will be needed will depend on the needs of the specific component. At the least the components need to treat platform "External" as "None", but there could be more changes depending on the component (eg Machine API Operator running non-platform specific controllers).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

As defined in the External platform enhancement, a new platform is being added to OpenShift. To accommodate the phase 2 work, the CIO should be updated, if necessary, to react to the "External" platform in the same manner as it would for platform "None".

Please see the enhancement and the parent plan OCPBU-5 for more details about this process.

Why is this important?

In phase 2 (planned for 4.13 release) of the external platform enhancement, the new platform type will be added to the openshift/api packages. As part of staging the release of this new platform we will need to ensure that all operators react in a neutral way to the platform, as if it were a "None" platform to ensure the continued normal operation of OpenShift.

Scenarios

  1. As a user I would like to enable the External platform so that I can supplement OpenShift with my own container network options. To ensure proper operation of OpenShift, the cluster ingress operator should not react to the new platform or prevent my installation of the custom driver so that I can create clusters with my own topology.

Acceptance Criteria

We are working to create an External platform test which will exercise this mechanism, see OCPCLOUD-1782

Dependencies (internal and external)

  1. This will require OCPCLOUD-1777

Previous Work (Optional):

Open questions::

Done Checklist

  • CI Testing - we will perform manual test while waiting for OCPCLOUD-1782
  • Documentation - only developer docs need to be updated at this time
  • QE - test scenario should be covered by a cluster-wide install with the new platform type
  • Technical Enablement - n/a
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 
  • DOC - Downstream documentation merged: <link to meaningful PR>

As described in the epic, the CIO should be updated to react to the new "External" platform as it would for a "None" platform.

Epic Goal

As defined in the External platform enhancement, a new platform is being added to OpenShift. To accommodate the phase 2 work, the CIRO should be updated, if necessary, to react to the "External" platform in the same manner as it would for platform "None".

Please see the enhancement and the parent plan OCPBU-5 for more details about this process.

Why is this important?

In phase 2 (planned for 4.13 release) of the external platform enhancement, the new platform type will be added to the openshift/api packages. As part of staging the release of this new platform we will need to ensure that all operators react in a neutral way to the platform, as if it were a "None" platform to ensure the continued normal operation of OpenShift.

Scenarios

  1. As a user I would like to enable the External platform so that I can supplement OpenShift with my own Image Registry options. To ensure proper operation of OpenShift, the cluster image registry operator should not react to the new platform or prevent my installation of the custom driver so that I can create clusters with my own topology.

Acceptance Criteria

We are working to create an External platform test which will exercise this mechanism, see OCPCLOUD-1782

Dependencies (internal and external)

This will require OCPCLOUD-1777

Previous Work (Optional):

Open questions::

Done Checklist

  • CI Testing - we will perform manual test while waiting for OCPCLOUD-1782
  • Documentation - only developer docs need to be updated at this time
  • QE - test scenario should be covered by a cluster-wide install with the new platform type
  • Technical Enablement - n/a
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

As described in the epic, the CIRO should be updated to react to the new "External" platform as it would for a "None" platform.

 

Feature Overview

  • Enables OTA updates from OpenShift 4.12.x to OpenShift 4.13.x.

Goals

  • As a platform administrator, I want to upgrade my OpenShift cluster from a previous supported release to the current release, i.e. 4.12.x to 4.13.x.
  • Ensure upgrades work smoothly without impacting end user workloads (for HA clusters) from the previous release to the latest release for all supported OpenShift environments:
  • Connected and disconnected deployments
  • All support topologies (SNO, compact cluster, standard HA cluster, RWN)
  • All platforms and providers
  • Cloud and on-premises

Requirements

  • A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

Epic Goal

  • Provide a convenient way to migrate from a homogeneous to a heterogeneous cluster.

Why is this important?

  • So customers with an existing cluster can migrate to a heterogeneous payload rather than doing a fresh install, without needing to use oc adm upgrade --allow-explicit-upgrade --to-image "${PULLSPEC}".  OTA-658 and maybe some oc side tooling, if folks feel oc patch ... is too heavy (although see discussion in OTA-597 about policies for adding new oc subcommands).
  • So components (like which?) can make decisions (like what?) based on the "current" cluster architecture. OTA-659.

Scenarios

  1. Upgrade from a homogeneous release, e.g. 4.11.0-x86_64, to a heterogeneous release, e.g. 4.11.0-multi.
  2. Ensure that ClusterVersion spec has a new architecture field to denote desired architecture of the cluster
  3. Ensure ClusterVersionStatus populates a new architecture field denoting the current architecture of the cluster.
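
Scenarios 2 and 3 can be illustrated with a minimal sketch that builds a patch requesting a target architecture and then checks whether the status has caught up. The field paths (`spec.desiredUpdate.architecture`, `status.desired.architecture`) and the value `Multi` are assumptions made for illustration; the scenarios above only state that a new architecture field exists in the spec and in the status.

```python
import json


def architecture_patch(target="Multi"):
    """Build a merge-patch body that requests a target architecture.

    The path spec.desiredUpdate.architecture and the value "Multi" are
    assumptions; the epic only says a new spec field denotes the desired
    architecture.
    """
    return {"spec": {"desiredUpdate": {"architecture": target}}}


def current_architecture(cluster_version):
    """Read the reported architecture from a ClusterVersion-like dict.

    The path status.desired.architecture is likewise an assumption.
    """
    return cluster_version.get("status", {}).get("desired", {}).get("architecture", "")


def migration_pending(cluster_version, target="Multi"):
    """Return True while the status does not yet report the requested architecture."""
    return current_architecture(cluster_version) != target


# Serialized body that a client could send as a merge patch.
patch_body = json.dumps(architecture_patch())
```

Keeping the request (spec) and the observation (status) separate mirrors the two scenarios: the patch only expresses intent, and the migration counts as done once the status reports the same value.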

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Should the migration also be an upgrade or should it be two separate steps? I.e., migrate to the hetero release of the same version and then upgrade?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Per the spike "Add field in ClusterVersion spec to request the target architecture", create an "oc adm upgrade" sub-command that provides a convenient way to update a cluster from homogeneous to heterogeneous while maintaining the current version, i.e. there will not be an option to specify a version.

Goal:
Support migration from dual-stack IPv6 to single-stack IPv6.

Why is this important?
We have customers who want to deploy a dual stack cluster and then (eventually) migrate to single stack ipv6 once all of their ipv4 dependencies are eliminated. Currently this isn't possible because we only support ipv4-primary dual stack deployments. However, with the implementation of OPNET-1 we addressed many of the limitations that prevented ipv6-primary, so we need to figure out what remains to make this supported.

At the very least we need to remove the validations in the installer that require ipv4 to be the primary address. There will also be changes needed in dev-scripts to allow testing (an option to make the v6 subnets and addresses primary, for example).

Runtimecfg assumes ipv4-primary in some places today and we need to make that aware of whether a cluster is v4 or v6 primary.

In the IPI nodeip-configuration service we always prefer ipv4 as the primary node address. This will need to be made dynamic based on the order of networks configured.

The installer currently enforces ipv4-primary for dual stack deployments. We will need to remove or modify those validations to allow an ipv6-primary configuration.
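
As a rough illustration of making the primary address family "dynamic based on the order of networks configured", the sketch below derives the primary IP version from the first configured network and orders candidate node addresses accordingly. The function names and the input shape (a list of CIDR strings) are hypothetical; this is not the actual runtimecfg or nodeip-configuration code.

```python
import ipaddress


def primary_ip_version(machine_networks):
    """Return 4 or 6 depending on the family of the first configured network."""
    if not machine_networks:
        raise ValueError("at least one network is required")
    return ipaddress.ip_network(machine_networks[0]).version


def pick_node_ips(candidate_ips, machine_networks):
    """Order candidate node addresses so the primary family comes first."""
    primary = primary_ip_version(machine_networks)
    addrs = [ipaddress.ip_address(ip) for ip in candidate_ips]
    # sorted() is stable, so addresses keep their relative order within a family.
    ordered = sorted(addrs, key=lambda a: a.version != primary)
    return [str(a) for a in ordered]
```

With `["fd00::/64", "192.168.0.0/24"]` as the configured networks the IPv6 address is listed first; swapping the two networks restores the ipv4-primary behaviour.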

 
Goal:
API and implementation work to provide the cluster admin with an option in the IngressController API to use PROXY protocol with IBM Cloud load-balancers. 

Description:
This epic extends the IngressController API essentially by copying the option we added in NE-330. In that epic, we added a configuration option to use PROXY protocol when configuring an IngressController to use a NodePort service or host networking. With this epic (NE-1090), the same configuration option is added to use PROXY protocol when configuring an IngressController to use a LoadBalancer service on IBM Cloud.
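
A minimal sketch of what such a configuration could look like is shown below as a Python dictionary that can be serialized to JSON. The exact field names (`providerParameters`, `ibm`, `protocol`) and the values `IBM` and `PROXY` are assumptions for illustration; the epic text only says that a PROXY protocol option is added for LoadBalancer services on IBM Cloud.

```python
import json


def ibm_proxy_ingresscontroller(name="default"):
    """Build an IngressController-like manifest that requests PROXY protocol.

    Every field under endpointPublishingStrategy is an assumed shape,
    not confirmed by the epic description.
    """
    return {
        "apiVersion": "operator.openshift.io/v1",
        "kind": "IngressController",
        "metadata": {"name": name, "namespace": "openshift-ingress-operator"},
        "spec": {
            "endpointPublishingStrategy": {
                "type": "LoadBalancerService",
                "loadBalancer": {
                    "scope": "External",
                    "providerParameters": {
                        "type": "IBM",
                        # Assumed option name and value for PROXY protocol.
                        "ibm": {"protocol": "PROXY"},
                    },
                },
            }
        },
    }


manifest_json = json.dumps(ibm_proxy_ingresscontroller(), indent=2)
```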

 
Feature Overview

Create an Azure cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in Azure) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not yet have the tags, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

 
Goals

  • Functionality on Azure Tech Preview
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This epic covers the work to apply user-defined tags to the Azure resources created for an OpenShift cluster, available as Tech Preview.

The user should be able to define the Azure tags to be applied to the resources created during cluster creation by the installer and by the other operators that manage specific resources. The user defines the required tags in install-config.yaml while preparing the inputs for cluster creation. The tags are then made available in the status sub-resource of the Infrastructure custom resource, which cannot be edited but is available for user reference and is used by the in-cluster operators for tagging when resources are created.

Updating/deleting of tags added during cluster creation or adding new tags as Day-2 operation is out of scope of this epic.

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

Reference - https://issues.redhat.com/browse/RFE-2017

The installer creates the resources listed below during the create cluster phase. These resources should have the user-defined tags and the default OCP tag kubernetes.io/cluster/<cluster_name>:owned applied.

Resources List

Resource | Terraform API
Resource group | azurerm_resource_group
Image | azurerm_image
Load Balancer | azurerm_lb
Network Security Group | azurerm_network_security_group
Storage Account | azurerm_storage_account
Managed Identity | azurerm_user_assigned_identity
Virtual network | azurerm_virtual_network
Virtual machine | azurerm_linux_virtual_machine
Network Interface | azurerm_network_interface
Private DNS Zone | azurerm_private_dns_zone
DNS Record | azurerm_dns_cname_record

Acceptance Criteria:

  • Code linting, validation and best practices adhered to
  • The Azure resources created by the installer should have the user-defined tags as well as the default OCP tag.
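
The tagging rule above can be sketched as a small helper that combines the user-defined tags with the default OCP tag. The function name is hypothetical, and letting the default tag win over a conflicting user tag is an assumption; the text does not say how conflicts are handled.

```python
def tags_for_resource(cluster_name, user_tags):
    """Return the tags to apply to one Azure resource.

    The default tag kubernetes.io/cluster/<cluster_name>: owned comes from
    the text above. Writing it last, so that it cannot be overridden by a
    user tag with the same key, is an assumption made for this sketch.
    """
    tags = dict(user_tags)  # copy so the caller's mapping is not modified
    tags[f"kubernetes.io/cluster/{cluster_name}"] = "owned"
    return tags
```

For example, `tags_for_resource("demo", {"team": "storage"})` yields both the `team` tag and the `kubernetes.io/cluster/demo` tag with the value `owned`.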

The installer generates the Infrastructure CR in the manifests creation step of the cluster creation process, based on the user-provided input recorded in install-config.yaml. While generating the Infrastructure CR, platformStatus.azure.resourceTags should be updated with the user-provided tags (installconfig.platform.azure.userTags).

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • Infrastructure CR created by installer should have azure user defined tags if any, in status field.
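
The mapping from install-config to the Infrastructure CR status can be sketched as follows. The dictionary paths follow the names used above (platform.azure.userTags and platformStatus.azure.resourceTags); representing resourceTags as a list of key/value entries is an assumption, as are the function names.

```python
def resource_tags_from_install_config(install_config):
    """Convert platform.azure.userTags into a list of key/value entries.

    The list-of-entries shape is an assumption for illustration.
    """
    user_tags = install_config.get("platform", {}).get("azure", {}).get("userTags", {})
    return [{"key": k, "value": v} for k, v in sorted(user_tags.items())]


def infrastructure_cr(install_config):
    """Build an Infrastructure-like CR whose status carries the user tags, if any."""
    azure_status = {}
    tags = resource_tags_from_install_config(install_config)
    if tags:
        azure_status["resourceTags"] = tags
    return {
        "apiVersion": "config.openshift.io/v1",
        "kind": "Infrastructure",
        "metadata": {"name": "cluster"},
        "status": {"platformStatus": {"type": "Azure", "azure": azure_status}},
    }
```

When no userTags are given, the sketch leaves resourceTags out of the status entirely, which matches the "if any" wording of the acceptance criteria.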

Issues found by QE team during pre-merge tests are reported in QE Tracker, which should be fixed.

Acceptance criteria:

  • Update UTs, if required
  • Update enhancement, if required

cluster-config-operator makes the Infrastructure CRD available to the installer. The CRD is included in its container image from the openshift/api package, so the package must be updated to have the latest CRD.

The enhancement proposed for Azure tags support in OCP requires cluster-ingress-operator to add the Azure userTags available in the status sub-resource of the Infrastructure CR to the Azure DNS resources it creates.

cluster-ingress-operator should add Tags to the DNS records created.

Note: dnsrecords.ingress.operator.openshift.io and openshift-ingress CRD, usage to be identified.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

The enhancement proposed for Azure tags support in OCP requires machine-api-provider-azure to add the Azure userTags available in the status sub-resource of the Infrastructure CR to the Azure virtual machine resources and sub-resources it creates.

machine-api-provider-azure has a method CreateMachine() which creates the resources below, to which the tags should be applied:

  • ApplicationSecurityGroup
  • AvailabilitySet
  • Group
  • LoadBalancer
  • PublicIPAddress
  • RouteTable
  • SecurityGroup
  • VirtualMachineExtension
  • Interface
  • VirtualMachine
  • VirtualNetwork

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

 

Overview 

HyperShift came to life to serve multiple goals: some are main, near-term goals, and some are secondary goals that serve well in the long term.

Main Goals for hosted control planes (HyperShift)

  • Optimize OpenShift for cost/footprint, which improves our competitive stance against the *KSes
  • Establish separation of concerns which makes it more resilient for SRE to manage their workload clusters (be it security, configuration management, etc).
  • Simplify and enhance multi-cluster management experience especially since multi-cluster is becoming an industry need nowadays. 

Secondary Goals

HyperShift opens up doors to penetrate the market. HyperShift enables true hybrid (CP and Workers decoupled, mixed IaaS, mixed Arch,...). An architecture that opens up more options to target new opportunities in the cloud space. For more details on this one check: Hosted Control Planes (aka HyperShift) Strategy [Live Document]

 

Hosted Control Planes (HyperShift) Map 

To bring hosted control planes to our customers, we need the means to ship it. Today MCE is how HyperShift is shipped and installed so that customers can use it. There are two main customers for hosted-control-planes:

 

  • Self-managed: In that case, Red Hat would provide hosted control planes as a service that is managed and SREed by the customer for their tenants (hence “self”-managed). In this management model, our external customers are the direct consumers of the multi-cluster control plane as a service. Once MCE is installed, they can start to self-service dedicated control planes.

 

  • Managed: This is OpenShift as a managed service, today we only “manage” the CP, and share the responsibility for other system components, more info here. To reduce management costs incurred by service delivery organizations which translates to operating profit (by reducing variable costs per control-plane), as well as to improve user experience, lower platform overhead (allow customers to focus mostly on writing applications and not concern themselves with infrastructure artifacts), and improve the cluster provisioning experience. HyperShift is shipped via MCE, and delivered to Red Hat managed SREs (same consumption route). However, for managed services, additional tooling needs to be refactored to support the new provisioning path. Furthermore, unlike self-managed where customers are free to bring their own observability stack, Red Hat managed SREs need to observe the managed fleet to ensure compliance with SLOs/SLIs/…

 

As you may have noticed, MCE is the delivery mechanism for both management models. The difference between managed and self-managed is the consumer persona: for self-managed it's the customer SRE, and for managed it's the RH SRE.

High-level Requirements

For us to ship HyperShift in the product (as hosted control planes) in either management model, there is a necessary readiness checklist that we need to satisfy. Below are the high-level requirements needed before GA: 

 

  • Hosted control planes fits well with our multi-cluster story (with MCE)
  • Hosted control planes APIs are stable for consumption  
  • Customers are not paying for control planes/infra components.  
  • Hosted control planes has an HA and a DR story
  • Hosted control planes is in parity with top-level add-on operators 
  • Hosted control planes reports metrics on usage/adoption
  • Hosted control planes is observable  
  • HyperShift as a backend to managed services is fully unblocked.

 

Please also have a look at our What are we missing in Core HyperShift for GA Readiness? doc. 

Hosted control planes fits well with our multi-cluster story

Multi-cluster is becoming an industry need today not because this is where trend is going but because it’s the only viable path today to solve for many of our customer’s use-cases. Below is some reasoning why multi-cluster is a NEED:

 

 

As a result, multi-cluster management is a defining category in the market where Red Hat plays a key role. Today Red Hat solves for multi-cluster via RHACM and MCE. The goal is to simplify fleet management complexity by providing a single pane of glass to observe, secure, police, govern, configure a fleet. I.e., the operand is no longer one cluster but a set, a fleet of clusters. 

HyperShift's logically centralized architecture, as well as its native separation of concerns and superior cluster lifecycle management experience, makes it a great fit as the foundation of our multi-cluster management story.

Thus the following stories are important for HyperShift: 

  • When lifecycling OpenShift clusters (for any OpenShift form factor) on any of the supported providers from MCE/ACM/OCM/CLI as a Cluster Service Consumer (RH managed SRE, or self-managed SRE/admin):
  • I want to be able to use a consistent UI so I can manage and operate (observe, govern,...) a fleet of clusters.
  • I want to specify HA constraints (e.g., deploy my clusters in different regions) while ensuring acceptable QoS (e.g., latency boundaries) to reduce any potential downtime for my workloads.
  • When operating OpenShift clusters (for any OpenShift form factor) on any of the supported providers from MCE/ACM/OCM/CLI as a Cluster Service Consumer (RH managed SRE, or self-managed SRE/admin):
  • I want to be able to back up any critical data so I am able to restore it in case of hosting service cluster (management cluster) failure.

Refs:

Hosted control planes APIs are stable for consumption.

 

HyperShift is the core engine that will be used to provide hosted control-planes for consumption in managed and self-managed. 

 

Main user story:  When life cycling clusters as a cluster service consumer via HyperShift core APIs, I want to use a stable/backward compatible API that is less susceptible to future changes so I can provide availability guarantees. 

 

Ref: What are we missing in Core HyperShift for GA Readiness?

Customers are not paying for control planes/infra components. 

 

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumptions

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure, and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

HyperShift - proposed cuts from data plane

HyperShift has an HA and a DR story

When operating OpenShift clusters (for any OpenShift form factor) from MCE/ACM/OCM/CLI as a Cluster Service Consumer  (RH managed SRE, or self-manage SRE/admin) I want to be able to migrate CPs from one hosting service cluster to another:

  • as means for disaster recovery in the case of total failure
  • so that scaling pressures on a management cluster can be mitigated or a management cluster can be decommissioned.

More information: 

 

Hosted control planes reports metrics on usage/adoption

To understand usage patterns and inform our decision making for the product, we need to be able to measure adoption and assess usage.

See Hosted Control Planes (aka HyperShift) Strategy [Live Document]

Hosted control plane is observable  

Whether it's managed or self-managed, it’s pertinent to report health metrics so that we can create meaningful Service Level Objectives (SLOs) and alert on failures to meet our availability guarantees. This is especially important for our managed services path.

HyperShift is in parity with top-level add-on operators

https://issues.redhat.com/browse/OCPPLAN-8901 

Unblock HyperShift as a backend to managed services

HyperShift for managed services is a strategic company goal as it improves usability, feature, and cost competitiveness against other managed solutions, and because managed services/consumption-based cloud services is where we see the market growing (customers are looking to delegate platform overhead). 

 

We should make sure our SD milestones are unblocked by the core team. 

 

Note 

This feature reflects HyperShift core readiness to be consumed. When all related EPICs and stories in this EPIC are complete HyperShift can be considered ready to be consumed in GA form. This does not describe a date but rather the readiness of core HyperShift to be consumed in GA form NOT the GA itself.

- GA date for self-managed will be factoring in other inputs such as adoption, customer interest/commitment, and other factors. 
- GA dates for ROSA-HyperShift are on track, tracked in milestones M1-7 (have a look at https://issues.redhat.com/browse/OCPPLAN-5771)

Epic Goal*

The goal is to split client certificate trust chains from the global Hypershift root CA.

 
Why is this important? (mandatory)

This is important to:

  • assure a workload can be run on any kind of OCP flavor
  • reduce the blast radius in case of a sensitive material leak
  • separate trust to allow more granular control over client certificate authentication

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. I would like to be able to run my workloads on any OpenShift-like platform.
    My workloads allow components to authenticate using client certificates based
    on a trust bundle that I am able to retrieve from the cluster.
  2. I don't want my users to have access to any CA bundle that would allow them
    to trust a random certificate from the cluster for client certificate authentication.

 
Dependencies (internal and external) (mandatory)

Hypershift team needs to provide us with code reviews and merge the changes we are to deliver

Contributing Teams(and contacts) (mandatory) 

  • Development - OpenShift Auth, Hypershift
  • Documentation - OpenShift Auth Docs team
  • QE - OpenShift Auth QE
  • PX - I have no idea what PX is
  • Others - others

Acceptance Criteria (optional)

The serviceaccount CA bundle automatically injected to all pods cannot be used to authenticate any client certificate generated by the control-plane.

Drawbacks or Risk (optional)

Risk: there is a throbbing time pressure as this should be delivered before first stable Hypershift release

Done - Checklist (mandatory)

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.

Goals (aka. expected user outcomes)

To enable full support for control plane machine sets on Azure

 

Requirements (aka. Acceptance Criteria):

  • Generate CPMS for upgraded clusters
  • Document support for upgraded clusters
  • Ensure E2E testing for Azure clusters

Out of Scope

Any other cloud platforms

Background

Feature created from split of overarching Control Plane Machine Set feature into single release based effort

 

Customer Considerations

n/a

 

Documentation Considerations

Nothing outside documentation that shows the Azure platform is supported as part of Control Plane Machine Sets

 

Interoperability Considerations

n/a

Goal:

Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem:

There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important:

  • Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.

Lifecycle Information:

  • Core

Previous Work:

Dependencies:

  • Etcd operator

Prioritized epics + deliverables (in scope / not in scope):

Estimate (XS, S, M, L, XL, XXL):

 

 

 

User Story:

As a developer, I want to be able to:

  • Create Azure control plane nodes using MachineSets.

so that I can achieve

  • More control over the nodes using the MachineAPI Operator.

Acceptance Criteria:

Description of criteria:

  • New CRD ControlPlaneMachineSet is used and populated.
  • New manifest is created for the ControlPlaneMachineSet.
  • Fields required for the CRD are set.

(optional) Out of Scope:

 

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The ovnk manifests in CNO are not up to date; we want to sync them with the manifests in the microshift repo.

With the enablement of OpenShift clusters with mixed architecture capable compute nodes it is necessary to have support for manifest listed images so the correct images/binaries can be pulled onto the relevant nodes.

Included in this should be the ability to

  • use oc debug successfully on all node types
  • support manifest listed images in the internal registry
  • have the ability to import manifest listed images

Epic Goal

  • Complete manifest lists support on image streams. The work was initiated on epic IR-192

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Acceptance Criteria

  • The pruner should delete dangling manifest lists (or OCI index) json from storage (dangling manifest lists are manifest lists that are not associated with an image object)
  • The pruner should keep manifest lists (or OCI index) json in storage for manifest lists that are associated with an image object
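
The two criteria above amount to a set difference between what is in storage and what Image objects still reference. A minimal sketch, with hypothetical function and parameter names and plain digest strings standing in for the real storage and API lookups:

```python
def dangling_manifest_lists(stored_digests, image_object_digests):
    """Return digests of stored manifest lists that no Image object refers to.

    stored_digests: digests of manifest list (or OCI index) JSON found in storage.
    image_object_digests: digests that have an associated Image object.
    """
    return sorted(set(stored_digests) - set(image_object_digests))


def prune(stored_digests, image_object_digests):
    """Split stored manifest lists into (kept, deleted) according to the criteria."""
    deleted = dangling_manifest_lists(stored_digests, image_object_digests)
    kept = sorted(set(stored_digests) - set(deleted))
    return kept, deleted
```

Manifest lists that are still associated with an Image object always end up in `kept`, so a pruning pass built this way cannot remove data that is still in use.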

Documentation

Nothing new to document.

 

Acceptance Criteria

  • Sub-manifest images are shown when a manifest list is described

Open Questions

  • Should oc display only children's SHAs or should it retrieve some information from children images?
    • Showing only SHA + platform (OS+arch) should suffice.
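
A sketch of the "SHA + platform" output is given below. It reads the `manifests` array of a manifest list (or OCI index) document and prints one line per child; the function name and the output layout are illustrative only and do not reflect the actual oc implementation.

```python
import json


def describe_sub_manifests(manifest_list_json):
    """Return one 'digest  os/arch' line per sub-manifest of a manifest list."""
    doc = json.loads(manifest_list_json)
    lines = []
    for manifest in doc.get("manifests", []):
        platform = manifest.get("platform", {})
        os_name = platform.get("os", "unknown")
        arch = platform.get("architecture", "unknown")
        lines.append(f"{manifest['digest']}  {os_name}/{arch}")
    return lines


# Placeholder digests, not real images.
example = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:1111", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:2222", "platform": {"os": "linux", "architecture": "arm64"}},
    ],
})
```

Only data already present in the manifest list is used, so no child image has to be fetched to produce this listing.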

Acceptance Criteria

  • There should be information about the image index in the web console.

Open Questions

  • What kind of information should be there? Should it be for ImageStreamImages, or for Images, or for both? Should child manifests be clickable?
    • The answer to showing information about ImageStreamImages or Images depends on what is currently shown in the console - we should go with what's there.

ACCEPTANCE CRITERIA

  • Pushing a manifest list to the image registry should result in an image stream created for the manifest list
  • Image objects should be created for the manifest list, as well as all of its sub-manifests
  • The Image object for the manifest list should contain references to all of its sub-manifests under the dockerImageManifests field
  • Pulling a sub-manifest of a manifest list by digest should work when the user has access to the image stream

 

Notes:

  • When a manifest is pushed by sha an Image object should be created
  • You can use `skopeo copy` with the `--all` flag to push a manifest list and all its sub-manifests to the registry
  • Authorization needs to work the same way it does for images created via imagestreamimport

ACCEPTANCE CRITERIA

  • The cmd line flag should result in the ImageStreamImport and ImageStream objects specifying the API manifest list flag

DOCUMENTATION

  • The new flag should be mentioned in the product documentation.

OPEN QUESTIONS

Enable OpenShift to support the Shielded VMs capability on Google Cloud

 

 

 

 

Epic Goal

  • Support OpenShift and the IPI workflow on GCP to use the Shielded VMs feature from Google Cloud

Why is this important?

  • Many Google Cloud customers want to leverage the Shielded VMs feature while deploying OpenShift on GCP

Scenarios

  1. As a user, I want to be able to instruct the OpenShift Installer to use Shielded VMs while deploying the platform on Google Cloud so I can use the Shielded VMs feature from GCP on every Node

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. OCPBUGS-4522 coreos fail to boot on GCP when enabling secure boot

Open questions::

  1. Should we add API to support all shielded VMs options (Secure Boot, vTPM, Integrity Monitoring) or just Secure Boot? 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

This feature collects the items/enhancements that are planned for the V4.13 release. As we are still transitioning from the ODF to the OCPVE team, it's not a nice "OCPBU" feature as it should be. Please take a look at the attached EPICS to understand the actual topics.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Reduce CPU
    • Remove CPU limits from all containers
    • Reduce the CPU requests for all the pods to the "at rest" values shown in the attached analysis document.
  • Reduce Memory (same as with CPU)
  • Add workload partitioning annotation to the pods so that we can use workload partitioning
  • Reduce Image sizes
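
A minimal sketch of the CPU-related goals above, applied to a pod template represented as a plain dict. The function name `slim_pod_template` and the `at_rest_cpu` mapping are hypothetical. The annotation key and value follow the workload partitioning convention as we understand it and should be checked against the product documentation.

```python
import copy
import json

# Workload partitioning annotation (assumed key and value, see lead-in).
MANAGEMENT_ANNOTATION = "target.workload.openshift.io/management"


def slim_pod_template(template: dict, at_rest_cpu: dict) -> dict:
    """Return a copy with CPU limits removed, CPU requests lowered, and the annotation added."""
    out = copy.deepcopy(template)
    annotations = out.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[MANAGEMENT_ANNOTATION] = json.dumps({"effect": "PreferredDuringScheduling"})
    for container in out["spec"]["containers"]:
        resources = container.setdefault("resources", {})
        limits = resources.get("limits", {})
        limits.pop("cpu", None)          # remove the CPU limit, keep other limits
        if not limits:
            resources.pop("limits", None)
        if container["name"] in at_rest_cpu:
            # Lower the request to the measured "at rest" value.
            resources.setdefault("requests", {})["cpu"] = at_rest_cpu[container["name"]]
    return out
```

The input template is left untouched, so the before and after manifests can be compared when reviewing the footprint reduction.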

Resource measurements: https://docs.google.com/spreadsheets/d/1eLBkG4HhlKlxRlB9H8kjuXs5F3zzVOPfkOOhFIOrfzU/edit#gid=0

Why is this important?

  • LVMS running on edge systems requires a smaller footprint.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview
There are specific opportunities to improve the administrative experience when using the machine API. This topic covers setting up appropriate default values (saving time and avoiding errors) and validating those values early to allow quick detection of errors.

There are also specific areas where we lack capabilities that are key to the end user experience.

Goals

  • Include reasonable defaults where calls are made to APIs
  • Where possible validate calls made to APIs
  • Do this for all cloud provider interactions
  • Address key parts of missing functionality in subcomponents

Requirements

Requirement                                    Notes    isMvp?
Implement validation/defaulting for AWS                 Yes
Implement validation/defaulting for GCP
Implement validation/defaulting for Azure
Implement validation/defaulting for vSphere

Out of Scope

n/a

Background, and strategic fit
This type of checking can cut down on errors, reducing outages and also time wasted on identifying issues

Assumptions

Customer Considerations
Customers must be ready to handle cases where their settings are not accepted.

Documentation Considerations

  • Target audience: cluster admins
  • Updated content: update docs to mention the defaults selected and the validation that occurs, also what to do in the event validation fails.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story

As an OpenShift admin on GCP I want to replace my control plane node. Currently, the new control plane machine is not assigned to an instance group. This prevents the internal load balancer from working on the new node until you set the instance group manually using the GCP console, which in turn prevents automatic control plane replacement using the CPMSO on GCP.

Background

CAPI implementation:
https://github.com/openshift/cluster-api-provider-gcp/blob/f6b71187180cb35b93caee08eddbccb66cb28ab6/cloud/services/compute/instances/reconcile.go#L141

Steps

  • Create abstraction around GCP instance group API
  • Register machine to instance group on create
  • Remove machine from instance group on delete
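
The steps above can be sketched as a small abstraction with an in-memory fake. All names here (`InstanceGroupAPI`, `InMemoryInstanceGroups`, `on_machine_create`, `on_machine_delete`) are hypothetical and stand in for a wrapper around the real GCP instance group API, which this sketch does not call.

```python
from typing import Dict, Protocol, Set


class InstanceGroupAPI(Protocol):
    """Hypothetical abstraction over the GCP instance group API."""

    def list_instances(self, group: str) -> Set[str]: ...
    def add_instance(self, group: str, instance: str) -> None: ...
    def remove_instance(self, group: str, instance: str) -> None: ...


class InMemoryInstanceGroups:
    """Fake implementation, useful for unit tests of the reconcile logic."""

    def __init__(self) -> None:
        self._groups: Dict[str, Set[str]] = {}

    def list_instances(self, group: str) -> Set[str]:
        return set(self._groups.get(group, set()))

    def add_instance(self, group: str, instance: str) -> None:
        self._groups.setdefault(group, set()).add(instance)

    def remove_instance(self, group: str, instance: str) -> None:
        self._groups.get(group, set()).discard(instance)


def on_machine_create(api: InstanceGroupAPI, group: str, instance: str) -> None:
    """Register the machine with its instance group; safe to call repeatedly."""
    if instance not in api.list_instances(group):
        api.add_instance(group, instance)


def on_machine_delete(api: InstanceGroupAPI, group: str, instance: str) -> None:
    """Remove the machine from its instance group; safe to call repeatedly."""
    if instance in api.list_instances(group):
        api.remove_instance(group, instance)
```

Keeping both hooks idempotent means a controller can retry them on every reconcile without tracking whether the previous attempt succeeded.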

Stakeholders

  • Cluster Infrastructure (CPMSO)

Definition of Done

  • Control plane machines are reconciled into correct instance group
  • Docs
    • N/A
  • Testing
    • Manual node replacement as CPMS is currently disabled on GCP


Feature Overview

Extend the Workload Partitioning feature to support multi-node clusters.

Goals

Customers running RAN workloads on C-RAN Hubs (i.e. multi-node clusters) that want to maximize the cores available to the workloads (DU) should be able to utilize WP to isolate CP processes to reserved cores.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Add support to Installer to bootstrap cluster with the configurations for CPU Partitioning based off of the infrastructure flag and NTO generated configurations.

We need to call NTO bootstrap render during the bootstrap cycle. This will follow the same pattern that MCO follows and other components that render during bootstrap.

Since this feature must be turned on ONLY at install time and cannot be turned off, the best place we've found to set the Infrastructure.Status option is the OpenShift installer. This has a few benefits, the primary one being that it simplifies how this feature gets used by upstream teams such as Assisted Installer and ZTP. If we expose this option in the install config, it becomes trivial for those consumers to support turning on this feature at install time.

We'll need to update the openshift installer configuration option to support a flag for CPU Partitioning at install time.

We'll need to add a new flag to the InstallConfig:

cpuPartitioningMode: None | AllNodes
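
A minimal sketch of how such a flag could be validated once the install config has been parsed into a dict. The function name is hypothetical. The sketch assumes the second value is spelled AllNodes, and that an omitted field defaults to None (partitioning off).

```python
# Allowed values; the "AllNodes" spelling is an assumption (see lead-in).
VALID_CPU_PARTITIONING_MODES = ("None", "AllNodes")


def cpu_partitioning_mode(install_config: dict) -> str:
    """Return the requested mode, defaulting to "None" when the field is absent."""
    mode = install_config.get("cpuPartitioningMode", "None")
    if mode not in VALID_CPU_PARTITIONING_MODES:
        raise ValueError(
            f"cpuPartitioningMode must be one of {VALID_CPU_PARTITIONING_MODES}, got {mode!r}"
        )
    return mode
```

Rejecting unknown values early matters here because the setting can only be chosen at install time and cannot be changed afterwards.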

Update admission controller to remove check for SNO

Repo Link

Add Node Admission controller to stop nodes from joining that do not have CPU Partitioning turned on.
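
The admission rule can be sketched as a pure function over the cluster mode and the joining node. How a node signals that CPU partitioning is turned on is not stated here. As a labeled assumption, the sketch treats a node as partitioned when it advertises a `management.workload.openshift.io/cores` resource in its capacity. The function name `admit_node` is hypothetical.

```python
# Assumed signal that a node was configured for CPU partitioning (see lead-in).
PARTITIONING_RESOURCE = "management.workload.openshift.io/cores"


def admit_node(cluster_mode: str, node: dict) -> bool:
    """Allow the node to join unless the cluster requires partitioning and the node lacks it."""
    if cluster_mode != "AllNodes":
        return True  # partitioning is off cluster-wide, so nothing to enforce
    capacity = node.get("status", {}).get("capacity", {})
    return PARTITIONING_RESOURCE in capacity
```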

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

  • Implement a cluster template user experience in ACM

Why is this important?

  • OpenShift installation is hard, cluster templates can ease the UX by pre-defining install configurations
  • Admins can constrain cluster users to pre-defined infrastructure configurations
  • The Create cluster wizard becomes tedious after repeated use

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. the WIP end to end design flow: https://docs.google.com/presentation/d/1RIJ8i7ZKp8TidYWKq3njjhEcMMA-ImO0M5CtJBWIWIU/edit#slide=id.gc16973c501_0_4968
  2. Full designs: https://marvelapp.com/prototype/6g89ci7/screen/86197674

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The OCP console has MarkdownView, which can be used to render Markdown. We should expose it so that our plugin can use it to render Markdown-based descriptions of ClusterTemplates.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

For users who are using OpenShift but have not yet begun to explore multicluster and what we offer them.

I'm investigating where Learning paths are today and what is required.

As a user I'd like to have a learning path for how to get started with Multicluster:

  • Install MCE
  • Create multiple clusters
  • Use HyperShift
  • Provide access to cluster creation to devs via templates
  • Scale up to ACM/ACS (OPP?)

Status
https://github.com/patternfly/patternfly-quickstarts/issues/37#issuecomment-1199840223

Goal: Resources provided via the Dynamic Resource Allocation Kubernetes mechanism can be consumed by VMs.

Details: Dynamic Resource Allocation

Goal

Come up with a design of how resources provided by Dynamic Resource Allocation can be consumed by KubeVirt VMs.

Description

The Dynamic Resource Allocation (DRA) feature is an alpha API in Kubernetes 1.26, which is the base for OpenShift 4.13.
This feature provides the ability to create ResourceClaim and ResourceClass objects to request access to Resources. This is similar to the dynamic provisioning of PersistentVolume via PersistentVolumeClaim and StorageClass.

NVIDIA has been a lead contributor to the KEP and already has an initial implementation of a DRA driver and plugin, with a nice demo recording. NVIDIA is expecting to have this DRA driver available in CY23 Q3 or Q4, so likely in NVIDIA GPU Operator v23.9, around OpenShift 4.14.

When asked about the availability of MIG-backed vGPU for Kubernetes, NVIDIA said that the timeframe is not decided yet, because it will likely use DRA for the creation of MIG devices and their registration with the vGPU host driver. The MIG-backed vGPU feature for OpenShift Virtualization will then likely require DRA support to request vGPU resources for the VMs.

Not having MIG-backed vGPU is a risk for OpenShift Virtualization adoption in GPU use cases, such as virtual workstations for rendering with Windows-only software. Customers who want a mix of passthrough, time-based vGPU and MIG-backed vGPU will prefer competitors who offer the full range of options. The certification of NVIDIA solutions like NVIDIA Omniverse will also be blocked, despite its great potential to increase OpenShift consumption, as it uses RTX/A40 GPUs for virtual workstations (not certified by NVIDIA on OpenShift Virtualization yet) and A100/H100 for physics simulation, both use cases probably leveraging vGPUs [7]. There are many necessary conditions for that to happen, and MIG-backed vGPU support is one of them.

User Stories

  • GPU consumption optimization
    "As an Admin, I want to let NVIDIA GPU DRA driver provision vGPUs for OpenShift Virtualization, so that it optimizes the allocation with dynamic provisioning of time or MIG backed vGPUs"
  • GPU mixed types per server
    "As an Admin, I want to be able to mix different types of GPU to collocate different types of workloads on the same host, in order to improve multi-pod/stack performance."

Non-Requirements

  • List of things not included in this epic, to alleviate any doubt raised during the grooming process.

Notes

  • Any additional details or decisions made/needed

References

Done Checklist

Who   What                                                   Reference
DEV   Upstream roadmap issue (or individual upstream PRs)    <link to GitHub Issue>
DEV   Upstream documentation merged                          <link to meaningful PR>
DEV   gap doc updated                                        <name sheet and cell>
DEV   Upgrade consideration                                  <link to upgrade-related test or design doc>
DEV   CEE/PX summary presentation                            label epic with cee-training and add a <link to your support-facing preso>
QE    Test plans in Polarion                                 <link or reference to Polarion>
QE    Automated tests merged                                 <link or reference to automated tests>
DOC   Downstream documentation merged                        <link to meaningful PR>

1. Proposed title of this feature request
Show node-role.kubernetes.io in Monitoring Dashboard to easily identify nodes by role

2. What is the nature and description of the request?
In Monitoring Dashboards such as /monitoring/dashboards/grafana-dashboard-k8s-resources-node, /monitoring/dashboards/grafana-dashboard-node-cluster-rsrc-use and /monitoring/dashboards/grafana-dashboard-node-rsrc-use it would be helpful to add node-role.kubernetes.io to the OpenShift - Node selection and view, to easily understand what role the OpenShift - Node has.

Especially in Cloud environments, it's hard to keep OpenShift - Node(s) separated and understand what the respective OpenShift - Node is doing. Showing the node-role.kubernetes.io could help here, as it would make it possible to distinguish between important OpenShift - Nodes, such as Control-Plane Node, Infra or simple worker Node.

Additionally including https://kubernetes.io/docs/reference/labels-annotations-taints/#failure-domainbetakubernetesiozone to select among the available Availability Zones (if any) would be great too, to potentially look at the data for a specific node role in a specific availability zone.

3. Why does the customer need this? (List the business requirements here)
For troubleshooting and observability, administrators often have to switch between the CLI and the console to find the node role and availability zone they want to review. Having all of this integrated in the Web-UI will improve observability and allow better filtering when looking into specific nodes, availability zones, etc.

4. List any affected packages or components.
OpenShift - Management Console

Epic Goal

Our dashboards should allow filtering data by node attributes.

We can facilitate this by extracting node labels from PromQL metrics and adding an option to filter by these labels.

 

Why is this important?

From reporter: "In Monitoring Dashboards such as /monitoring/dashboards/grafana-dashboard-k8s-resources-node, /monitoring/dashboards/grafana-dashboard-node-cluster-rsrc-use and /monitoring/dashboards/grafana-dashboard-node-rsrc-use it would helpful to add node-role.kubernetes.io to the OpenShift - Node selection and view to easily understand what role the OpenShift - Node has."

> "For troubleshooting purpose and observability reason, it's often required to switch between CLI and console to understand what node-role and availability zone Administrator want to review. Having this all integrated in the Web-UI will improve observability and allow better filtering when looking into specific nodes, availability zones, etc."

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/OU-91

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Monitoring Dashboards such as /monitoring/dashboards/grafana-dashboard-k8s-resources-node, /monitoring/dashboards/grafana-dashboard-node-cluster-rsrc-use and /monitoring/dashboards/grafana-dashboard-node-rsrc-use should have a node role filter.

This role can be extracted from a label using label_replace
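
To show what label_replace does to a series, here is a small Python emulation of its semantics as documented for PromQL: the regex must match the whole source label value, `$1` refers to the first capture group, and a non-matching series is returned unchanged. The source label name `node_role` and its value format are hypothetical. The dashboards' actual queries are not shown in this card.

```python
import re
from typing import Dict


def label_replace(labels: Dict[str, str], dst: str, replacement: str,
                  src: str, regex: str) -> Dict[str, str]:
    """Emulate PromQL label_replace(v, dst, replacement, src, regex) on one label set."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # Translate $1 / ${1} style references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    value = match.expand(template)
    out = dict(labels)
    if value:
        out[dst] = value
    else:
        out.pop(dst, None)  # an empty result removes the destination label
    return out
```

With a hypothetical series labeled `node_role="node-role.kubernetes.io/worker"`, calling `label_replace(series, "role", "$1", "node_role", r"node-role\.kubernetes\.io/(.+)")` adds `role="worker"`, which a dashboard variable could then filter on.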

I have not made any changes to the Node Exporter / USE Method / Cluster dashboard, as discussed in the parent epic: https://issues.redhat.com/browse/MON-2845?focusedCommentId=21681945&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21681945.

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debuggability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
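
As a rough illustration of what a point-to-point connectivity capability checks at its simplest, the sketch below probes TCP reachability from the current host to a set of named endpoints. The function names and the target mapping are hypothetical. This is not the design of any existing component.

```python
import socket
from typing import Dict, Tuple


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True when a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(targets: Dict[str, Tuple[str, int]], timeout: float = 1.0) -> Dict[str, bool]:
    """Probe every named target and report reachability per name."""
    return {name: can_connect(host, port, timeout) for name, (host, port) in targets.items()}
```

Recording such results continuously per source and destination is what turns an isolated bug report into a detectable connectivity outage.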

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

OCP/Telco Definition of Done

Epic Template descriptions and documentation.

Epic Goal

Why is this important?

  • This regression is a major performance and stability issue and it has happened once before.

Drawbacks

  • The E2E test may be complex because it has to determine which DNS pods are responding to DNS requests. This is straightforward using the chaos plugin.

Scenarios

  • CI Testing

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. SDN Team

Previous Work (Optional):

  1. N/A

Open questions::

  1. Where do these E2E tests go? SDN Repo? DNS Repo?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • To refactor various unit tests in cluster-ingress-operator to align with the desired unit test standards. The unit tests need various cleanups to meet the standards of the network edge, such as:
    • Using t.Run in all unit tests for sub-test capabilities
    • Removing extraneous test cases
    • Fixing incorrect error messages

Why is this important?

  • Maintaining standards in unit tests is important for the debug-ability of our code

Scenarios

  1. ...

Acceptance Criteria

  • Unit tests generally meet our software standards

Dependencies (internal and external)

  1.  

Previous Work (Optional):

  1. For shift week, Miciah provided a handful of commits (https://github.com/Miciah/cluster-ingress-operator/commits/gateway-api) that were the motivation to create this epic.

Open questions::

  1. N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Provide a long term solution to SELinux context labeling in OCP.

 
Why is this important? (mandatory)

As of today, when SELinux is enabled, the PV's files are relabeled when the PV is attached to the pod. This can cause timeouts when the PV contains a lot of files, and it can overload the storage backend.

https://access.redhat.com/solutions/6221251 provides a few workarounds until the proper fix is implemented. Unfortunately these workarounds are not perfect, and we need a long-term, seamless, optimised solution.

This feature tracks the long-term solution where the PV FS will be mounted with the right SELinux context, which avoids relabeling every file.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. Apply new context when there is none
  2. Change context of all files/folders when changing context
  3. RWO & RWX PVs
    1. ReadWriteOncePod PVs first
    2. RWX PV in a second phase

As we are relying on the mount context, there should not be any relabeling (chcon), because all files / folders will inherit the context from the mount context.

More on design & scenarios in the KEP and related epic STOR-1173

Dependencies (internal and external) (mandatory)

None for the core feature

However the driver will have to set SELinuxMountSupported to true in the CSIDriverSpec to enable this feature. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - 
  • Others -

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

This Epic is to track upstream work in the Storage SIG community

This Epic is to track the SELinux specific work required. fsGroup work is not included here.

Goal: 

Continue contributing to and help move along the upstream efforts to enable recursive permissions functionality.

Finish current SELinuxMountReadWriteOncePod feature upstream:

  • Implement it in all volume plugins (current alpha has just iSCSI and CSI)
  • Add e2e tests and fix all tests that don't work well with SELinux
  • Implement necessary changes in volume reconstruction to also reconstruct the SELinux context.

The feature is probably going to stay alpha upstream.

Problem: 

Recursive permission change takes very long for fsGroup and SELinux. For volumes with many small files, Kubernetes currently does a chown for every file on the volume (due to fsGroup). Similarly, container runtimes (such as CRI-O) perform a chcon of every file on the volume due to the SCC's SELinux context. Data on the volume may already have the correct GID/SELinux context, so Kubernetes needs a way to detect this automatically to avoid the long delay.

Why is this important: 

  • A user wants to bring their pod online quickly and efficiently.  

Dependencies (internal and external):

 

Prioritized epics + deliverables (in scope / not in scope):

Estimate (XS, S, M, L, XL, XXL):

 

Previous Work:

Customers:

Open questions:

  •  

Notes:

As an OCP developer (and as an OCP user in the future), I want all CSI drivers shipped as part of OCP to support mounting with -o context=XYZ, so I can test with CSIDriver.SELinuxMount: true (or so my pods run without CRI-O recursively relabeling my volume).

 

In detail:

  • For CSI drivers based on block devices, pass the host's /etc/selinux and /sys/fs/ to the CSI driver container on the node as HostPath volumes
  • For CSI drivers based on NFS / CIFS: do the same as for block volumes (it won't harm the driver in any way), but investigate if these drivers can actually run with CSIDriver.SELinuxMount: true.

Details: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#selinux-support-in-volumes

 

Exit criteria:

  • Verify that CSI drivers shipped by OCP based on block volumes mount volumes with -o context=xyz instead of relabeling the volumes by CRI-O. That should happen when all these conditions are satisfied:
    • SELinuxMountReadWriteOncePod and ReadWriteOncePod feature gates are enabled
    • CSIDriver.SELinuxMount is set to true manually for the CSI driver. OCP will not do it by default in 4.13, because it requires the alpha feature gates from the previous bullet.
    • PVC has AccessMode: [ReadWriteOncePod] 
    • Pod has SELinux context explicitly assigned, i.e. pod.spec.securityContext (or pod.spec.containers[*].securityContext) has seLinuxOptions set, incl. level (based on SCC, OCP might do it automatically)
  • This is an alpha / dev preview feature, so QE might be done when graduating to Beta / tech preview.
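
The conditions in the exit criteria can be written down as a single decision function. This is a sketch only: the function name is hypothetical, and the default user, role and type used to complete the context string are assumptions, not values taken from this card.

```python
from typing import Dict, List, Optional, Set

REQUIRED_GATES = {"SELinuxMountReadWriteOncePod", "ReadWriteOncePod"}


def selinux_mount_option(feature_gates: Set[str],
                         csi_driver_selinux_mount: bool,
                         pvc_access_modes: List[str],
                         selinux_options: Optional[Dict[str, str]]) -> Optional[str]:
    """Return the mount option to use, or None when the runtime must relabel instead."""
    if not REQUIRED_GATES <= feature_gates:
        return None  # both feature gates must be enabled
    if not csi_driver_selinux_mount:
        return None  # CSIDriver.SELinuxMount must be set to true
    if pvc_access_modes != ["ReadWriteOncePod"]:
        return None  # only ReadWriteOncePod volumes qualify
    if not selinux_options or not selinux_options.get("level"):
        return None  # the pod needs an explicit SELinux context, including level
    context = ":".join([
        selinux_options.get("user", "system_u"),          # assumed default
        selinux_options.get("role", "object_r"),          # assumed default
        selinux_options.get("type", "container_file_t"),  # assumed default
        selinux_options["level"],
    ])
    return f'context="{context}"'
```

When the function returns None, the volume is mounted without a context option and every file is relabeled as before, which is the slow path this work aims to avoid.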

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile STS credential requests for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow
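
To make the flow more concrete, the sketch below renders the kind of AWS shared credentials content that lets an SDK assume a role with a projected service account token. The `role_arn` and `web_identity_token_file` keys are standard AWS profile settings. The function name and the token path in the usage example are hypothetical, and whether CCO produces exactly this content is an assumption.

```python
def render_sts_credentials(role_arn: str, token_path: str) -> str:
    """Render an AWS credentials profile that uses web identity (STS) instead of static keys."""
    if not role_arn.startswith("arn:"):
        raise ValueError(f"not an ARN: {role_arn!r}")
    lines = [
        "[default]",
        f"role_arn = {role_arn}",
        f"web_identity_token_file = {token_path}",
        "",  # trailing newline
    ]
    return "\n".join(lines)
```

For example, `render_sts_credentials("arn:aws:iam::123456789012:role/my-operator", "/var/run/secrets/openshift/serviceaccount/token")` yields a profile with no access key or secret key in it, which is the point of the STS flow: the user pre-creates the IAM role, and the operator only needs the role ARN.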

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which AWS IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialsOperator already provides a powerful API for OpenShift's core cluster operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to non-existent, depending on the operator in question, and is seen as an adoption blocker of OpenShift on AWS.

 

Customer Considerations

This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exist to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in STS mode
    • how to become aware of operators that support STS and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with ROSA
  • this needs to work with self-managed OCP on AWS

Market Problem

This Section: High-Level description of the Market Problem ie: Executive Summary

  • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators in my ROSA STS clusters, while keeping an STS workflow throughout.
  • As a customer of OpenShift on the big cloud providers, overall I expect OpenShift as a platform to function equally well with tokenized cloud auth as it does with "mint-mode" IAM credentials. I expect the same from the Kubernetes Operators under the Red Hat brand (that need to reach cloud APIs) in that tokenized workflows are equally integrated and workable as with "mint-mode" IAM credentials.
  • As the managed services, including Hypershift teams, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the forms of ROSA, ARO, OSD on GCP, Hypershift, etc), the OpenShift platform should have as close as possible, native integration with core platform operators when clusters use tokenized cloud auth, driving the use of layered products.
  • As the Hypershift team, where the only credential mode for clusters/customers is STS (on AWS), the Red Hat branded Operators that must reach the AWS API should be enabled to work with STS credentials in a consistent and automated fashion that allows customers to use those operators as easily as possible, driving the use of layered products.

Why it Matters

  • Adding consistent, automated layered product integrations to OpenShift would provide great added value to OpenShift as a platform, and its downstream offerings in Managed Cloud Services and related offerings.
  • Enabling Kubernetes Operators (at first, Red Hat ones) on OpenShift for the "big3" cloud providers is a key differentiation and security requirement that our customers have been and continue to demand.
  • HyperShift is an STS-only architecture, which means that if our layered offerings via Operators cannot easily work with STS, then it would be blocking us from our broad product adoption goals.

Illustrative User Stories or Scenarios

  1. Main success scenario - high-level user story
    1. customer creates a ROSA STS or Hypershift cluster (AWS)
    2. customer wants basic (table-stakes) features such as AWS EFS or RHODS or Logging
    3. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
    4. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
    5. customer provides a very minimal set of parameters (AWS ARN of role(s) with policy) to the Operator's OperatorHub page
    6. The cluster can automatically set up the Operator, using the provided tokenized credentials and the Operator functions as expected
    7. Cluster and Operator upgrades are taken into account and automated
    8. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
  2. Alternate flow/scenarios - high-level user stories
    1. The same as above, but the ROSA CLI would assist with AWS role/policy management
    2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
  3. ...
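
As a rough illustration of step 5 in the main success scenario, a front end could sanity-check the role ARN a customer supplies before passing it on to the Operator. This is only a sketch under stated assumptions, not how OperatorHub works: the helper name `parse_role_arn` is hypothetical, and the pattern covers only the general IAM role ARN shape (partition, 12-digit account ID, role path and name).

```python
import re
from typing import NamedTuple, Optional

# General shape of an IAM role ARN:
#   arn:<partition>:iam::<12-digit account id>:role/<optional path/><role name>
_ROLE_ARN = re.compile(
    r"^arn:(?P<partition>aws|aws-cn|aws-us-gov):iam::"
    r"(?P<account>\d{12}):role/(?P<name>[\w+=,.@/-]+)$",
    re.ASCII,
)


class RoleArn(NamedTuple):
    partition: str
    account: str
    name: str


def parse_role_arn(arn: str) -> Optional[RoleArn]:
    """Return the parts of an IAM role ARN, or None if the string is malformed.

    Hypothetical helper for illustration; it checks format only and does not
    verify that the role exists or carries the required policy.
    """
    match = _ROLE_ARN.match(arn.strip())
    if match is None:
        return None
    return RoleArn(match["partition"], match["account"], match["name"])
```

A format check like this only catches typing mistakes early; whether the role actually grants the permissions the Operator needs still has to be verified against AWS.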

Expected Outcomes

This Section: Articulates and defines the value proposition from a user's point of view

  • See SDE-1868 as an example of what is needed, including design proposed, for current-day ROSA STS and by extension Hypershift.
  • Further research is required to accommodate the AWS STS equivalent systems of GCP and Azure
  • Order of priority at this time is
    • 1. AWS STS for ROSA and ROSA via HyperShift
    • 2. Microsoft Azure for ARO
    • 3. Google Cloud for OpenShift Dedicated on GCP

Effect

This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.

  • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
  • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
  • Both growth and retention are effects of this effort.
    • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
      • We gain new customers from both those that have waited for token-based auth/auth from OpenShift and from those that are new to OpenShift, with strict requirements around cloud account access
      • We retain customers that are going through both cloud-native and hybrid-cloud journeys that all inevitably see security requirements driving them towards token-based auth/auth.

References

As an engineer, I want the capability to implement CI test cases that run at different intervals, be it daily or weekly, so as to ensure that downstream operators that depend on certain capabilities are not negatively impacted if the systems CCO interacts with change behavior.

Acceptance Criteria:

Create a stubbed out e2e test path in CCO and matching e2e calling code in release such that there exists a path to tests that verify working in an AWS STS workflow.

Feature Overview

Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.  

Goals

  1. Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are: 
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload enabling upgrade (i.e. Router, other components) or infra nodes
  2. Better visibility into any errors during the upgrades and documentation of what the errors mean and how to recover.
  3. A user experience around an end-to-end backup and restore after a failed upgrade
  4. OTA-810  - Better Documentation: 
    • Backup procedures before upgrades. 
    • More control over worker upgrades (with tagged pools between user vs. admin)
    • The kinds of pre-upgrade tests that are run, the errors that are flagged and what they mean and how to address them. 
    • Better explanation of each discrete step in upgrades, and what each CVO Operator is doing and potential errors, troubleshooting and mitigating actions.

References

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Revamp our Upgrade Documentation to include an appropriate level of detail for admins

Why is this important?

  • Currently Admins have nothing which explains to them how upgrades actually work and as a result when things don't go perfectly they panic
  • We do not sufficiently, or at least within context of Upgrade Docs, explain the differences between Degraded and Available statuses
  • We do not explain order of operations
  • We do not explain protections built into the platform which protect against total cluster failure, ie halting when components do not return to healthy state within exp

Scenarios

  1. Move out channel management to its own chapter
  2. Explain or link to existing documentation which addresses the differences between Degraded=True and Available=False
  3. Explain Upgradeable=False conditions and other aspects of upgrade preflight strategy that Operators should be indicating when it's unsafe to upgrade
  4. Explain basics of how the upgrade is applied
    1. CVO fetches release image
    2. CVO updates operators in the following order
    3. Each operator is expected to monitor for success
    4. Provide example ordering of manifests and command to extract release specific manifests and infer the ordering
  5. Explain how operators indicate problems and generic processes for investigating them
  6. Explain the special role of MCO and MCP mechanisms such as pausing pools
  7. Provide some basic guidance for Control Plane duration, that is, excluding worker pool rollout duration (90-120 minutes is normal)
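
For scenario 4, one way to infer an approximate ordering from extracted release manifests is to group the file names by their run-level prefix. The sketch below assumes the manifests have already been extracted to a directory (for example with `oc adm release extract`) and that file names follow a `0000_<runlevel>_<component>_<file>` convention; both the convention as stated here and the sample names are assumptions for illustration, not taken from a real payload.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_run_level(filenames: List[str]) -> Dict[str, List[str]]:
    """Group manifest file names by run level, in lexical order.

    Assumes names of the form 0000_<runlevel>_<component>_<file>; names that
    do not match are collected under the key "unknown".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(filenames):
        parts = name.split("_", 3)
        if len(parts) == 4 and parts[0] == "0000" and parts[1].isdigit():
            groups[parts[1]].append(name)
        else:
            groups["unknown"].append(name)
    return dict(groups)


# Illustrative file names only, not taken from a real release payload.
sample = [
    "0000_50_example-operator_02-deployment.yaml",
    "0000_10_example-config_00-crd.yaml",
    "0000_50_example-operator_00-namespace.yaml",
]
ordered = group_by_run_level(sample)
for level, names in ordered.items():
    print(level, names)
```

Sorting by file name gives a readable approximation only; the documentation should still describe the ordering the CVO actually applies.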

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. There was an effort to write up how to use MachineConfig Pools to partition and optimize worker rollout in https://issues.redhat.com/browse/OTA-375

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The CVO README is currently aimed at CVO devs. But there are way more CVO consumers than there are CVO devs. We should aim the README at "what does the CVO do for my clusters?", and push the dev docs down under docs/dev/.

User Story

As a managed application services developer, I want to install addons, use syncsets, scale nodes and query ingresses, so that I can offer Red Hat OpenShift Streams on Azure.

Acceptance Criteria

  • Create/Delete ARO clusters through api.openshift.com
  • Install OCM addons on ARO clusters through api.openshift.com
  • Create/Update/Delete SyncSets on ARO clusters through api.openshift.com
  • Scale compute nodes on ARO clusters through api.openshift.com
  • Query the cluster DNS through api.openshift.com

Default Done Criteria

  • All existing/affected SOPs have been updated.
  • New SOPs have been written.
  • Internal training has been developed and delivered.
  • The feature has both unit and end to end tests passing in all test pipelines and through upgrades.
  • If the feature requires QE involvement, QE has signed off.
  • The feature exposes metrics necessary to manage it (VALET/RED).
  • The feature has had a security review.
  • Contract impact assessment.
  • Service Definition is updated if needed.
  • Documentation is complete.
  • Product Manager signed off on staging/beta implementation.

Dates

Integration Testing:
Beta:
GA:

Current Status

GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential risk to stakeholders.

References

Links to Gdocs, github, and any other relevant information about this epic.

User Story:

As an ARO customer, I want to be able to:

  • use first-party service principals to authenticate

so that I can

  • use first party resource providers for provisioning

Acceptance Criteria:

Description of criteria:

  • Installer SDKs can auth with the first-party service principal
  • Terraform can auth with the first-party service principal
  • "local" testing of this functionality (we need to set up the ability to try this out)

(optional) Out of Scope:

The installer will not accept a separate service principal to pass to the cluster as described in HIVE-1794. Instead Hive will write the separate cred into the manifests.

Engineering Details:

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement | Notes | isMvp?
Secrets and ConfigMaps can get shared across namespaces | | YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. To provide an experience sufficiently similar to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces, so that cluster admins do not need to copy these entitlements into each namespace, which creates additional operational challenges for updating and refreshing them.

Documentation Considerations

Questions to be addressed:
  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Ensure Shared Resources properly deploys on hypershift based OCP per prior art for items managed by cluster storage operator

Why is this important?

  • In order to promote from tech preview to GA, shared resources need to properly deploy on hypershift

Scenarios

  1. As a developer, I want to consume shared Secrets and ConfigMaps in my workloads so that I can have access to shared credentials and configuration.
  2. As a cluster admin, I want the Insights operator to automatically create a SharedSecret for my cluster's simple content access certificate.
  3. As a cluster admin/SRE, I want OpenShift to use SharedConfigMaps to distribute cluster certificate authorities so that data is not duplicated in ConfigMaps across my cluster.

Acceptance Criteria

  • Pods must have readOnly: true set to use the shared resource CSI Driver - admission should be rejected if this is not set.
  • Documentation updated to reflect this requirement.
  • Users (admins?) are not allowed to create SharedSecrets or SharedConfigMaps with the "openshift-" prefix.

Dependencies (internal and external)

  1. Guidance / review / approval from OCP SMEs in hypershift/storage
  2. Arch review for the enhancement proposal (Apiserver/control plane team)

Previous Work (Optional):

  1. BUILD-293 - Shared Resources tech preview

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Story (Required)

  1. As a developer, I want to consume shared Secrets and ConfigMaps in my workloads so that I can have access to shared credentials and configuration from a GA OCP install even on hypershift
  2. As a cluster admin, I want the Insights operator to automatically create a SharedSecret for my cluster's simple content access certificate from a GA OCP install even on hypershift
  3. As a cluster admin/SRE, I want OpenShift to use SharedConfigMaps to distribute cluster certificate authorities so that data is not duplicated in ConfigMaps across my cluster from a GA OCP install even on hypershift

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>

Background (Required)

https://github.com/openshift/csi-driver-shared-resource-operator/pull/71

https://github.com/openshift/cluster-storage-operator/pull/342

https://github.com/openshift/origin/pull/27730

https://github.com/openshift/release/pull/36433

https://github.com/openshift/cluster-storage-operator/pull/343

https://github.com/openshift/openshift-controller-manager/pull/251

https://redhat-internal.slack.com/archives/C01C8502FMM/p1676472369732279

Currently, it looks like we need to merge the SR operator changes first (they have proven to be benign in non-hypershift) so that we can complete testing in the cluster-storage-operator PR

<Describes the context or background related to this story>

Out of scope

<Defines what is not included in this story>

Approach (Required)

<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

Dependencies

<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

Acceptance Criteria (Mandatory)

<Describe edge cases to consider when implementing the story and defining tests>

<Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated

Legend

Unknown

Verified

Unsatisfied

Done Checklist

  • Code is completed, reviewed, documented and checked in
  • Unit and integration test automation have been delivered and are running cleanly in continuous integration/staging/canary environment
  • Continuous Delivery pipeline(s) is able to proceed with new code included
  • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
  • Acceptance criteria are met

Story (Required)

  1. As a developer, I want to consume shared Secrets and ConfigMaps in my workloads so that I can have access to shared credentials and configuration from a GA OCP install even on hypershift
  2. As a cluster admin, I want the Insights operator to automatically create a SharedSecret for my cluster's simple content access certificate from a GA OCP install even on hypershift
  3. As a cluster admin/SRE, I want OpenShift to use SharedConfigMaps to distribute cluster certificate authorities so that data is not duplicated in ConfigMaps across my cluster from a GA OCP install even on hypershift

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>

Background (Required)

https://github.com/openshift/csi-driver-shared-resource-operator/pull/71

https://github.com/openshift/cluster-storage-operator/pull/342

https://github.com/openshift/origin/pull/27730

https://github.com/openshift/release/pull/36433

https://github.com/openshift/cluster-storage-operator/pull/343

https://github.com/openshift/openshift-controller-manager/pull/251

https://redhat-internal.slack.com/archives/C01C8502FMM/p1676472369732279

Currently, it looks like we need to merge the driver changes first (they have proven to be benign in non-hypershift) so that we can complete testing in the cluster-storage-operator PR

<Describes the context or background related to this story>

Out of scope

<Defines what is not included in this story>

Approach (Required)

<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

Dependencies

<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

Acceptance Criteria (Mandatory)

<Describe edge cases to consider when implementing the story and defining tests>

<Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated

Legend

Unknown

Verified

Unsatisfied

Done Checklist

  • Code is completed, reviewed, documented and checked in
  • Unit and integration test automation have been delivered and are running cleanly in continuous integration/staging/canary environment
  • Continuous Delivery pipeline(s) is able to proceed with new code included
  • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
  • Acceptance criteria are met

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Require volumes that use the Shared Resources CSI driver to specify readOnly: true in order to create the pod
  • Reserve the "openshift-" prefix for SharedSecrets and SharedConfigMaps, such that these resources can only be created by OpenShift operators. We must do this while the driver is tech preview.

Why is this important?

  • readOnly: true must be specified in order for the driver to mount the volume correctly. If this is not set, the volume mount is rejected and the pod will be stuck in a Pending/Initializing state.
  • A validating admission webhook will ensure that the pods won't be created in such a state, improving user experience.
  • OpenShift operators may want/need to create SharedSecrets and SharedConfigMaps so they can be used as system level resources. For example, Insights Operator can automatically create a SharedSecret for the Simple Content Access cert.
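
The readOnly check described above can be sketched as a small validation function over a pod manifest. This is a minimal illustration, not the webhook's actual code: the function names are hypothetical, the decision dictionary is a simplified stand-in for an admission response, and the driver name `csi.sharedresource.openshift.io` is an assumption.

```python
from typing import Any, Dict, List

# Assumed name of the Shared Resource CSI driver.
SHARED_RESOURCE_DRIVER = "csi.sharedresource.openshift.io"


def find_writable_shared_volumes(pod: Dict[str, Any]) -> List[str]:
    """Return names of volumes that use the shared resource driver
    without readOnly: true."""
    offenders = []
    for volume in pod.get("spec", {}).get("volumes", []):
        csi = volume.get("csi")
        if (
            csi
            and csi.get("driver") == SHARED_RESOURCE_DRIVER
            and csi.get("readOnly") is not True
        ):
            offenders.append(volume.get("name", "<unnamed>"))
    return offenders


def admit(pod: Dict[str, Any]) -> Dict[str, Any]:
    """Build a simplified allow/deny decision for a pod (hypothetical shape)."""
    offenders = find_writable_shared_volumes(pod)
    if offenders:
        return {
            "allowed": False,
            "message": "readOnly: true is required for volumes: "
            + ", ".join(offenders),
        }
    return {"allowed": True}
```

Rejecting the pod at admission time gives the user an immediate error message instead of a pod stuck in a Pending/Initializing state.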

Scenarios

  1. As a developer, I want to consume shared Secrets and ConfigMaps in my workloads so that I can have access to shared credentials and configuration.
  2. As a cluster admin, I want the Insights operator to automatically create a SharedSecret for my cluster's simple content access certificate.
  3. As a cluster admin/SRE, I want OpenShift to use SharedConfigMaps to distribute cluster certificate authorities so that data is not duplicated in ConfigMaps across my cluster.

Acceptance Criteria

  • Pods must have readOnly: true set to use the shared resource CSI Driver - admission should be rejected if this is not set.
  • Documentation updated to reflect this requirement.
  • Users (admins?) are not allowed to create SharedSecrets or SharedConfigMaps with the "openshift-" prefix.

Dependencies (internal and external)

  1. ART - to create payload image for the webhook
  2. Arch review for the enhancement proposal (Apiserver/control plane team)

Previous Work (Optional):

  1. BUILD-293 - Shared Resources tech preview

Open questions::

  1. From email exchange with David Eads:  "Thinking ahead to how we'd like to use this in builds once we're GA, are we likely to choose openshift-etc-pki-entitlement as one of our well-known names?  If we do, what sort of validation (if any) would we like to provide on the backing secret and does that require any new infrastructure?"

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a user of an OpenShift Cluster, I should be restricted from creating Shared Secrets and ConfigMaps with the "openshift-" prefix unless my access level is cluster-admin or above

Acceptance Criteria

  • Non-cluster admins should NOT be able to create Shared Secrets and ConfigMaps with the "openshift-" prefix
  • Cluster admins should be able to create Shared Secrets and ConfigMaps with the "openshift-" prefix.
  • Integration testing to verify behavior
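
The acceptance criteria above reduce to a simple rule, sketched here for clarity. The function name and the boolean `is_cluster_admin` input are hypothetical; how the real check determines the caller's access level is not described in this story.

```python
RESERVED_PREFIX = "openshift-"


def may_create_shared_resource(name: str, is_cluster_admin: bool) -> bool:
    """Return True if the caller may create a Shared Secret or ConfigMap
    with this name.

    Hypothetical helper: names with the reserved prefix are allowed only
    for cluster admins; all other names are allowed for everyone.
    """
    return is_cluster_admin or not name.startswith(RESERVED_PREFIX)
```

An integration test for this behavior would cover at least three cases: a reserved name as a regular user (rejected), a reserved name as a cluster admin (allowed), and an unreserved name as a regular user (allowed).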

QE Impact

  • Behavior will need to be verified

Docs Impact

  • Docs will need to reflect this new behavior

PX Impact

  • None

Notes

  • Creating namespaces with the "openshift-" prefix is already restricted. That code could be used as a precedent for this.

Goals

Track goals/requirements for self-managed GA of Hosted control planes on AWS using the AWS Provider.

  • AWS flow via the AWS provider is documented. 
    • Make sure the documentation with HyperShiftDeployment is removed.
    • Make sure the documentation uses the new flow without HyperShiftDeployment 
  • HyperShift has a UI wizard with ACM/MCE for AWS. 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Overview

Today, the upstream and more complete HyperShift documentation lives on https://hypershift-docs.netlify.app/.

However, product documentation today lives under https://access.redhat.com/login?redirectTo=https%3A%2F%2Faccess.redhat.com%2Fdocumentation%2Fen-us%2Fred_hat_advanced_cluster_management_for_kubernetes%2F2.6%2Fhtml%2Fmulticluster_engine%2Fmulticluster_engine_overview%23hosted-control-planes-intro

Goal

The goal of this Epic is to extract important docs and establish parity between what's documented and possible upstream and product documentation.

 

Multiple consumers have not realised a newer version of a CPO (spec.release) is not guaranteed to work with an older HO.

This is stated here https://hypershift-docs.netlify.app/reference/versioning-support/

but empirical evidence, such as the OCM integration, tells us this is not enough.

We already deploy a ConfigMap (CM) in the HO namespace with the HC supported versions.

Additionally, we can add an image label with the latest HC version supported by the operator so you can quickly docker inspect...
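
A consumer could use the supported-versions data to check a requested release before creating or upgrading a hosted cluster. The sketch below is an assumption-heavy illustration: the ConfigMap key `supported-versions` and its JSON layout (`{"versions": [...]}`) are hypothetical placeholders, and the function names are invented for this example.

```python
import json
from typing import Dict


def minor_version(release: str) -> str:
    """Reduce a version such as '4.13.2' to its 'major.minor' form."""
    major, minor, *_ = release.split(".")
    return f"{major}.{minor}"


def is_supported(release: str, configmap_data: Dict[str, str]) -> bool:
    """Check a release version against the versions listed in ConfigMap data.

    The key name and JSON layout used here are hypothetical.
    """
    payload = json.loads(configmap_data["supported-versions"])
    return minor_version(release) in payload["versions"]


# Hypothetical ConfigMap data, for illustration only.
data = {"supported-versions": json.dumps({"versions": ["4.13", "4.12"]})}
print(is_supported("4.13.2", data))
print(is_supported("4.14.0", data))
```

The same comparison would work against an image label, which avoids needing cluster access to read the ConfigMap.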

Goal:
As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.  

 Description:
We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release.  This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.   

For OpenShift 4.13, this means bumping to 2.6.  

As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.

 

We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release.  This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.   

For OpenShift 4.14, this means bumping to 2.6.  

Feature Overview (aka. Goal Summary)  

The Assisted Installer is used to help streamline and improve the install experience of OpenShift UPI. Given the install footprint of OpenShift on IBM Power and IBM zSystems, we would like to bring the Assisted Installer experience to those platforms and ease the installation experience.

 

Goals (aka. expected user outcomes)

Full support of the Assisted Installer for use by IBM Power and IBM zSystems

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

As a multi-arch development engineer, I would like to evaluate if the assisted installer is a good fit for simplifying UPI deployments on Power and Z.

Acceptance Criteria

  • Evaluation report of market opportunity/impact by P&Z offering managers
  • Stories filed for delivering Assisted Installer.
  • Do we need tests every release? Every other? Only major releases?
  • Do we test a full installation every time, or just the points where architecture is relevant (generating ISOs for example)
  • Dual-stack testing?

For the assisted installer, the nodes boot from CD-ROM (ISO) or from the network (netboot). After RHCOS is installed to the target disk, that disk needs to be set as the boot device.

  • Set feature support levels for things like ODF and disk encryption in the UI for ppc64le and s390x
  • Might need flag to say only KVM is supported, no z/VM, if that is the case for 4.13

This epic contains all the OLM related stories for OCP release-4.13

Epic Goal

  • Track all the stories under a single epic

Description/Acceptance Criteria:

  • Add RBAC for the console-operator so it can GET/LIST/WATCH OLMConfig  cluster config. The RBAC should be added to console-operator cluster-role rules 
  • The console operator should watch the spec.features.disableCopiedCSVs property of the OLM cluster config. When this property is true, the operator should update the "clusterInfo.copiedCSVsDisabled" field of the console-config accordingly and roll out a new version of the console.
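
The second criterion can be read as a pure mapping from the OLM cluster config to the console config. The following is a minimal sketch in Python, not the console operator's actual code; the function name and the dict shapes are assumptions made for illustration, and only the `spec.features.disableCopiedCSVs` and `clusterInfo.copiedCSVsDisabled` field names come from the text above.

```python
import copy


def sync_copied_csvs_flag(olm_config, console_config):
    """Return (updated console config, whether a new rollout is needed).

    Hypothetical helper: mirrors spec.features.disableCopiedCSVs from the
    OLM cluster config into clusterInfo.copiedCSVsDisabled.
    """
    # A missing field is treated as False (an assumption of this sketch).
    disabled = bool(
        olm_config.get("spec", {}).get("features", {}).get("disableCopiedCSVs", False)
    )
    updated = copy.deepcopy(console_config)
    cluster_info = updated.setdefault("clusterInfo", {})
    # Only a real change in the flag should trigger a console rollout.
    changed = cluster_info.get("copiedCSVsDisabled", False) != disabled
    cluster_info["copiedCSVsDisabled"] = disabled
    return updated, changed
```

Returning the "changed" flag separately keeps the decision to roll out a new console version independent from the config update itself.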

Problem

As an Operator author, I want to be able to specify where my Operators run (on infra, master, or worker nodes) so my end-users can easily install them through OperatorHub in the console without special setups.

Acceptance Criteria

  • Operators can assign a Namespace object template in YAML to the provided `operatorframework.io/suggested-namespace-template` CSV annotation to specify how the suggested namespace is being created by the console during the installation.
  • During the installation, the UI will:
    • populate the "Installed Namespace" dropdown using the `metadata.name` field in the attached Namespace YAML manifest
    • create the namespace object using the Namespace YAML manifest assigned to the `operatorframework.io/suggested-namespace-template` CSV annotation.
  • If end-users change the "Installed Namespace" dropdown to another namespace, the UI shows warning messages so users know the user-selected namespace might not have all the correct setup recommended by the Operator and it might not run correctly/successfully.
  • If both the `suggested-namespace` and `suggested-namespace-template` annotations are present in the CSV, `suggested-namespace-template` should take precedence.

Details

The console adds support to take the value field of a CSV annotation as the Namespace YAML template to create a Namespace object for installing the Operator.

CSV Annotation Example

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: ""
  name: my-operator-namespace
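
The precedence rule from the acceptance criteria can be sketched as follows. This is an illustrative Python snippet, not the console's implementation; it assumes the annotation value is JSON-encoded (JSON being a subset of YAML) so that it can be parsed with the standard library, and the function name is hypothetical.

```python
import json

TEMPLATE_KEY = "operatorframework.io/suggested-namespace-template"
NAME_KEY = "operatorframework.io/suggested-namespace"


def suggested_namespace(annotations):
    """Return the Namespace manifest the UI would create, or None."""
    template = annotations.get(TEMPLATE_KEY)
    if template:
        # The template takes precedence over the plain suggested name.
        return json.loads(template)
    name = annotations.get(NAME_KEY)
    if name:
        # Fall back to a bare Namespace built from the suggested name.
        return {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": name}}
    return None
```

The `metadata.name` of the returned manifest is what would populate the "Installed Namespace" dropdown.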

Key Objective
Providing our customers with a single, simplified user experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of everything from managing the fleet to deep diving into a single cluster.
Why customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 2 Goal: Productization of the united Console 

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Clusters” Option. “All Clusters” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM Installed the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

We need a way to show metrics for workloads running on spoke clusters. This depends on ACM-876, which lets the console discover the monitoring endpoints.

  • Console operator must discover the external URLs for monitoring
  • Console operator must pass the URLs and CA files as part of the cluster config to the console backend
  • Console backend must set up proxies for each endpoint (as it does for the API server endpoints)
  • Console frontend must include the cluster in metrics requests

Open Issues:

We will depend on ACM to create a route on each spoke cluster for the prometheus tenancy service, which is required for metrics for normal users.

 

The OpenShift console backend should proxy managed cluster monitoring requests through the MCE cluster proxy addon to Prometheus services on the managed cluster. This depends on https://issues.redhat.com/browse/ACM-1188

 

This epic contains all the Dynamic Plugins related stories for OCP release-4.13

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

The console only displays the list of `Pending` or `Failed` plugins in the Cluster Overview Dynamic Plugin Status card item popover when one or more dynamic plugins have a status of `Pending` or `Failed` (added in https://github.com/openshift/console/pull/11664).

https://issues.redhat.com/browse/HAC-1615 will add additional information regarding why a plugin has a `Failed` status.

The aforementioned popover should be updated to include this additional `Failed` information once it is available.

Additionally, the `Console plugins` tab (e.g., k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins) only displays a list of `ConsolePlugin` resources augmented with `Version` and `Description` data culled from `Loaded` dynamic plugins. This page should also show dynamic plugins with a `status` of `Pending` or `Failed`, either through the addition of a `status` column or additional tables (design TBD with UXD). The aforementioned additional information regarding why a plugin has `Failed` should be added to the page as well.

 

Acceptance Criteria: Update the popup on the dashboard and update notification drawer with failure reason.

Feature Goal: Unify the management of cluster ingress with a common, open, expressive, and extensible API.

Why is this Important? Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.

The pluggable nature of the implementation of Gateway API enables support for additional and optional 3rd-party Ingress technologies.

Functional Requirements

  • Add support for Istio as a Gateway API implementation.
    • NE-1105 Management by an operator (possibly cluster-ingress-operator, OSSM operator, or a new operator)
    • Feature parity with OpenShift Router, where appropriate.
      • NE-1096    Provide a solution to support re-encrypt in Gateway API
      • NE-1097    Provide a solution to support passthrough in Gateway API
      • NE-1098    Research and select OSSM Istio image that provides enough features
    • Performance parity evaluation of Envoy and HAProxy.
    • NE-1102    Add oc command line support for Gateway API objects
    • NE-1103    Evaluate idling support for Gateway API
  • Avoid conflict with partner solutions (such as F5). 
    • Provide a solution that partners could integrate with (reduce dependencies on Istio by assuming plugins)
  • Avoid conflict with integrations (such as GKE) for hybrid cloud use cases.
  • NE-1106 Advanced routing capabilities currently unavailable in OCP.
    • More powerful path-based routing.
    • Header-based routing
    • Traffic mirroring
    • Traffic splitting (single and multi cluster)
    • Other features, based on time constraints
      • NE-1000 Understand Gateway API listener collapsing and how Istio Gateway implements
      • NE-1016 Investigate and document External DNS integration with Gateway API
      • Non-HTTP types of traffic (arbitrary TCP/UDP).
         
         
  • Add Gateway API support with OSSM service mesh.
    • Avoid conflict between Istio for ingress use-cases and Istio for mesh use-cases.
    • NE-1074 and NE-1095 Enable a unified control plane for ingress and mesh. 
    • NE-1035 Determine what OSSM release (based on what Istio release)...
  • Add Gateway API support for serverless.

Non-Functional Requirements:

  • NE-1034 Installation
  • NE-1110 Documentation
  • Release technical enablement
  • OCP CI integration
  • Continued upstream development to mature Gateway API and Istio support for the same.

Open Questions:

  • Integration with HAProxy?
  • Gateway is more than Ingress 2.0, how do we align with other platform components such as serverless and service mesh to ensure we're providing a complete solution?

Documentation Considerations:

  • Explain the resource model
  • Explain roles and how they align to Gateway API resources
  • Explain the extension points and provide extension point examples.
  • Xref upstream docs.

User Story: As a cluster admin, I want to create a gatewayclass and a gateway, and OpenShift should configure Istio/Envoy with an LB and DNS, so that traffic can reach httproutes attached to the gateway.

The operator will be one of these (or some combination):

  • cluster-ingress-operator
  • OSSM operator
  • a new operator

Functionality includes DNS (NE-1107), LoadBalancer (NE-1108), and other operations formerly performed by the cluster-ingress-operator for routers.

  • configures GWAPI subcomponents
    • Installs GWAPI Gateway CRD
  • installs Istio (if needed) when Gateway and GatewayClasses are created

Requires design document or enhancement proposal, breakdown into more specific stories.

(probably needs to be an Epic, will move things around later to accommodate that).

 

Out of scope for enhanced dev preview:

  • Unified Control Plane operations (NE-1095)
  • Installs RBAC that restricts who can configure Gateway and GatewayClasses 

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt in their namespaces to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 
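
The two synchronization modes described above can be sketched as a small function. This is an illustration only, not the actual label synchronization controller; the function name and the `enforce_mode` flag are hypothetical, while the `pod-security.kubernetes.io/` label prefix is the upstream Pod Security Admission convention.

```python
PSA_PREFIX = "pod-security.kubernetes.io/"


def synced_labels(profile, enforce_mode):
    """Labels the synchronization mechanism would set on a namespace.

    enforce_mode=False models the behaviour introduced with 4.11
    ("warn" and "audit" only); enforce_mode=True models the planned
    switch to synchronizing "enforce" instead.
    """
    modes = ("enforce",) if enforce_mode else ("warn", "audit")
    return {PSA_PREFIX + mode: profile for mode in modes}
```

For example, `synced_labels("restricted", False)` yields the "warn" and "audit" labels, and `synced_labels("restricted", True)` yields only the "enforce" label.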

Epic Goal

Get Pod Security admission to run in "restricted" mode globally by default alongside SCC admission.

To be broken into one feature epic and a spike:

  • feature: error type disambiguation and error propagation into operator status
  • spike: general improvement on making errors more actionable for the end user

 

The MCO today has multiple layers of errors. There are generally speaking 4 locations where an error message can appear, from highest to lowest:

  1. The MCO operator status
  2. The MCPool status
  3. The MCController/Daemon pod logs
  4. The journal logs on the node

 

The error propagation is generally speaking not 1-to-1. The operator status will generally capture the pool status, but the full error from Controller/Daemon does not fully bubble up to pool/operator, and the journal logs with error generally don’t get bubbled up at all. This is very confusing for customers/admins working with the MCO without full understanding of the MCO’s internal mechanics:

  1. The real error is hard to find
  2. The error message is often generic and ambiguous
  3. The solution/workaround is not clear at all

 

Using “unexpected on-disk state” as an example, this can be caused by any number of the following:

  1. An incomplete update happened, and something rebooted the node
  2. The node upgrade was successful until rpm-ostree, which failed and atomically rolled back
  3. The user modified something manually
  4. Another operator modified something manually
  5. Some other service/network manager overwrote something MCO writes

Etc. etc.

 

Since error use cases are wide and varied, there are many improvements we can perform for each individual error state. This epic aims to propose targeted improvements to error messaging and propagation specifically. The goals being:

 

  1. Disambiguating different error cases with the same message
  2. Adding more error catching, including journal logs and rpm-ostree errors
  3. Propagating full error messages further up the stack, up to the operator status in a clear manner
  4. Adding actionable fix/information messages alongside the error message

 

With a side objective of observability, including reporting all the way to the operator status items such as:

  1. Reporting the status of all pools
  2. Pointing out current status of update/upgrade per pool
  3. What the update/upgrade is blocking on
  4. How to unblock the upgrade

Approaches can include:

  1. Better error messaging starting with common error cases
  2. Disambiguating config mismatch
  3. Capturing rpm-ostree logs from previous boot, in case of osimageurl mismatch errors
  4. Capturing full daemon error message back to pool/operator status
  5. Adding a new field to the MCO operator spec, that attempts to suggest fixes or where to look next, when an error occurs
  6. Adding better alerting messages for MCO errors
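
As a rough illustration of goals 3 and 4 (propagating the full error up the stack and attaching an actionable message), the sketch below chains errors from the lowest layer to the highest and flattens them into a single status string. It is written in Python for brevity and is not MCO code; the `LayeredError` type, the layer names, and the message format are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayeredError:
    layer: str                              # e.g. "journal", "daemon", "pool"
    message: str
    hint: Optional[str] = None              # actionable suggestion, if any
    cause: Optional["LayeredError"] = None  # next lower layer


def operator_status_message(err):
    """Flatten the whole chain so the lowest-level error is not lost."""
    parts, hints = [], []
    current = err
    while current is not None:
        parts.append(f"{current.layer}: {current.message}")
        if current.hint:
            hints.append(current.hint)
        current = current.cause
    text = " <- ".join(parts)
    if hints:
        text += " | suggested action: " + "; ".join(hints)
    return text
```

With this shape, a pool-level "degraded" status would still carry the daemon and journal messages beneath it instead of a generic summary.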

  • Options

 

As of OCP 4.11, the MCD no longer does draining, and instead the MCC does the drains and knows about the state/failures.

The MCD still alerts, since that flow was unchanged, but the alert message was changed to look at MCC instead.

It probably makes more sense to move/change the alert entirely such that the MCC's drain controller is managing it, since the MCD no longer drains.

Also, both the MCC and MCD today error independently with some issues in terms of timing. We should also revisit how those errors propagate to the pool/CO status alongside the error

This would need to happen before https://issues.redhat.com/browse/MCO-88

Feature Overview

The agent-based installer requires manually booting the generated ISO on the target nodes. Support for PXE booting will allow customers to automate their installations via their DHCP/PXE infrastructure.

This feature allows generating installation ISOs ready to add to a customer-provided DHCP/PXE infrastructure.

Goals

As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand

Why is this important?

We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.

Epic Goal

As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand

Why is this important?

We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Only parts of the epic AGENT-356 have landed (in particular, AGENT-438). We shouldn't ship it in a release in its current state due to lack of testing, as well as missing features like iPXE support (AGENT-491). At the moment, it is likely the PXE artifacts don't work at all because AGENT-510 is not implemented.

The agent create pxe-files subcommand should be disabled until the whole Epic is completed in a release.

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

  • Simplicity: The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
  • Minimize Deployment Time: A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster at the Far Edge site needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the far edge site should be on the order of minutes, ideally less than 20m.
  • Utilize Telco Facilities: Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They also would like to validate and update servers in these pre-installed server pools, as needed.
  • Validation before Shipment: Telecommunications providers incur a large cost if forced to manage software failures at the Far Edge, due to the scale and physically dispersed nature of the use case. They want to be able to validate the OCP and CNF software as a last-minute sanity check before shipping the platform to the Far Edge site.
  • IPSec Support at Cluster Boot: Some far edge deployments occur on an insecure network, and for that reason access to the host’s BMC is not allowed. Additionally, an IPSec tunnel must be established before any traffic leaves the cluster once it's at the Far Edge site. It is not possible to enable IPSec on the BMC NIC, and therefore even after OpenShift has booted the BMC is still not accessible.

Requirements

  • Factory Depot: Install OCP with minimal steps
    • Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
    • Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
    • The installation cannot increase in-service OCP compute budget (don't install anything other than what is needed for DU)
    • Provide ability to validate previously installed OCP nodes
    • Provide ability to update previously installed OCP nodes
    • 100 parallel installations at Service Depot
  • Far Edge: Deploy OCP with minimal steps
    • Provide site specific information via usb/file mount or simple interface
    • Minimize time spent at far edge site by technician/barista/installer
    • Register with desired RHACM Hub cluster for ongoing LCM
  • Minimal ongoing maintenance of solution
    • Some, but not all telco operators, do not want to install and maintain an OCP / ACM cluster at Service Depot
  • The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

 

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

requirement Notes isMvp?
     
     
     

 

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

 

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

 

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question Outcome
   

 

 

Epic Goal

  • Install SNO within 10 minutes

Why is this important?

  • SNO installation takes around 40+ minutes.
  • This makes SNO less appealing when compared to k3s/microshift.
  • We should analyze the SNO installation, figure out why it takes so long, and come up with ways to optimize it

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. https://docs.google.com/document/d/1ULmKBzfT7MibbTS6Sy3cNtjqDX1o7Q0Rek3tAe1LSGA/edit?usp=sharing

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This delay is caused by two issues:
1. The waitForPredicate utility in assisted-installer-controller first waits for the given interval and only then attempts the predicate.

2. The operatorsMonitor performs one extra iteration after it has already updated the assisted-service that all operators are available.
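
The first issue comes down to the order of "check" and "sleep" in a polling loop. The sketch below shows a check-first variant in Python; it is an illustration of the idea only, not the assisted-installer-controller code, and the function signature (including the injectable `sleep` and `clock` parameters) is an assumption made so the behaviour is easy to exercise.

```python
import time


def wait_for_predicate(predicate, interval, timeout,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until predicate() is true or the timeout elapses.

    The predicate is checked immediately; the loop only sleeps
    between attempts, so an already-true predicate costs no delay.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a wait-first loop, every call pays at least one full interval even when the condition already holds; checking first removes that fixed cost.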

Description of problem:

When installing SNO with bootstrap in place CVO hangs for 6 minutes waiting for the lease

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Run the POC using the makefile here https://github.com/eranco74/bootstrap-in-place-poc
2. Observe the CVO logs post reboot
3.

Actual results:

I0102 09:45:53.131061       1 leaderelection.go:248] attempting to acquire leader lease openshift-cluster-version/version...
I0102 09:51:37.219685       1 leaderelection.go:258] successfully acquired lease openshift-cluster-version/version
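
The wait can be read directly off the two timestamps. A small Python sketch, assuming the second whitespace-separated field of each log line is the time of day and that both lines are from the same day; the helper name is hypothetical.

```python
from datetime import datetime


def klog_time(line):
    """Parse the HH:MM:SS.ffffff field of a klog-style log line."""
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")


start = klog_time(
    "I0102 09:45:53.131061       1 leaderelection.go:248] "
    "attempting to acquire leader lease openshift-cluster-version/version..."
)
end = klog_time(
    "I0102 09:51:37.219685       1 leaderelection.go:258] "
    "successfully acquired lease openshift-cluster-version/version"
)
# Roughly 344 seconds, i.e. about 5 minutes 44 seconds spent waiting.
wait_seconds = (end - start).total_seconds()
```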

Expected results:

Expected the bootstrap CVO to release the lease so that the CVO running post reboot won't have to wait the lease duration  

Additional info:

POC (hack) that removes the lease and allows CVO to start immediately:
https://github.com/openshift/installer/pull/6757/files#diff-f12fbadd10845e6dab2999e8a3828ba57176db10240695c62d8d177a077c7161R38-R48
  
Slack thread:
https://redhat-internal.slack.com/archives/C04HSKR4Y1X/p1673345953183709

Description of problem:

When installing SNO, it takes CVO about 2 minutes from the time all cluster operators are available (progressing=false and degraded=false) to set the clusterversion status to Available=true

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Install SNO with bootstrap in place (https://github.com/eranco74/bootstrap-in-place-poc)
2. Monitor the cluster operators status and the clusterversion
3.

Actual results:

All cluster operators are available about 2 minutes before CVO sets the cluster version to available=true

Expected results:

expected it to sync faster

Additional info:

attached must-gather logs
and full audit log

Description of problem:

As part of the efforts of improving the installation time of single node openshift, we've noticed the monitoring operator takes a long* time to finish installation.

It's hard for me to tell what exactly the monitoring operator is waiting for, but it becoming happy (as far as clusteroperator conditions are concerned) always seems to coincide with the operator finally realizing and reconciling** the 2 additional certificates inside the extension-apiserver-authentication that are being added by the apiserver operator. 

Usually this "realization" happens minutes after the two certs are being added, and ideally we'd like to cut back on that time, because sometimes those minutes lead to the monitoring operator being the last to roll out.

*Long time on the order of just a few minutes, which is not a lot, but they add up. This ticket is one in a series of tickets we're opening for many other components

**The "marker" I use to know when this happened is when the monitoring operator, among other things, replaces the old prometheus-adapter-<hash_x> secret containing just the original certs of extension-apiserver-authentication with a new prometheus-adapter-<hash_y> which also contains the 2 new certs

Version-Release number of selected component (if applicable):

nightly 4.13 OCP

How reproducible:

100%

Steps to Reproduce:

1. Install single-node-openshift

Actual results:

Monitoring operator long delay reconciling extension-apiserver-authentication

Expected results:

Monitoring operator immediate reconciliation of extension-apiserver-authentication

Additional info:

Originally I suspected this might be due to api server downtime (which is a property of SNO), but this issue doesn't seem to correlate with API downtime

Feature Overview

To give Telco Far Edge customers as much of the product support lifespan as possible, we need to ensure that OCP releases are "telco ready" when the OCP release is GA.

Goals

  • All Telco Far Edge regression tests pass prior to OCP GA
  • All new features that are TP or GA quality at the time of the release pass validation prior to OCP GA
  • Ensure Telco Far Edge KPIs are met prior to OCP GA

Requirements

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • SNO DU 
  • C-RAN Hub of DUs on compact cluster or traditional cluster
  • CU on compact cluster or traditional cluster

Questions to answer…

  • What are the scale goals?
  • How many nodes must be provisioned simultaneously?

Out of Scope

  • N/A

Background, and strategic fit

Notes

Assumptions

  •  

Customer Considerations

  • ...

Documentation Considerations

No documentation required

Now that we have a realtime kernel running as a blocking job, we need to get better CI signals to help identify realtime kernel errors early.

 

RT Lane Steps

https://steps.ci.openshift.org/workflow/openshift-upgrade-gcp-ovn-rt

Add the `rt-tests` package to our openshift-tests image to use during e2e tests when running on workers with realtime kernel enabled.

The point of this Epic is to look at failing or flaky tests in the single node e2e jobs with a focus on 4.10 and 4.11 releases. Any identified solutions that can be back ported further should be kept in this epic for easier tracking.

Targeted Jobs:
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-single-node
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node

4.11 Sippy Results
4.10 Sippy Results

The overarching error that this shows up as in Prow is "[sig-arch] events should not repeat pathologically". According to CI Search and Sippy, this error occurs a lot for single node. Digging a bit deeper, this is one of the events that repeats:

ns/openshift-etcd-operator deployment/etcd-operator - reason/ScriptControllerErrorUpdatingStatus

Links:
CI Search
CI Chart

Feature Goal

  • Definition of a CU Profile
  • Deployment of the CU profile on multi-node Bare Metal clusters using the RH declarative framework.

Why is this important?

  • Telcos will want minimal hands-on installs of all infrastructure.

Requirements

  1. CU infrastructure deployment and life-cycle management must be performed through the ZTP workflow toolset (SiteConfig, PolicyGen, ACM and ArgoCD)
  2. Performance tuning:
    • Non-RT kernel
    • Huge pages set per NUMA
  3. Day 2 operators:
    • SR-IOV network operator and sample configuration
    • OCS / ODF sample configuration, highly available storage
    • Cluster logging operator and sample configuration
  4. Additional features
    • Disk encryption (which?)
    • SCTP
    • NTP time synchronization
    • IPV4, IPV6 and dual stack options

Scenarios

  1. CU on a Three Node Cluster - zero touch provisioning and configuration
  2. CU can be on SNO, SNO+1 worker or MNO (up to 30 nodes)
  3. Cluster expansion
  4. y-stream and z-stream upgrade
  5. in-service upgrade (progressively update cluster)
  6. EUS to EUS upgrade

Acceptance Criteria

  • Reference configurations released as part of ZTP
  • Scenarios validated and automated (Reference Design Specification)
  • Lifecycle scenarios are measured and optimized
  • Documentation completed

Open questions::

  1. What kind of disk encryption is required?
  2. Should any work be done on ZTP cluster expansion?
  3. What KPIs must be met? CaaS CPU/RAM/disk budget KPIs/targets? Overall upgrade time, cluster downtime, number of reboot per node type targets? oslat/etc targets?

References:

  1. RAN DU/CU Requirements Matrix
  2. CU baseline profile 2020
  3. CU profile - requirements
  4. Nokia blueprints

https://docs.google.com/document/d/13Db7uChVx-2JXqAMJMexzHbhG3XLNLRy9nZ_7g9WbFU/edit#

Epic Goal

  • Enable setting node labels on the spoke cluster during installation
  • Right now we need to add roles; we still need to check whether additional labels are required

Why is this important?

Scenarios

  1. A ZTP flow user would like to mark nodes with additional roles, such as rt or storage, in addition to the master/worker roles that are supported by default today

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Open questions::

  1. How do master/worker roles get to the nodes? Maybe we can use the same flow.
  2. Do we need to support only roles, or labels in general?
  3. Another alternative is to use https://github.com/openshift/assisted-service/blob/d1cde6d398a3574bda6ce356411cba93c74e1964/swagger.yaml#L4071; a caveat is that this will work only for day 1
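
As a minimal sketch of the roles-versus-labels question: Kubernetes conventionally expresses a node role as a label with the `node-role.kubernetes.io/` prefix and an empty value, so extra roles can be treated as a special case of general labels. The helper name `role_labels` and the idea of merging it with free-form labels are assumptions for illustration, not the assisted-service design.

```python
# Conventional Kubernetes prefix for node role labels.
ROLE_LABEL_PREFIX = "node-role.kubernetes.io/"


def role_labels(roles):
    """Turn role names into node labels (hypothetical helper)."""
    return {ROLE_LABEL_PREFIX + role: "" for role in roles}


# Roles taken from the scenario above, merged with an arbitrary extra label.
labels = role_labels(["worker", "rt", "storage"])
labels.update({"example.com/rack": "r1"})  # free-form label, illustrative only
```

Supporting general labels covers the role case as well, which is one argument for the broader option in question 2.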

Feature Overview

Reduce the OpenShift platform and associated RH provided components to a single physical core on Intel Sapphire Rapids platform for vDU deployments on SingleNode OpenShift.

Goals

  • Reduce CaaS platform compute needs so that it can fit within a single physical core with Hyperthreading enabled. (i.e. 2 CPUs)
  • Ensure existing DU Profile components fit within reduced compute budget.
  • Ensure existing ZTP, TALM, Observability and ACM functionality is not affected.
  • Ensure largest partner vDU can run on Single Core OCP.

Requirements

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES
Provide a mechanism to tune the platform to use only one physical core. | Users need to be able to tune different platforms. | YES
Allow for full zero touch provisioning of a node with the minimal core budget configuration. | Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. | YES
Platform meets all MVP KPIs | | YES

(Optional) Use Cases

  • Main success scenario: A telecommunications provider uses ZTP to provision a vDU workload on a Single Node OpenShift instance running on an Intel Sapphire Rapids platform. The SNO is managed by an ACM instance and its lifecycle is managed by TALM.

Questions to answer...

  • N/A

Out of Scope

  • Core budget reduction on the Remote Worker Node deployment model.

Background, and strategic fit

Assumptions

  • More compute power available for RAN workloads translates directly into a greater volume of cell coverage that a Far Edge node can support.
  • Telecommunications providers want to maximize the cell coverage on Far Edge nodes.
  • To provide as much compute power as possible the OpenShift platform must use as little compute power as possible.
  • As newer generations of servers are deployed at the Far Edge and the core count increases, no additional cores will be given to the platform for basic operation, all resources will be given to the workloads.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
    • Administrators must know how to tune their Far Edge nodes to make them as computationally efficient as possible.
  • Does this feature have doc impact?
    • Possibly, there should be documentation describing how to tune the Far Edge node such that the platform uses as little compute power as possible.
  • New Content, Updates to existing content, Release Note, or No Doc Impact
    • Probably updates to existing content
  • If unsure and no Technical Writer is available, please contact Content Strategy. What concepts do customers need to understand to be successful in [action]?
    • Performance Addon Operator, tuned, MCO, Performance Profile Creator
  • How do we expect customers will use the feature? For what purpose(s)?
    • Customers will use the Performance Profile Creator to tune their Far Edge nodes. They will use RHACM (ZTP) to provision a Far Edge Single-Node OpenShift deployment with the appropriate Performance Profile.
  • What reference material might a customer want/need to complete [action]?
    • Performance Addon Operator, Performance Profile Creator
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
    • N/A
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
    • Likely updates to existing content / unsure

Proposed title of this feature request

node-exporter collector customizations

What is the nature and description of the request?

We've had several requests and efforts in the past to change, add or drop certain collectors in our node-exporter statefulset. This feature request intends to subsume the individual requests and channel these efforts into a cohesive feature that will serve the original requests and ideally address future needs.

We are going to add a section for Node Exporter into the CMO config. Each customizable collector has its dedicated subsection containing its own parameters under the NodeExporter section. The config map looks like the example below.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # disable a collector which is enabled by default
        netclass: 
          enabled: false
        # enable a collector which isn't enabled by default but supported by CMO.
        hwmon: 
          enabled: true
        # tweak the configuration of a collector
        netdev: 
          enabled: true
          ignoredDevices: 
            - br-.+
            - veth.*
          fooBar: xxx # unrecognized parameter will be ignored.
        # trying to disable unsupported collectors or turning off mandatory collectors will be ineffective too
        # as the keys wouldn't exist in the node exporter config struct.
        drbd: {enabled: true} # unsupported collectors will be ignored and logged as a warning
        cpu: {enabled: false} # necessary collectors will not turn off and this invalid config is logged as a warning.

No web UI changes are required; all settings are saved in the CMO configmap.

Users should be informed of the consequences of deactivating certain collectors, such as missing metrics for alerts and dashboards.
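
The rules stated in the example config comments (unsupported collectors are ignored with a warning, mandatory collectors cannot be turned off) can be sketched roughly as follows. The collector sets and the function name `resolve_collectors` are assumptions chosen to mirror the example; they are not the actual CMO implementation.

```python
import logging

log = logging.getLogger("cmo-sketch")

# Hypothetical sets, chosen only to mirror the example config above.
SUPPORTED = {"netclass", "netdev", "hwmon", "buddyinfo", "cpu"}
MANDATORY = {"cpu"}
DEFAULTS = {"netclass": True, "netdev": True, "hwmon": False,
            "buddyinfo": False, "cpu": True}


def resolve_collectors(user_cfg):
    """Return the effective enabled state of each supported collector."""
    effective = dict(DEFAULTS)
    for name, settings in user_cfg.items():
        if name not in SUPPORTED:
            log.warning("collector %s is not supported; ignoring", name)
            continue
        enabled = settings.get("enabled", DEFAULTS[name])
        if name in MANDATORY and not enabled:
            log.warning("collector %s is mandatory; keeping it enabled", name)
            continue
        effective[name] = enabled
    return effective


user_cfg = {
    "netclass": {"enabled": False},
    "hwmon": {"enabled": True},
    "drbd": {"enabled": True},   # unsupported: ignored with a warning
    "cpu": {"enabled": False},   # mandatory: stays enabled with a warning
}
effective = resolve_collectors(user_cfg)
```

Keeping the defaults in one table makes it easy to see which collectors a user can actually change and which settings are silently ineffective.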

Why does the customer need this? (List the business requirements)

See linked issues for customer requests. Ultimately this will serve to add flexibility to our stack and serve customer needs better.

List any affected packages or components.

node-exporter, CMO config

We will add a section for "netdev" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is true.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # disable a collector which is enabled by default
        netdev: 
          enabled: false

Before implementing this feature, check that the metrics from this collector are not necessary for alerts and dashboards.

We will add a section for "netclass" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is true.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # disable a collector which is enabled by default
        netclass: 
          enabled: false

Before implementing this feature, check that the metrics from this collector are not necessary for alerts and dashboards.

We will add a section for "buddyinfo" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is false.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # enable a collector which is disabled by default
        buddyinfo: 
          enabled: true

Refer to: https://issues.redhat.com/browse/OBSDA-44

 

Node Exporter has a new implementation of the netclass collector. The new implementation uses netlink instead of sysfs to collect network device metrics, providing improved performance.
In order to activate it, we add a boolean parameter `useNetlink` for the netclass collector of Node Exporter in the CMO config. The default value is true, activating the netlink implementation to improve performance.
Here is an example of this new CMO config:

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        netclass: 
          enabled: true
          # enable netlink implementation of netclass collector.
          useNetlink: true
  

This config will add the argument `--collector.netclass.netlink` to the node exporter argument list.
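
A minimal sketch of how the netclass settings could map to node exporter arguments. The function name `netclass_args` is hypothetical, and the exact way CMO renders the argument list is an assumption; only the flag names follow node exporter's usual `--collector.<name>` / `--no-collector.<name>` convention and the `--collector.netclass.netlink` flag named above.

```python
def netclass_args(collector_cfg):
    """Translate the netclass section of the config into node exporter arguments."""
    # Both fields default to true, as described in the text above.
    if not collector_cfg.get("enabled", True):
        return ["--no-collector.netclass"]
    args = ["--collector.netclass"]
    if collector_cfg.get("useNetlink", True):
        args.append("--collector.netclass.netlink")
    return args


default_args = netclass_args({})
sysfs_args = netclass_args({"enabled": True, "useNetlink": False})
```

With an empty section the defaults apply, so the netlink flag is present unless `useNetlink` is explicitly set to false.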

ACCEPTANCE CRITERIA:
 
As a team managing OpenShift Container Platform
We want to use the upstream openshift-installer code as much as possible without modification
So that we do not have to duplicate efforts across OCP releases, including mimicking infrastructure changes from Terraform to ARM (ultimately generated by custom ARO golang code), and so that we simplify the effort required for things like gap analysis, vendoring, and ultimately supporting new Y versions.
 
This work will include a complete audit of current patches: what they do, if they are really required, their impact on installing new clusters, their impacts to current ARO features, their interaction with Hive, their interaction with multi-version, their interaction with PUCM, and possibly others.

BREADCRUMBS:

  • ADR: N/A - needed before moving forward, should have wide feedback from the ARO team and openshift-installer team
  • Design Doc: N/A - needed after ADR approval, and should have wide feedback from the ARO team at large
  • Wiki: https://github.com/Azure/ARO-RP/blob/master/docs/upstream-differences.md explains the patching process
  • Similar Work/PRs: List of patches we cherry-picked for 4.11: https://github.com/jewzaam/installer-aro/commits/release-4.11-azure
  • Subject Matter Experts: Anyone that's done a revendor: Brendan Bergen, David Newman, and Matthew Barnes have done this recently, but also people who understand the cluster createOrUpdate code well. The slack channel #forum-installer would be a good connection to the OpenShift installer code owners.

Catalog all installer patches and identify which ones are still required and which ones are no longer needed.

ARO currently patches the version of the SDK to be compatible with code in the RP. We'd rather not patch this, and instead see that version used upstream considering the minimal effort to change it.

ACCEPTANCE CRITERIA:

  • This ARO installer patch is merged into the openshift installer
  • The PR should be included in this JIRA, but also cataloged in the patches spreadsheet (currently in cell F20 - the `Adjust Scope` column for this patch)
  • There's no reason to modify the ARO installer at this time, we'll need to continue to carry the patch until we are using an OCP version that supports the carry-over.

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Epic Goal

  • Update OpenShift components that are owned by the Builds + Jenkins Team to use Kubernetes 1.25

Why is this important?

  • Our components need to be updated to ensure that they are using the latest bug/CVE fixes, features, and that they are API compatible with other OpenShift components.

Acceptance Criteria

  • Existing CI/CD tests must be passing

This epic contains all the Dynamic Plugins related stories for OCP release-4.12

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

Update `i18next` to version 21.x.x and `i18next-parser` to 6.x.x. Then tweak the pluralization rules in `set-english-defaults.js` and address any compilation issues resulting from the update.

Currently the ConsolePlugins API version is v1alpha1. Since we are going GA with dynamic plugins we should be creating a v1 version.

This would require updating the console repository to use the new v1 version in code, manifests and READMEs.

This story is dependent on https://issues.redhat.com/browse/CONSOLE-3069

Placeholder epic to track spontaneous tasks that do not deserve their own epic.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

  • Support manifest lists by image streams and the integrated registry. Clients should be able to pull/push manifest lists from/into the integrated registry. They should also be able to import images via `oc import-image` and then pull them from the internal registry.

Why is this important?

  • Manifest lists are becoming more and more popular. Customers want to mirror manifest lists into the registry and be able to pull them by digest.

Scenarios

  1. Manifest lists can be pushed into the integrated registry
  2. Imported manifest lists can be pulled from the integrated registry
  3. Image triggers work with manifest lists

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Existing functionality shouldn't change its behavior

Dependencies (internal and external)

  1. ...

Previous Work (Optional)

  1. https://github.com/openshift/enhancements/blob/master/enhancements/manifestlist/manifestlist-support.md

Open questions

  1. Can we merge creation of images without having the pruner?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

ACCEPTANCE CRITERIA

  • The ImageStream object should contain a new flag indicating that it refers to a manifest list
  • openshift-controller-manager uses new openshift/api code to import image streams
  • changing `importMode` of an image stream tag triggers a new import (i.e. updates generation in the tag spec)

NOTES

Epic Goal

  • Rebase the image registry onto Distribution 2.8.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Distribution 2.8

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

WHAT
Get approval from openshift monitoring team and open PR against cluster monitoring operator

HOW

TESTS
<List of related tests>

DONE
<bullet point items for what should be completed>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Collect on-prem installation data in order to be able to structure similar ELK dashboards as from SaaS deployments
  • Collect info of ZTP/CIM deployments
  • Collect info of BILLI deployments

Why is this important?

  • We want to track trends, and be able to analyze on-prem installations

Scenarios

  1. As a cluster administrator, I can provision and manage my fleet of clusters knowing that every data point is collected and sent to the Assisted Installer team without having to do anything extra. I know my data will be safe and secure and the team will only collect data they need to improve the product.
  2. As a developer on the assisted installer team, I can analyze the customer data to determine if a feature is worth implementing/keeping/improving. I know that the customer data is accurate and up-to-date. All of the data is parse-able and can be easily tailored to the graphs/visualizations that help my analysis.
  3. As a product owner, I can determine if the product is moving in the right direction based on the actual customer data. I can prioritize features and bug fixes based on the data.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. [Internal] MGMT-11244 Decision for which event streaming service used will determine the endpoint we send the data to

Previous Work (Optional):

 

 MGMT-11244: Remodeling of SaaS data pipeline

 

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • ...

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

Allow users to set kernel parameters when they boot a machine into discovery.

Why is this important?

Some scenarios and environments require some additional parameters for the machines to boot.

Specifically, we've recently seen issues where rd.net.timeout.carrier is required either for static networking to be applied or to successfully get a DHCP address before pulling the rootfs.

Scenarios

  • Full ISO image
  • Minimal ISO image
  • iPXE
  • Cloud/operator (when considering authentication from image service to assisted-service)

Previous Work (Optional):

  1. CoreOS installer has commands to modify kargs, but using them would require significant changes in the image service, and they would currently likely be useful only with the full ISO. Ref: https://coreos.github.io/coreos-installer/cmd/iso/#coreos-installer-iso-kargs-modify

Open questions:

  1. How flexible should this be? Coreos installer has append, delete, replace, and reset.
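
To make the flexibility question concrete, here is a rough sketch of the four operations applied to a list of kernel arguments. The function name `modify_kargs`, the operation names as strings, and the value `rd.net.timeout.carrier=30` are illustrative assumptions; the semantics only loosely follow what the CoreOS installer offers.

```python
def modify_kargs(defaults, current, operation, value=None):
    """Apply one append/delete/replace/reset operation to a kernel argument list."""
    if operation == "reset":
        return list(defaults)
    if operation == "append":
        return current + [value]
    if operation == "delete":
        return [karg for karg in current if karg != value]
    if operation == "replace":
        # Replace any argument that has the same key as the new value.
        key = value.split("=", 1)[0]
        return [value if karg.split("=", 1)[0] == key else karg for karg in current]
    raise ValueError(f"unknown operation: {operation}")


defaults = ["random.trust_cpu=on", "ignition.firstboot"]
appended = modify_kargs(defaults, defaults, "append", "rd.net.timeout.carrier=30")
replaced = modify_kargs(defaults, appended, "replace", "rd.net.timeout.carrier=60")
deleted = modify_kargs(defaults, replaced, "delete", "rd.net.timeout.carrier=60")
```

Supporting only append would cover the rd.net.timeout.carrier case described above; the other operations matter mainly if users need to change the defaults shipped in the image.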

Links

https://coreos.slack.com/archives/CUPJTHQ5P/p1657017280073669
https://coreos.slack.com/archives/C999USB0D/p1656558645775759?thread_ts=1656433137.481149&cid=C999USB0D

For this we'll need a way for users to provide kernel arguments (in both REST and kube APIs) and a place to store them (presumably on the infra-env).

The image service will need a way to query assisted-service for any kernel arg modifications by infra-env id. Whatever endpoint is used will need to support authentication usable by the image service (internal token-based auth of some sort depending on environment).

Modifying the kernel args in the iPXE script should be rather straight-forward.
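
A minimal sketch of the iPXE case, assuming the generated script contains a single `kernel` line to which extra arguments can be appended. The script contents, the `example.invalid` URLs, and the function name `append_kargs_to_ipxe` are hypothetical and are not taken from the image service.

```python
def append_kargs_to_ipxe(script, kargs):
    """Append user-supplied kernel arguments to the kernel line of an iPXE script."""
    if not kargs:
        return script
    lines = []
    for line in script.splitlines():
        if line.startswith("kernel "):
            line = line.rstrip() + " " + " ".join(kargs)
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical script shape, for illustration only.
script = (
    "#!ipxe\n"
    "initrd --name initrd http://example.invalid/initrd.img\n"
    "kernel http://example.invalid/vmlinuz initrd=initrd ignition.firstboot\n"
    "boot\n"
)
patched = append_kargs_to_ipxe(script, ["rd.net.timeout.carrier=30"])
```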

The ISOs will be more complicated. Out of the box, CoreOS reserves an area within the boot configuration for adding kernel parameters. This can be seen by unpacking one of the live images from mirror.openshift.com. Files at both /EFI/redhat/grub.cfg and /isolinux/isolinux.cfg will need to be edited.

For example the grub.cfg file in the iso downloaded from https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.10/4.10.16/rhcos-4.10.16-x86_64-live.x86_64.iso has this entry:

menuentry 'RHEL CoreOS (Live)' --class fedora --class gnu-linux --class gnu --class os {
	linux /images/pxeboot/vmlinuz random.trust_cpu=on coreos.liveiso=rhcos-410.84.202205191234-0 ignition.firstboot ignition.platform.id=metal

	initrd /images/pxeboot/initrd.img /images/ignition.img
}

The offset and length of this embed area is likely stored in the iso file metadata. We could use that directly for the full iso, but would then need to restore this metadata when creating the minimal iso. Alternatively we could try to find the exact offset of this section on demand each time an image is requested.

Either way, an additional section to overwrite would be added to the stream reader with this offset and any kargs added by the user as content.
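
The overwrite step can be sketched as building a fixed-length payload for the embed area and splicing it in at the known offset. The padding byte, the trailing newline, and the function names are assumptions for illustration; the real embed area format should be checked against the ISO metadata.

```python
def build_karg_overlay(kargs, area_length, pad=b"#"):
    """Render kernel arguments as a payload that exactly fills the embed area."""
    payload = " ".join(kargs).encode("ascii") + b"\n"
    if len(payload) > area_length:
        raise ValueError("kernel arguments do not fit in the embed area")
    # Pad so that the overall file size and later offsets stay unchanged.
    return payload + pad * (area_length - len(payload))


def apply_overlay(image, offset, overlay):
    """Return a copy of the image bytes with the overlay written at the offset."""
    return image[:offset] + overlay + image[offset + len(overlay):]


# Tiny stand-in for an ISO: 10 bytes, a 32-byte embed area, 10 more bytes.
image = b"A" * 10 + b"#" * 32 + b"B" * 10
overlay = build_karg_overlay(["rd.net.timeout.carrier=30"], 32)
patched = apply_overlay(image, 10, overlay)
```

In a streaming image service the same overlay would be served in place of the original bytes at that offset rather than applied to an in-memory copy.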

Epic Goal

  • Currently, the default value is OpenShiftSDN for IPv4 and OVNKubernetes for IPv6. From 4.12 onward, OVN should be the default network type. OCP won't block SDN; only the default changes.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Customer encountered arping output that looks like this -

 

 

As you can see, the first MAC address is different from the rest. This MAC address seems to belong to a different, irrelevant machine on the network that is not supposed to be part of the cluster but seems to hold that IP address anyway (i.e. there's an IP conflict).

 

Our agent's parsing of these log lines leads to the L2 connectivity to the 192.168.150.74 host being considered "not successful", because that MAC address does not belong to the inventory reported by the target host.
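
The check described above can be sketched as a comparison between the MAC addresses seen in arping replies and the MAC addresses in the target host's inventory. The function name, the result shape, the made-up MAC addresses, and the rule that any foreign MAC fails the check are assumptions for illustration, not the agent's actual code.

```python
def check_l2_connectivity(replying_macs, inventory_macs):
    """Compare MACs seen in arping replies with the target host's inventory."""
    inventory = {mac.lower() for mac in inventory_macs}
    unexpected = [mac for mac in replying_macs if mac.lower() not in inventory]
    result = {
        "successful": bool(replying_macs) and not unexpected,
        "unexpected_macs": unexpected,
    }
    if unexpected:
        # A foreign MAC answering for the host's IP suggests an IP conflict.
        result["hint"] = "check for IP conflicts in your network"
    return result


# Illustrative, made-up MAC addresses: the first reply comes from a foreign machine.
replies = ["02:00:00:00:00:99", "02:00:00:00:00:01", "02:00:00:00:00:01"]
inventory = ["02:00:00:00:00:01", "02:00:00:00:00:02"]
report = check_l2_connectivity(replies, inventory)
```

Returning the unexpected MAC addresses alongside the verdict gives the validation message enough detail to point the user at the conflicting machine.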

 

For the user, the lack of successful L2 connectivity manifests as the "No connectivity to the majority of hosts in the cluster" validation failing on some of the hosts.

This is not very helpful, as it doesn't help the user understand that they have IP conflicts.

Proposed improvement

The majority connectivity error message should go into specifics about which connectivity is failing (L2 vs L3), which hosts / IP addresses and MAC addresses are involved, and why it is failing (e.g. the MAC address doesn't belong to any of the target host's interfaces), to help the user understand what's going wrong.

Easier improvement

An easier middle-ground solution would probably be just to add a small "check for IP conflicts in your network" note to the current validation message.

When there's packet loss between hosts, the user may be faced with this validation error:

 

Users often get confused by this and ask about what it means exactly. They are not aware we're running a mesh ping test between all hosts so this message may feel out of context for them.

Perhaps we should explain what exactly failed and give users the command they can run to see the packet loss themselves ("Log into host <blabla> and run ping ... bla bla bla IP").

 

Epic Goal

  • Assisted installer should give a formal infraenv REST API for adding additional certs to trust

Why is this important?

  • Users that install OCP on servers that communicate through transparent proxies must trust the proxy's CA for the communication to work
  • The only way users can currently do that is by using both infraenv ignition overrides and install-config overrides. These are generic messy APIs that are very error prone. We should give users a more formal, simpler API to achieve both at the same time. 

Scenarios

  1. Day 1 - discovery ISO OS should trust the bundles the user gives us as an infraenv creation param (either via REST or kube-api). A cluster formed from hosts should trust all certs from all infraenvs of all of its hosts combined.
  2. Day 2 - obviously we don't want to modify existing clusters to trust the cert bundles of infra-envs of hosts that want to join them, so we will simply not handle this case. 
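
The Day 1 rule (a cluster trusts all certs from all infraenvs of its hosts combined) can be sketched as a de-duplicated union of PEM certificates. The function name `combined_trust_bundle` and the placeholder certificate bodies are hypothetical; real bundles would contain base64-encoded certificates.

```python
import re

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def combined_trust_bundle(infraenv_bundles):
    """Merge the bundles of all infraenvs, keeping each certificate once."""
    certs = []
    for bundle in infraenv_bundles:
        for cert in PEM_CERT.findall(bundle):
            if cert not in certs:
                certs.append(cert)
    return "\n".join(certs) + "\n" if certs else ""


# Placeholder certificate bodies, for illustration only.
cert_a = "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----"
cert_b = "-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----"
merged = combined_trust_bundle([cert_a + "\n" + cert_b + "\n", cert_b + "\n"])
```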

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

1. Proposed title of this feature request

Support for software-defined RAID with Assisted-installer

2. What is the nature and description of the request?

Software-defined RAID does not seem to be recognized when installing OCP with Assisted-installer. The installation fails during the minimum hardware validation process.

3. Why does the customer need this? (List the business requirements here)

To be able to install the cluster on software RAID disks when the customer uses the assisted-installer method.

4. List any affected packages or components.

Assisted-installer 

Modify the agent and service so that software RAID devices (/dev/md*) are presented and not blocked.

Epic Goal

  • We need a new API that will allow us to skip cluster/host validations.
  • This API should have its own feature flag.

Why is this important?

  • Some customers and partners have very specific hardware that doesn't pass our validations, and we want to allow them to proceed anyway
  • Sometimes we have bugs in our validations that block people from installing, and we don't want our partners to be stuck because of us

Scenarios

  1. Example from kaloom:
    1. Kaloom has a very specific setup where VIPs can be shown as busy even though installation can proceed with them.
    2. Currently they need to override the VIPs in the install config to be able to install the cluster.
    3. After the new API is added, they can just call it and skip this specific validation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The feature flag for this API should be added to the statistics calculator; if it was set, the cluster failure should not be counted.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

  Where the settings persisted by MGMT-13203 have been stored against the cluster, skip any of the indicated validations.

Moreover, when skipping a validation, an event should be emitted to indicate that this has occurred.

Finally, if any failure would have triggered a Triage ticket, this should be skipped in a similar fashion to https://github.com/openshift-assisted/assisted-installer-deployment/blob/463dc6612c923c46d47e80762c72651758a690d3/tools/create_triage_tickets.py#L64
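
The three behaviours described above (skip the indicated validations, emit an event for each skip, and suppress triage) can be illustrated with a minimal sketch. Every name here is hypothetical and the triage rule is an assumption; the real logic lives in assisted-service and in the linked `create_triage_tickets.py`.

```python
def run_validations(validations, skipped_ids):
    """Run every validation except the skipped ones.

    `validations` maps a validation ID to a zero-argument callable that
    returns True on success. Returns (failed IDs, emitted event messages).
    """
    failures, events = [], []
    for validation_id, check in validations.items():
        if validation_id in skipped_ids:
            # An event is emitted each time a validation is skipped.
            events.append(f"Validation {validation_id} was skipped")
            continue
        if not check():
            failures.append(validation_id)
    return failures, events


def should_open_triage_ticket(failures, skipped_ids):
    # Assumption: a cluster with any skipped validation never opens a ticket.
    return bool(failures) and not skipped_ids
```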

As described in the enhancement doc

https://github.com/openshift/assisted-service/pull/4870

Three endpoints should be created:

`GET api/v2/cluster/<id>/skipped-validations`
Returns a JSON document listing all the validations that are currently being skipped, in a format similar to

```
{
  "host-skipped-validations": [ "no-ip-collisions-in-network", "sufficient-network-latency-requirement-for-role" ],
  "cluster-skipped-validations": [ "network-prefix-valid", "api-vips-defined" ]
}
```
This endpoint fetches any skipped validations already defined for the cluster.

`POST api/v2/cluster/<id>/skipped-validations`

A JSON document is submitted via POST:
```
{ "skipped-host-validations": [ "no-ip-collisions-in-network" ], "skipped-cluster-validations": [ "api-vips-defined" ] }
```
This adds the supplied validations to the skipped validations already defined for the cluster.

`PUT api/v2/cluster/<id>/skipped-validations`

A JSON document is submitted via PUT:
```
{ "skipped-host-validations": [ "no-ip-collisions-in-network" ], "skipped-cluster-validations": [ "api-vips-defined" ] }
```

This replaces all currently skipped validations with the supplied list. It is also possible to submit an empty structure like the following:

```
{ "skipped-host-validations": [ ], "skipped-cluster-validations": [ ] }
```

This would result in all skipped validations being cleared.

Each of the endpoints should update the settings of the cluster so that the skipped validations are updated in real time.
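
The difference between the three verbs (GET reads, POST merges, PUT replaces) can be sketched with an in-memory stand-in. The class and method names are hypothetical, and for simplicity the sketch uses the POST/PUT key names for the GET response as well, which is an assumption.

```python
HOST_KEY = "skipped-host-validations"
CLUSTER_KEY = "skipped-cluster-validations"


class SkippedValidations:
    """In-memory stand-in for one cluster's skipped-validation settings."""

    def __init__(self):
        self._skipped = {HOST_KEY: [], CLUSTER_KEY: []}

    def get(self):
        # GET: return a copy of what is currently skipped.
        return {key: list(names) for key, names in self._skipped.items()}

    def post(self, body):
        # POST: add the supplied validations to the existing ones, no duplicates.
        for key in (HOST_KEY, CLUSTER_KEY):
            for name in body.get(key, []):
                if name not in self._skipped[key]:
                    self._skipped[key].append(name)
        return self.get()

    def put(self, body):
        # PUT: replace everything; empty lists clear all skipped validations.
        for key in (HOST_KEY, CLUSTER_KEY):
            self._skipped[key] = list(body.get(key, []))
        return self.get()
```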

The API should reject any calls to these endpoints by users who are not permitted to access them, using "Per customer feature enablement": https://github.com/openshift/assisted-service/blob/master/docs/dev/feature-per-customer.md

Epic Goal

  • Allow installing on FC disks
  • Warn when not using multipath

Why is this important?

  • Customer demand (they cannot use multipath because ODF doesn't support it)

Scenarios

  1. Customer needs to install on FC disks without multipath

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • When the discovered host doesn't have a specific name (e.g. "localhost"), we should automatically rename it
  • The new name should be derived from the host's main MAC address

Why is this important?

  • To prevent the hostname validations (forbidden name and uniqueness) from failing

Scenarios

  1. The discovered host has a name that it got from the DHCP server - We don't change the name
  2. The discovered host doesn't have a name - We change the hostname to a generic name that passes the name validations
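
The two scenarios can be illustrated with a short sketch. The list of forbidden names and the exact format of the generated name (lower-case MAC with dashes) are assumptions made for illustration only; the epic states only that the name derives from the main MAC address.

```python
# Assumed set of names that count as "no specific name".
FORBIDDEN_HOSTNAMES = {"localhost", "localhost.localdomain"}


def suggested_hostname(current_name, mac_address):
    """Keep a real hostname; otherwise derive one from the main MAC address."""
    if current_name and current_name.lower() not in FORBIDDEN_HOSTNAMES:
        return current_name  # scenario 1: the name came from DHCP, keep it
    # Scenario 2: the MAC is unique per host, so the derived name is unique too.
    return mac_address.lower().replace(":", "-")
```

For example, `suggested_hostname("localhost", "52:54:00:A6:AC:74")` yields `52-54-00-a6-ac-74`, while a host already named `master-0-0` keeps its name.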

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Enable replacing a master node as a day-2 operation

Why is this important?

  • There is currently no simple way to replace a failing control plane. The IPI install method has network infrastructure requirements that the assisted installer does not, and the UPI method is complex enough that the target user for the assisted installer may not feel comfortable doing it.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Deprecate the allow-add-workers endpoint and create a new allow-add-hosts endpoint that does the same thing.

Now that MGMT-8578 is being implemented, the old name doesn't make much sense, as moving a cluster into the adding-hosts state doesn't restrict it to adding just workers.

For disconnected environments, `PUBLIC_CONTAINER_REGISTRIES` may need to be updated so that public registries are not required to be in the pull-secret configs. This is currently possible only by updating the `unsupported.agent-install.openshift.io/assisted-service-configmap` configmap.

The user experience is not ideal in this case, as this step suggests using an unsupported workflow to configure a required config. Ideally, assisted should be able to automatically set this configuration for disconnected environments, especially when the local mirror is configured.

Goal:

  • Automatically add mirrored registries to `PUBLIC_CONTAINER_REGISTRIES` so that auth configs are not required.
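
A minimal sketch of this goal follows. It assumes that `PUBLIC_CONTAINER_REGISTRIES` holds a comma-separated list of registry hosts and that each mirror is given as a `host[:port]/path` string; the function name is hypothetical.

```python
def extend_public_registries(current_value, mirror_registries):
    """Append the host part of each mirror to the comma-separated list."""
    registries = [r.strip() for r in current_value.split(",") if r.strip()]
    for mirror in mirror_registries:
        host = mirror.split("/", 1)[0]  # drop the repository path, keep host[:port]
        if host not in registries:
            registries.append(host)
    return ",".join(registries)
```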

Details will follow...

Related

Things to consider

  • we are adding new fields in REST and Kube APIs
  • we are potentially deprecating existing fields in REST and Kube APIs
  • we are changing DB schema by adding new tables `api_vips` and `ingress_vips` and migrating data out of the `clusters` table from the `api_vip` and `ingress_vip` columns (we have done the same manoeuvre when creating `cluster_networks` table)

Goal

With the new way of storing the data, we need to pre-populate the database so that the new tables `api_vips` and `ingress_vips` are filled with the data for all existing clusters. This simplifies the handling of all operations and allows the singular fields to be deprecated in the future.

Designing the migration properly will enable us to support all the upgrade paths for Assisted, no matter which environment it runs on.
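
The shape of such a migration can be sketched as below. SQLite is used only to keep the example self-contained, and the `ip` column name is an assumption; the table names and the `api_vip` / `ingress_vip` columns come from the description above. The sketch is written so that running it twice does not duplicate rows.

```python
import sqlite3


def migrate_vips(conn):
    """Create the plural VIP tables and copy the singular columns into them."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS api_vips (
            cluster_id TEXT NOT NULL REFERENCES clusters(id),
            ip TEXT NOT NULL,
            PRIMARY KEY (cluster_id, ip)
        );
        CREATE TABLE IF NOT EXISTS ingress_vips (
            cluster_id TEXT NOT NULL REFERENCES clusters(id),
            ip TEXT NOT NULL,
            PRIMARY KEY (cluster_id, ip)
        );
        -- Clusters without a VIP (NULL or empty) are left out.
        INSERT OR IGNORE INTO api_vips (cluster_id, ip)
            SELECT id, api_vip FROM clusters
            WHERE api_vip IS NOT NULL AND api_vip != '';
        INSERT OR IGNORE INTO ingress_vips (cluster_id, ip)
            SELECT id, ingress_vip FROM clusters
            WHERE ingress_vip IS NOT NULL AND ingress_vip != '';
        """
    )
    conn.commit()
```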

Related past work

For the POST payload creating a dual-stack SNO

    {
        "additional_ntp_source": "clock.redhat.com,clock2.redhat.com",
        "base_dns_domain": "qe.lab.redhat.com",
        "cluster_networks": [
            {
                "cidr": "10.128.0.0/14",
                "host_prefix": 23
            },
            {
                "cidr": "fd01::/48",
                "host_prefix": 64
            }
        ],
        "high_availability_mode": "None",
        "machine_networks": [
            {
                "cidr": "192.168.123.0/24"
            },
            {
                "cidr": "fd2e:6f44:5dd8::/64"
            }
        ],
        "name": "ocp-cluster-edge33-0",
        "network_type": "OVNKubernetes",
        "openshift_version": "4.11",
        "pull_secret": "{}",
        "service_networks": [
            {
                "cidr": "172.30.0.0/16"
            },
            {
                "cidr": "fd02::/112"
            }
        ],
        "ssh_public_key": "",
        "vip_dhcp_allocation": false
    }

the plural fields are not populated as expected, i.e. the GET response is

    {
        "additional_ntp_source": "clock.redhat.com,clock2.redhat.com",
        "ams_subscription_id": "2IJaXTpAahPFcgpnCVZiM7R9rQo",
        "api_vip": "192.168.123.150",
        "api_vips": [],
        "base_dns_domain": "qe.lab.redhat.com",
        "cluster_networks": [
            {
                "cidr": "10.128.0.0/14",
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493",
                "host_prefix": 23
            },
            {
                "cidr": "fd01::/48",
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493",
                "host_prefix": 64
            }
        ],
        "connectivity_majority_groups": "{\"192.168.123.0/24\":[],\"IPv4\":[],\"IPv6\":[],\"fd2e:6f44:5dd8::/64\":[]}",
        "controller_logs_collected_at": "2022-12-01T14:41:28.462Z",
        "controller_logs_started_at": "2022-12-01T14:40:58.119Z",
        "cpu_architecture": "x86_64",
        "created_at": "2022-12-01T13:42:59.39193Z",
        "deleted_at": null,
        "disk_encryption": {
            "enable_on": "none",
            "mode": "tpmv2"
        },
        "email_domain": "redhat.com",
        "enabled_host_count": 1,
        "feature_usage": "{\"Additional NTP Source\":{\"data\":{\"source_count\":2},\"id\":\"ADDITIONAL_NTP_SOURCE\",\"name\":\"Additional NTP Source\"},\"Dual-stack\":{\"id\":\"DUAL-STACK\",\"name\":\"Dual-stack\"},\"Hyperthreading\":{\"data\":{\"hyperthreading_enabled\":\"all\"},\"id\":\"HYPERTHREADING\",\"name\":\"Hyperthreading\"},\"OVN network type\":{\"id\":\"OVN_NETWORK_TYPE\",\"name\":\"OVN network type\"},\"SNO\":{\"id\":\"SNO\",\"name\":\"SNO\"}}",
        "high_availability_mode": "None",
        "host_networks": [
            {
                "cidr": "192.168.123.0/24",
                "host_ids": [
                    "c4488ccb-6c87-4dad-ad08-f12d592f7848"
                ]
            },
            {
                "cidr": "fd2e:6f44:5dd8::/64",
                "host_ids": [
                    "c4488ccb-6c87-4dad-ad08-f12d592f7848"
                ]
            }
        ],
        "hosts": [
            {
                "bootstrap": true,
                "checked_in_at": "2022-12-01T14:04:32.475Z",
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493",
                "created_at": "2022-12-01T13:55:25.572259Z",
                "deleted_at": null,
                "discovery_agent_version": "registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-agent-rhel8:latest",
                "disks_info": "{\"/dev/disk/by-id/wwn-0x05abcd571306b13b\":{\"disk_speed\":{\"tested\":true},\"path\":\"/dev/disk/by-id/wwn-0x05abcd571306b13b\"}}",
                "domain_name_resolutions": "{\"resolutions\":[{\"domain_name\":\"api.ocp-cluster-edge33-0.qe.lab.redhat.com\",\"ipv4_addresses\":[\"192.168.123.5\"],\"ipv6_addresses\":[\"fd2e:6f44:5dd8::5\"]},{\"domain_name\":\"api-int.ocp-cluster-edge33-0.qe.lab.redhat.com\",\"ipv4_addresses\":[],\"ipv6_addresses\":[]},{\"domain_name\":\"console-openshift-console.apps.ocp-cluster-edge33-0.qe.lab.redhat.com\",\"ipv4_addresses\":[\"192.168.123.150\"],\"ipv6_addresses\":[]},{\"domain_name\":\"validateNoWildcardDNS.ocp-cluster-edge33-0.qe.lab.redhat.com\",\"ipv4_addresses\":[],\"ipv6_addresses\":[]}]}",
                "href": "/api/assisted-install/v2/infra-envs/dbc8b22f-8ad0-4338-81f2-24f595ec4377/hosts/c4488ccb-6c87-4dad-ad08-f12d592f7848",
                "id": "c4488ccb-6c87-4dad-ad08-f12d592f7848",
                "images_status": "{\"quay.io/openshift-release-dev/ocp-release:4.11.17-x86_64\":{\"download_rate\":46.081211839377055,\"name\":\"quay.io/openshift-release-dev/ocp-release:4.11.17-x86_64\",\"result\":\"success\",\"size_bytes\":402075414,\"time\":8.725365457},\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:00dc61290f10ec21a547731460f067f957e845eb7a6cc9e29044c73a62a41e04\":{\"download_rate\":79.17343602484354,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:00dc61290f10ec21a547731460f067f957e845eb7a6cc9e29044c73a62a41e04\",\"result\":\"success\",\"size_bytes\":506886444,\"time\":6.402228695},\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2cf7da4892d34247d7135930021739023a85c635c946649252151e0279686abb\":{\"download_rate\":95.42367677462869,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2cf7da4892d34247d7135930021739023a85c635c946649252151e0279686abb\",\"result\":\"success\",\"size_bytes\":487635588,\"time\":5.110215876},\"registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-rhel8:latest\":{\"download_rate\":45.82888987054992,\"name\":\"registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-rhel8:latest\",\"result\":\"success\",\"size_bytes\":389861420,\"time\":8.506892074}}",
                "infra_env_id": "dbc8b22f-8ad0-4338-81f2-24f595ec4377",
                "installation_disk_id": "/dev/disk/by-id/wwn-0x05abcd571306b13b",
                "installation_disk_path": "/dev/sda",
                "installer_version": "registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-rhel8:latest",
                "inventory": "{\"bmc_address\":\"0.0.0.0\",\"bmc_v6address\":\"::/0\",\"boot\":{\"current_boot_mode\":\"bios\"},\"cpu\":{\"architecture\":\"x86_64\",\"count\":16,\"flags\":[\"fpu\",\"vme\",\"de\",\"pse\",\"tsc\",\"msr\",\"pae\",\"mce\",\"cx8\",\"apic\",\"sep\",\"mtrr\",\"pge\",\"mca\",\"cmov\",\"pat\",\"pse36\",\"clflush\",\"mmx\",\"fxsr\",\"sse\",\"sse2\",\"ss\",\"syscall\",\"nx\",\"pdpe1gb\",\"rdtscp\",\"lm\",\"constant_tsc\",\"arch_perfmon\",\"rep_good\",\"nopl\",\"xtopology\",\"cpuid\",\"tsc_known_freq\",\"pni\",\"pclmulqdq\",\"vmx\",\"ssse3\",\"fma\",\"cx16\",\"pdcm\",\"pcid\",\"sse4_1\",\"sse4_2\",\"x2apic\",\"movbe\",\"popcnt\",\"tsc_deadline_timer\",\"aes\",\"xsave\",\"avx\",\"f16c\",\"rdrand\",\"hypervisor\",\"lahf_lm\",\"abm\",\"3dnowprefetch\",\"cpuid_fault\",\"invpcid_single\",\"ssbd\",\"ibrs\",\"ibpb\",\"stibp\",\"ibrs_enhanced\",\"tpr_shadow\",\"vnmi\",\"flexpriority\",\"ept\",\"vpid\",\"ept_ad\",\"fsgsbase\",\"tsc_adjust\",\"bmi1\",\"avx2\",\"smep\",\"bmi2\",\"erms\",\"invpcid\",\"mpx\",\"avx512f\",\"avx512dq\",\"rdseed\",\"adx\",\"smap\",\"clflushopt\",\"clwb\",\"avx512cd\",\"avx512bw\",\"avx512vl\",\"xsaveopt\",\"xsavec\",\"xgetbv1\",\"xsaves\",\"arat\",\"umip\",\"pku\",\"ospke\",\"avx512_vnni\",\"md_clear\",\"arch_capabilities\"],\"frequency\":2095.076,\"model_name\":\"Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz\"},\"disks\":[{\"by_id\":\"/dev/disk/by-id/wwn-0x05abcd571306b13b\",\"by_path\":\"/dev/disk/by-path/pci-0000:00:04.0-scsi-0:0:0:0\",\"drive_type\":\"HDD\",\"has_uuid\":true,\"hctl\":\"0:0:0:0\",\"id\":\"/dev/disk/by-id/wwn-0x05abcd571306b13b\",\"installation_eligibility\":{\"eligible\":true,\"not_eligible_reasons\":null},\"model\":\"QEMU_HARDDISK\",\"name\":\"sda\",\"path\":\"/dev/sda\",\"serial\":\"05abcd571306b13b\",\"size_bytes\":141733920768,\"smart\":\"SMART support is:     Unavailable - device lacks SMART 
capability.\\n\",\"vendor\":\"QEMU\",\"wwn\":\"0x05abcd571306b13b\"},{\"bootable\":true,\"by_path\":\"/dev/disk/by-path/pci-0000:00:04.0-scsi-0:0:0:3\",\"drive_type\":\"ODD\",\"has_uuid\":true,\"hctl\":\"0:0:0:3\",\"id\":\"/dev/disk/by-path/pci-0000:00:04.0-scsi-0:0:0:3\",\"installation_eligibility\":{\"not_eligible_reasons\":[\"Disk is removable\",\"Disk is too small (disk only has 1.1 GB, but 20 GB are required)\",\"Drive type is ODD, it must be one of HDD, SSD, Multipath, FC.\"]},\"is_installation_media\":true,\"model\":\"QEMU_CD-ROM\",\"name\":\"sr0\",\"path\":\"/dev/sr0\",\"removable\":true,\"serial\":\"drive-scsi0-0-0-3\",\"size_bytes\":1135607808,\"smart\":\"SMART support is:     Unavailable - device lacks SMART capability.\\n\",\"vendor\":\"QEMU\"}],\"gpus\":[{\"address\":\"0000:00:02.0\"}],\"hostname\":\"master-0-0\",\"interfaces\":[{\"flags\":[\"up\",\"broadcast\",\"multicast\"],\"has_carrier\":true,\"ipv4_addresses\":[\"192.168.123.150/24\"],\"ipv6_addresses\":[\"fd2e:6f44:5dd8::62/64\"],\"mac_address\":\"52:54:00:a6:ac:74\",\"mtu\":1500,\"name\":\"ens3\",\"product\":\"0x0001\",\"speed_mbps\":-1,\"type\":\"physical\",\"vendor\":\"0x1af4\"}],\"memory\":{\"physical_bytes\":34359738368,\"physical_bytes_method\":\"dmidecode\",\"usable_bytes\":33706176512},\"routes\":[{\"destination\":\"0.0.0.0\",\"family\":2,\"gateway\":\"192.168.123.1\",\"interface\":\"ens3\",\"metric\":100},{\"destination\":\"10.88.0.0\",\"family\":2,\"interface\":\"cni-podman0\"},{\"destination\":\"192.168.123.0\",\"family\":2,\"interface\":\"ens3\",\"metric\":100},{\"destination\":\"::1\",\"family\":10,\"interface\":\"lo\",\"metric\":256},{\"destination\":\"fd2e:6f44:5dd8::62\",\"family\":10,\"interface\":\"ens3\",\"metric\":100},{\"destination\":\"fd2e:6f44:5dd8::\",\"family\":10,\"interface\":\"ens3\",\"metric\":100},{\"destination\":\"fe80::\",\"family\":10,\"interface\":\"cni-podman0\",\"metric\":256},{\"destination\":\"fe80::\",\"family\":10,\"interface\":\"ens3\",\"metric\":1024},{\
"destination\":\"::\",\"family\":10,\"gateway\":\"fe80::5054:ff:fe89:3f0\",\"interface\":\"ens3\",\"metric\":100}],\"system_vendor\":{\"manufacturer\":\"Red Hat\",\"product_name\":\"KVM\",\"virtual\":true},\"tpm_version\":\"none\"}",
                "kind": "Host",
                "logs_collected_at": "2022-12-01T14:04:24.790Z",
                "logs_info": "completed",
                "logs_started_at": "2022-12-01T14:04:23.649Z",
                "ntp_sources": "[{\"source_name\":\"time.cloudflare.com\",\"source_state\":\"synced\"},{\"source_name\":\"time.cloudflare.com\",\"source_state\":\"unreachable\"},{\"source_name\":\"time.cloudflare.com\",\"source_state\":\"unreachable\"},{\"source_name\":\"time.cloudflare.com\",\"source_state\":\"not_combined\"},{\"source_name\":\"ntp2.ntp-001.prod.iad2.dc.redhat.com\",\"source_state\":\"unreachable\"}]",
                "progress": {
                    "current_stage": "Done",
                    "installation_percentage": 100,
                    "stage_started_at": "2022-12-01T14:15:22.822Z",
                    "stage_updated_at": "2022-12-01T14:15:22.822Z"
                },
                "progress_stages": [
                    "Starting installation",
                    "Installing",
                    "Waiting for bootkube",
                    "Writing image to disk",
                    "Rebooting",
                    "Done"
                ],
                "registered_at": "2022-12-01T13:55:25.568Z",
                "requested_hostname": "master-0-0",
                "role": "master",
                "stage_started_at": "0001-01-01T00:00:00.000Z",
                "stage_updated_at": "0001-01-01T00:00:00.000Z",
                "status": "installed",
                "status_info": "Done",
                "status_updated_at": "2022-12-01T14:15:22.822Z",
                "suggested_role": "master",
                "timestamp": 1669903472,
                "updated_at": "2022-12-01T14:15:22.823598Z",
                "user_name": "assisted-installer-qe-ci2",
                "validations_info": "{\"hardware\":[{\"id\":\"has-inventory\",\"status\":\"success\",\"message\":\"Valid inventory exists for the host\"},{\"id\":\"has-min-cpu-cores\",\"status\":\"success\",\"message\":\"Sufficient CPU cores\"},{\"id\":\"has-min-memory\",\"status\":\"success\",\"message\":\"Sufficient minimum RAM\"},{\"id\":\"has-min-valid-disks\",\"status\":\"success\",\"message\":\"Sufficient disk capacity\"},{\"id\":\"has-cpu-cores-for-role\",\"status\":\"success\",\"message\":\"Sufficient CPU cores for role master\"},{\"id\":\"has-memory-for-role\",\"status\":\"success\",\"message\":\"Sufficient RAM for role master\"},{\"id\":\"hostname-unique\",\"status\":\"success\",\"message\":\"Hostname master-0-0 is unique in cluster\"},{\"id\":\"hostname-valid\",\"status\":\"success\",\"message\":\"Hostname master-0-0 is allowed\"},{\"id\":\"sufficient-installation-disk-speed\",\"status\":\"success\",\"message\":\"Speed of installation disk is sufficient\"},{\"id\":\"compatible-with-cluster-platform\",\"status\":\"success\",\"message\":\"Host is compatible with cluster platform none\"},{\"id\":\"vsphere-disk-uuid-enabled\",\"status\":\"success\",\"message\":\"VSphere disk.EnableUUID is enabled for this virtual machine\"},{\"id\":\"compatible-agent\",\"status\":\"success\",\"message\":\"Host agent compatibility checking is disabled\"},{\"id\":\"no-skip-installation-disk\",\"status\":\"success\",\"message\":\"No request to skip formatting of the installation disk\"},{\"id\":\"no-skip-missing-disk\",\"status\":\"success\",\"message\":\"All disks that have skipped formatting are present in the host inventory\"}],\"network\":[{\"id\":\"machine-cidr-defined\",\"status\":\"success\",\"message\":\"No Machine Network CIDR needed: User Managed Networking\"},{\"id\":\"belongs-to-machine-cidr\",\"status\":\"success\",\"message\":\"Host belongs to all machine network CIDRs\"},{\"id\":\"belongs-to-majority-group\",\"status\":\"success\",\"message\":\"Host has 
connectivity to the majority of hosts in the cluster\"},{\"id\":\"valid-platform-network-settings\",\"status\":\"success\",\"message\":\"Platform KVM is allowed\"},{\"id\":\"container-images-available\",\"status\":\"success\",\"message\":\"All required container images were either pulled successfully or no attempt was made to pull them\"},{\"id\":\"sufficient-network-latency-requirement-for-role\",\"status\":\"success\",\"message\":\"Network latency requirement has been satisfied.\"},{\"id\":\"sufficient-packet-loss-requirement-for-role\",\"status\":\"success\",\"message\":\"Packet loss requirement has been satisfied.\"},{\"id\":\"has-default-route\",\"status\":\"success\",\"message\":\"Host has been configured with at least one default route.\"},{\"id\":\"api-domain-name-resolved-correctly\",\"status\":\"success\",\"message\":\"Domain name resolution for the api.ocp-cluster-edge33-0.qe.lab.redhat.com domain was successful or not required\"},{\"id\":\"api-int-domain-name-resolved-correctly\",\"status\":\"success\",\"message\":\"Domain name resolution for the api-int.ocp-cluster-edge33-0.qe.lab.redhat.com domain was successful or not required\"},{\"id\":\"apps-domain-name-resolved-correctly\",\"status\":\"success\",\"message\":\"Domain name resolution for the *.apps.ocp-cluster-edge33-0.qe.lab.redhat.com domain was successful or not required\"},{\"id\":\"dns-wildcard-not-configured\",\"status\":\"success\",\"message\":\"DNS wildcard check was successful\"},{\"id\":\"non-overlapping-subnets\",\"status\":\"success\",\"message\":\"Host subnets are not overlapping\"}],\"operators\":[{\"id\":\"cnv-requirements-satisfied\",\"status\":\"success\",\"message\":\"cnv is disabled\"},{\"id\":\"lso-requirements-satisfied\",\"status\":\"success\",\"message\":\"lso is disabled\"},{\"id\":\"lvm-requirements-satisfied\",\"status\":\"success\",\"message\":\"lvm is disabled\"},{\"id\":\"odf-requirements-satisfied\",\"status\":\"success\",\"message\":\"odf is disabled\"}]}"
            }
        ],
        "href": "/api/assisted-install/v2/clusters/872e4039-2d4b-4f35-96e8-5a741b4c0493",
        "hyperthreading": "all",
        "id": "872e4039-2d4b-4f35-96e8-5a741b4c0493",
        "ignition_endpoint": {},
        "image_info": {
            "created_at": "0001-01-01T00:00:00Z",
            "expires_at": "0001-01-01T00:00:00.000Z"
        },
        "ingress_vip": "192.168.123.150",
        "ingress_vips": [],
        "install_completed_at": "2022-12-01T14:40:06.383Z",
        "install_started_at": "2022-12-01T13:56:49.723Z",
        "kind": "Cluster",
        "logs_info": "completed",
        "machine_networks": [
            {
                "cidr": "192.168.123.0/24",
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493"
            },
            {
                "cidr": "fd2e:6f44:5dd8::/64",
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493"
            }
        ],
        "monitored_operators": [
            {
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493",
                "name": "console",
                "operator_type": "builtin",
                "status": "available",
                "status_info": "All is well",
                "status_updated_at": "2022-12-01T14:34:52.779Z",
                "timeout_seconds": 3600
            },
            {
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493",
                "name": "cvo",
                "operator_type": "builtin",
                "status": "available",
                "status_info": "Done applying 4.11.17",
                "status_updated_at": "2022-12-01T14:34:52.948Z",
                "timeout_seconds": 3600
            }
        ],
        "name": "ocp-cluster-edge33-0",
        "network_type": "OVNKubernetes",
        "ocp_release_image": "quay.io/openshift-release-dev/ocp-release:4.11.17-x86_64",
        "openshift_cluster_id": "1a01e65d-eddd-4073-ba99-60205ff3c279",
        "openshift_version": "4.11.17",
        "org_id": "16020201",
        "platform": {
            "type": "none"
        },
        "progress": {
            "finalizing_stage_percentage": 100,
            "installing_stage_percentage": 100,
            "preparing_for_installation_stage_percentage": 100,
            "total_percentage": 100
        },
        "pull_secret_set": true,
        "schedulable_masters": false,
        "schedulable_masters_forced_true": true,
        "service_networks": [
            {
                "cidr": "172.30.0.0/16",
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493"
            },
            {
                "cidr": "fd02::/112",
                "cluster_id": "872e4039-2d4b-4f35-96e8-5a741b4c0493"
            }
        ],
        "ssh_public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCfNfFoyKBFvzvLSyz4QKntDvK7QHYG8BApPYL1XXTQCVUSxjJ/54xPlwuhB4gpf8giiknfQFavklxFyW1Pfmz7JlI7oHlfi1Imq7YEUya5UB70LUdY4dPjaAi5RJK3cNJYPiL36+cqaK/+KlM2/+gmAE4JzOp0vHEO44xxCtlFBsT5LMDiDaP/AEB7NsGNrEuwfMn1EmMqFGrwo9RzGMjO+2TCESuJcoK8gHt2RrW2F4BehwMD6PmipKF7aTUOpo5yAJg+R6s5sXFjg5QQBHy128DOngJ7SlRkLMAYKynRv3kWkVBog2VfP6BY49xng6J1DAkw2SlINnKw8hulZCNJ",
        "status": "installed",
        "status_info": "Cluster is installed",
        "status_updated_at": "2022-12-01T14:40:06.383Z",
        "total_host_count": 1,
        "updated_at": "2022-12-01T14:41:28.623232Z",
        "user_managed_networking": true,
        "user_name": "assisted-installer-qe-ci2",
        "validations_info": "{\"configuration\":[{\"id\":\"pull-secret-set\",\"status\":\"success\",\"message\":\"The pull secret is set.\"}],\"hosts-data\":[{\"id\":\"all-hosts-are-ready-to-install\",\"status\":\"success\",\"message\":\"All hosts in the cluster are ready to install.\"},{\"id\":\"sufficient-masters-count\",\"status\":\"success\",\"message\":\"The cluster has the exact amount of dedicated control plane nodes.\"}],\"network\":[{\"id\":\"api-vips-defined\",\"status\":\"success\",\"message\":\"API virtual IPs are not required: User Managed Networking\"},{\"id\":\"api-vips-valid\",\"status\":\"success\",\"message\":\"API virtual IPs are not required: User Managed Networking\"},{\"id\":\"cluster-cidr-defined\",\"status\":\"success\",\"message\":\"The Cluster Network CIDR is defined.\"},{\"id\":\"dns-domain-defined\",\"status\":\"success\",\"message\":\"The base domain is defined.\"},{\"id\":\"ingress-vips-defined\",\"status\":\"success\",\"message\":\"Ingress virtual IPs are not required: User Managed Networking\"},{\"id\":\"ingress-vips-valid\",\"status\":\"success\",\"message\":\"Ingress virtual IPs are not required: User Managed Networking\"},{\"id\":\"machine-cidr-defined\",\"status\":\"success\",\"message\":\"The Machine Network CIDR is defined.\"},{\"id\":\"machine-cidr-equals-to-calculated-cidr\",\"status\":\"success\",\"message\":\"The Cluster Machine CIDR is not required: User Managed Networking\"},{\"id\":\"network-prefix-valid\",\"status\":\"success\",\"message\":\"The Cluster Network prefix is valid.\"},{\"id\":\"network-type-valid\",\"status\":\"success\",\"message\":\"The cluster has a valid network type\"},{\"id\":\"networks-same-address-families\",\"status\":\"success\",\"message\":\"Same address families for all networks.\"},{\"id\":\"no-cidrs-overlapping\",\"status\":\"success\",\"message\":\"No CIDRS are overlapping.\"},{\"id\":\"ntp-server-configured\",\"status\":\"success\",\"message\":\"No ntp problems 
found\"},{\"id\":\"service-cidr-defined\",\"status\":\"success\",\"message\":\"The Service Network CIDR is defined.\"}],\"operators\":[{\"id\":\"cnv-requirements-satisfied\",\"status\":\"success\",\"message\":\"cnv is disabled\"},{\"id\":\"lso-requirements-satisfied\",\"status\":\"success\",\"message\":\"lso is disabled\"},{\"id\":\"lvm-requirements-satisfied\",\"status\":\"success\",\"message\":\"lvm is disabled\"},{\"id\":\"odf-requirements-satisfied\",\"status\":\"success\",\"message\":\"odf is disabled\"}]}",
        "vip_dhcp_allocation": false
    }

The important point here is that this is SNO. For SNO the user must never provide VIPs; the API VIP equals the Ingress VIP, and both point to the SNO host's IP address. Given that these IPs are not set in ClusterCreate, I think it's something small and silly that we missed in the initial implementation.

As described in the enhancement, Phase 2 of the feature's lifecycle carries a relatively sophisticated requirement related to using dual-stack VIPs, i.e.

If dual-stack VIPs are to be used, the `api_vips` must contain the desired configuration (the obvious part) but at the same time the `api_vip` must contain the value that matches `api_vips[0]` (the non-obvious part).
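
The rule can be expressed as a small check. This is only a sketch: the function name is hypothetical, and `api_vips` is treated as a plain list of address strings, which may differ from the shape used in the real API.

```python
def check_vip_consistency(api_vip, api_vips):
    """Return an error message, or None when the singular and plural fields agree."""
    if not api_vips:
        return None  # nothing to compare against
    if api_vip != api_vips[0]:
        return f"api_vip ({api_vip!r}) must match api_vips[0] ({api_vips[0]!r})"
    return None
```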

This comes from the guidelines for deprecating fields in upstream Kubernetes and is not open to further discussion.
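The cross-field rule above can be sketched as a small check. This is only an illustration: the function name and the error text are hypothetical and are not taken from the assisted-service code.

```python
from typing import List, Optional


def validate_vip_consistency(api_vip: Optional[str], api_vips: List[str]) -> None:
    """Raise ValueError when the deprecated singular field disagrees with the list.

    Hypothetical helper: the name and messages are assumptions for illustration.
    """
    if not api_vips:
        # Nothing to cross-check: only the legacy field (or nothing) was supplied.
        return
    if api_vip != api_vips[0]:
        raise ValueError(
            f"api_vip ({api_vip!r}) must match api_vips[0] ({api_vips[0]!r})"
        )
```

The same check would apply to `ingress_vip` and `ingress_vips`, assuming they follow the same deprecation rule.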

We have a generic dual-stack feature flag. Given that dual-stack and dual-stack VIPs have different compatibility matrices, we want a separate feature flag for the VIPs.

This will also make the implementation of validations easier, as we can just refer to the map holding versions & features to check whether something is supported, instead of implementing the same condition manually in multiple places.

The feature is available upstream only starting from 4.12; given that our API is version-agnostic, we need the following behaviour:

  • if `api_vips` with >1 entry is provided and OCP version is pre-4.12, an error should be thrown
  • if `api_vips` with exactly 1 entry is provided and OCP version is pre-4.12, we should install successfully
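A minimal sketch of that behaviour, driven by a map of features to minimum versions. The map name, the feature key `DUAL_STACK_VIPS` and the helper names are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# Hypothetical feature -> minimum OCP (major, minor) map; the key name is an assumption.
MIN_VERSION: Dict[str, Tuple[int, int]] = {"DUAL_STACK_VIPS": (4, 12)}


def parse_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '4.12' or '4.12.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_supported(feature: str, ocp_version: str) -> bool:
    """Single place to answer 'is this feature available in this version?'."""
    return parse_version(ocp_version) >= MIN_VERSION[feature]


def check_api_vips(api_vips: List[str], ocp_version: str) -> None:
    """Reject multiple VIPs on versions that predate dual-stack VIP support."""
    if len(api_vips) > 1 and not is_supported("DUAL_STACK_VIPS", ocp_version):
        raise ValueError(
            f"{len(api_vips)} API VIPs provided, but OCP {ocp_version} supports only one"
        )
```

Keeping the version comparison in `is_supported` means each validation only names the feature, which is the simplification the separate feature flag is meant to provide.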

Goal

After the change introduced in https://github.com/openshift/installer/pull/5798, we can now pass multiple VIPs to the install-config.yaml

Deprecation process

Unlike the Kubernetes process, o/installer does not follow a phase-out. This means that starting from 4.12 we should pass only `apiVIPs` and `ingressVIPs`. The previous fields are deprecated immediately.

Goal

Dual stack requires selecting both IPv4 and IPv6 values for

  • API VIP
  • Ingress VIP

At the same time, it should still be possible to select either only IPv4 or only IPv6 values (single stack).

Handle in the assisted API and Kube API (CRDs).

Related past work

API version bump

With introducing aforementioned changes in the KubeAPI, we need to release a new version.

As the new fields in KubeAPI are not mandatory, the release of new version is not needed.

Implementation detail

Like in the past, we should introduce 2 new tables in the DB

  • api_vips
  • ingress_vips

that will store the value in subject as well as link back to `cluster_id` using the foreign key. In gorm that would be equivalent to

gorm:"foreignkey:ClusterID;association_foreignkey:ID"

with the full definition of the field being something like

  api_vip:
    type: object
    properties:
      cluster_id:
        type: string
        format: uuid
        x-go-custom-tag: gorm:"primary_key;foreignkey:Cluster"
      ip:
        $ref: '#/definitions/ip'
        x-go-custom-tag: gorm:"primary_key"

We should also introduce a new type representing IP address, like

  ip:
    type: string
    x-go-custom-tag: gorm:"primaryKey"
    pattern: '^(?:(?:(?:[0-9]{1,3}\.){3}[0-9]{1,3})|(?:(?:[0-9a-fA-F]*:[0-9a-fA-F]*){2,}))?$'
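The pattern above is deliberately loose: it accepts the empty string and any dotted quad of up to three digits per octet. A short sketch, assuming a stricter check is wanted on top of the schema, shows the difference using Python's standard `ipaddress` module. The helper names are hypothetical.

```python
import ipaddress
import re

# Same expression as in the schema definition above.
IP_PATTERN = re.compile(
    r"^(?:(?:(?:[0-9]{1,3}\.){3}[0-9]{1,3})|(?:(?:[0-9a-fA-F]*:[0-9a-fA-F]*){2,}))?$"
)


def matches_schema(value: str) -> bool:
    """True when the value passes the schema-level pattern."""
    return IP_PATTERN.match(value) is not None


def is_real_ip(value: str) -> bool:
    """True when the value parses as a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True
```

A value such as `999.999.999.999` passes `matches_schema` but fails `is_real_ip`, so the pattern alone should not be treated as full address validation.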

For backward compatibility, the following state machine validations should support a single VIP (API and Ingress).

Relevant validators, with their new names:

api-vip-defined     --> api-vips-defined
api-vip-valid       --> api-vips-valid
ingress-vip-defined --> ingress-vips-defined
ingress-vip-valid   --> ingress-vips-valid

 

Current State

As of today, users of dual-stack clusters are required to pass both Machine Networks explicitly. This is because there is only an IPv4 VIP; we could technically calculate the first subnet, but we have no way of knowing the desired subnet for the second network. For the sake of simplicity, we decided not to perform any auto-calculation.

Goal

With support for dual-stack VIPs, the aforementioned limitation is removed. Knowing the VIPs from both stacks and having the hosts' inventories, we can calculate both Machine Networks. Therefore, for any scenario where both VIPs are provided, we can stop accepting Machine Networks from the user.
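The calculation can be sketched as follows: for each VIP, pick the subnet from the hosts' inventories that contains it. This is an illustration only; the function name and the flat list of interface CIDRs are assumptions, and the real inventory structure is richer.

```python
import ipaddress
from typing import Iterable, List


def machine_networks_for_vips(vips: Iterable[str], host_cidrs: Iterable[str]) -> List[str]:
    """Return one machine network per VIP, derived from host interface addresses.

    host_cidrs holds interface addresses with prefix length, e.g. '192.0.2.21/24'.
    """
    networks = {ipaddress.ip_interface(cidr).network for cidr in host_cidrs}
    result = []
    for vip in vips:
        address = ipaddress.ip_address(vip)
        candidates = [
            net for net in networks
            if net.version == address.version and address in net
        ]
        if not candidates:
            raise ValueError(f"no host network contains VIP {vip}")
        # Prefer the most specific subnet when several contain the VIP.
        best = max(candidates, key=lambda net: net.prefixlen)
        result.append(str(best))
    return result
```

With a single VIP the function yields only one network, which matches the limitation that the second Machine Network cannot be derived in that case.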

Unknowns

With implementation of this epic, users will still have a choice of 1 or 2 VIPs for dual-stack setups. This means that in a scenario of dual-stack with a single VIP, we still need to require both Machine Networks.

Epic Goal

  • Follow up on https://issues.redhat.com/browse/MON-2209
  • Develop a notion of optional scrape profiles for service monitors and handle them in CMO
  • Give users the option to influence the number of metrics the in-cluster stack collects
  • Improve CMO's scaling behavior in very small and very large environments

Why is this important?

  • In some environments CMO exhibits bad behavior. In single node environments it is one of the main consumers of the available resource budget; in very large clusters its memory usage requires very large nodes to keep up.
  • Not all metrics collected are strictly necessary for functionality of other components. These can optionally be dropped if the admin is not interested in them.

Scenarios

  1. A single node deployment always wants to minimize resource usage and the admin might want to choose to collect as few metrics as possible.
  2. A user with many clusters often has an existing monitoring setup and wants to spend as few resources on OpenShift internal monitoring as possible. We want to give them an option to scrape as few metrics as possible.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The ServiceMonitors that CMO deploys implement the chosen profile set.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MON-2209

Open questions::

  1.  

The user interface would give users the option to specify a profile they want CMO to scrape. The set of possible profiles will be pre-defined by us.

If this new option is used, CMO populates the [pod|service]MonitorSelector to select resources that carry the requested profile, probably as a label with the respective value (label name tbd, let's call it the profile label for now), and monitors that do not have the label set at all. So monitors will be picked from two sets: monitors with the profile label and the requested label value, and all monitors without the profile label present (in addition to the current namespace selector).

After this it is up to the ServiceMonitors to implement the scrape profiles. Without any change to the ServiceMonitors, even after setting a profile in the CMO config, things should work as they did before. When a ServiceMonitor owner wants to implement scrape profiles, they need to provide ServiceMonitors for all profiles and no unlabeled ServiceMonitor. If a profile label is not used by any of the component's ServiceMonitors, the component will not be scraped at all when that profile is selected.

Let's say that we support 3 scrape profiles:

  • "full" (same as today)
  • "operational" (only collect metrics for recording rules and dashboards)
  • "uponly" (collect the up metric only and none of the exposed metrics)

When the cluster admin enables the "operational" profile, the k8s Prometheus resource would be

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: openshift-monitoring
spec:
  serviceMonitorSelector:
    matchExpressions:
    - key: monitoring.openshift.io/scrape-profile
      operator: NotIn
      values:
      - "full"
      - "uponly"

A hypothetical component that wants to support the scrape profiles would need to provision 3 service monitors for each service (1 service monitor per profile).

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    monitoring.openshift.io/scrape-profile: full
  name: foo-full
  namespace: openshift-bar
spec:
  endpoints:
  - port: metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: foo
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    monitoring.openshift.io/scrape-profile: operational
  name: foo-operational
  namespace: openshift-bar
spec:
  endpoints:
  - port: metrics
    metricRelabelings:
    - sourceLabels: [__name__]
      action: keep
      regex: "requests_total|requests_failed_total"
  selector:
    matchLabels:
      app.kubernetes.io/name: foo
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    monitoring.openshift.io/scrape-profile: uponly
  name: foo-uponly
  namespace: openshift-bar
spec:
  endpoints:
  - port: metrics
    metricRelabelings:
    - sourceLabels: [__name__]
      action: drop
      regex: ".+"
  selector:
    matchLabels:
      app.kubernetes.io/name: foo

 

A component that doesn't need/want to adopt scrape profile should be scraped as before irrespective of the configured scrape profile.
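The selection rule can be simulated in a few lines. This sketch mirrors the `NotIn` expression shown earlier: a monitor is selected unless it carries the profile label with a value of another profile, so unlabeled monitors are always kept. The function name and the sample monitor names are hypothetical.

```python
from typing import Dict, List

PROFILE_LABEL = "monitoring.openshift.io/scrape-profile"
PROFILES = ("full", "operational", "uponly")


def selected_monitors(monitors: Dict[str, Dict[str, str]], profile: str) -> List[str]:
    """Return the names of monitors a NotIn selector for `profile` would pick.

    `monitors` maps a monitor name to its labels.
    """
    excluded = {p for p in PROFILES if p != profile}
    return sorted(
        name
        for name, labels in monitors.items()
        # A missing label yields None, which is never in `excluded`.
        if labels.get(PROFILE_LABEL) not in excluded
    )
```

For example, with the three `foo-*` monitors above plus an unlabeled monitor named `legacy`, selecting the "operational" profile yields `foo-operational` and `legacy`.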

 

AI

  • Demonstrate that the proposed implementation actually works.

1. Proposed title of this feature request

Allow specifying secrets/configmaps in the Alertmanager StatefulSet of cluster monitoring.

2. What is the nature and description of the request?
Allow to add secrets (or additionally configMaps) in alertmanager configuration of cluster or user workloads monitoring.

3. Why does the customer need this? (List the business requirements here)

To configure a receiver that needs, for instance, a private CA or a basic authentication with "password_file":

REF: https://prometheus.io/docs/alerting/latest/configuration/#http_config

4. List any affected packages or components.

Alertmanager

1. Proposed title of this feature request

Allow specifying secrets/configmaps in the Alertmanager StatefulSet of cluster monitoring.

2. What is the nature and description of the request?
Allow to add secrets (or additionally configMaps) in alertmanager configuration of user workloads monitoring.

3. Why does the customer need this? (List the business requirements here)

To configure a receiver that needs, for instance, a private CA or a basic authentication with "password_file":

REF: https://prometheus.io/docs/alerting/latest/configuration/#http_config

4. List any affected packages or components.

Alertmanager

Allow specifying secrets/configmaps in the Alertmanager StatefulSet of cluster monitoring.

2. What is the nature and description of the request?
Allow to add secrets (or additionally configMaps) in alertmanager configuration of cluster platform monitoring.

3. Why does the customer need this? (List the business requirements here)

To configure a receiver that needs, for instance, a private CA or a basic authentication with "password_file":

REF: https://prometheus.io/docs/alerting/latest/configuration/#http_config

4. List any affected packages or components.

Alertmanager

Epic Goal

  • Update OpenShift components that are owned by the Builds + Jenkins Team to use Kubernetes 1.26

Why is this important?

  • Our components need to be updated to ensure that they are using the latest bug/CVE fixes, features, and that they are API compatible with other OpenShift components.

Acceptance Criteria

  • Existing CI/CD tests must be passing

User Story

As a developer, I want my testing and build tooling to be managed in a consistent way, to reduce the number of context switches during maintenance work.

Background

Currently our approach to managing and updating auxiliary tooling (such as envtest, controller-gen, etc.) is inconsistent. A good pattern was introduced in the CPMS repo, which relies on the Go toolchain to update, vendor and run this auxiliary tooling.

For CPMS context see: 

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/tools/tools.go

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/go.mod#L24

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/Makefile#L19

Steps

  • Align envtest, controller-gen and other tooling management with the pattern introduced in the CPMS repo
  • Introduce an additional test that compares the envtest version (if envtest is in use within a particular repo) with the version of the k8s-related libraries in use. This will help us remember to update envtest and other aux tools during dependency bumps.
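The version comparison in the second step could look roughly like the sketch below. It relies on the fact that `k8s.io/*` modules at `v0.X.Y` correspond to Kubernetes `1.X`; everything else (function names, reading `go.mod` as text, the envtest version being given as a `1.X` string) is an assumption for illustration, and the real test would most likely be written in Go.

```python
import re


def k8s_minor_from_go_mod(go_mod: str, module: str = "k8s.io/api") -> int:
    """Return X from a 'k8s.io/api v0.X.Y' requirement in go.mod text."""
    match = re.search(
        rf"^\s*{re.escape(module)}\s+v0\.(\d+)\.\d+", go_mod, re.MULTILINE
    )
    if match is None:
        raise ValueError(f"{module} not found in go.mod")
    return int(match.group(1))


def envtest_minor(version: str) -> int:
    """Return X from an envtest Kubernetes version such as '1.X' or '1.X.Y'."""
    return int(version.split(".")[1])


def versions_aligned(go_mod: str, envtest_version: str) -> bool:
    """True when envtest targets the same Kubernetes minor as the vendored libraries."""
    return k8s_minor_from_go_mod(go_mod) == envtest_minor(envtest_version)
```

A failing `versions_aligned` check during a dependency bump would flag that envtest was left behind.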

Stakeholders

  • Cluster infra team

Definition of Done

  • All Cluster Infra Team owned repos are updated and use a consistent pattern for auxiliary tools management
    • REPO LIST TBD, raw below
    • MAPI providers
    • MAO
    • CCCMO
    • CMA
  • Testing
  • Existing tests should pass
  • additional test for checking envtest version should be introduced

User Story

As a developer, I want to have the most recent version of the testing framework, with all its newer features such as JUnit reporting.

Background

We use Ginkgo widely across our components. v1 was deprecated some time ago, so we need to update.

Steps

  • Migrate Ginkgo from v1 to v2 across Cluster Infra team repos
  • Enable JUnit reporting everywhere
  • Enable coverage everywhere

Stakeholders

  • Cluster infra team

Definition of Done

  • Ginkgo updated in
    • REPOS LIST TBD, raw:
    • MAPI providers
    • MAO
    • CCCMO
    • CMA
    • Autoscaler operator
  • JUnit reporting works and is available in our CI runs
  • Code coverage works and reports are available in our CI runs
  • Docs
  • -
  • Testing
  • Current unit tests should pass

Based on the updates in https://github.com/openshift/cluster-api-actuator-pkg/pull/258, we would like to update the test suites within this repository to use Ginkgo V2.

This will include updating the hack scripts to make sure that:

  • JUnit reports are being created correctly
  • Code coverage reports are generated

Description of problem:

Enable default sysctls for kubelet.

Overview

Today in the Dev Perspective navigation, we have pre-pinned resources in the "third section" of the vertical navigation. Customers have asked if there is a way for admins to provide defaults for those pre-pinned resources for all users.

Acceptance Criteria

  1. Add a section to the Cluster/Console menu for Developer pre-pinned resources
  2. Admin can define which nav items exist by default for all users & is able to re-order them
  3. These defaults should be used as the default pre-pinned items for all new users
  4. Users whose nav items have not been customized will inherit these defaults the next time they log in.

Exploration Results

Miro Board

Slack Channel

tbd

Description

As an admin, I want to show the pre-pinned resources on the developer perspective navigation to new users based on the customization.

Acceptance Criteria

  1. The pre-pinned resources defined by the admin should be used as the default pre-pinned items for all new users
  2. Users who have not customized the nav items will inherit these defaults the next time they log in; otherwise they will see the customized nav items saved in the user-settings

Additional Details:

Description

As an admin, I want to be able to use the existing form-driven experience to define the default pre-pinned resources.

Acceptance Criteria

  1. Add the `Pre-pinned navigation items` section under the `Developer` tab in the Customization form for setting the default pre-pinned resources
  2. Use the DualListSelector component. The list on the left should show the list of all the resources that can be added to the developer perspective navigation and the list on the right should have the resource(s) that the admin wants to show to all new users by default if the navigation is not customized
  3. Hide the Developer tab when the Developer perspective is disabled
  4. The console configuration CR should be updated as per the selected resource(s)
  5. Add intro text to provide a bit more discoverability in the following areas:
    1. Pre-pinned navigation items
    2. Add page
    3. Developer catalog

Additional Details:

Description

As an admin, I should be able to see a code snippet that shows how to define pre-pinned resources on the developer perspective navigation

Based on the https://issues.redhat.com/browse/ODC-7181 enhancement proposal, cluster admins can define the pre-pinned resources

To help the cluster admin configure the pre-pinned resources, the developer console should provide a code snippet for customizing the YAML resource (Console CRD).

Customize Perspective Enhancement PR: 

Acceptance Criteria

  1. When the admin opens the Console CRD, there is a snippet in the sidebar that provides default YAML to help the admin define pre-pinned resources

Additional Details:

Previous work:

  1. https://issues.redhat.com/browse/ODC-5080
  2. https://issues.redhat.com/browse/ODC-5449

Description

As an admin, I want to define the pre-pinned resources on the developer perspective navigation

Based on the https://issues.redhat.com/browse/ODC-7181 enhancement proposal, it is required to extend the console configuration CRD to enable the cluster admins to configure this data in the console resource

Acceptance Criteria

  1. Extend the "customization" spec type definition for the CRD in the openshift/api project

Additional Details:

Previous customization work:

  1. https://issues.redhat.com/browse/ODC-5416
  2. https://issues.redhat.com/browse/ODC-5020
  3. https://issues.redhat.com/browse/ODC-5447

Problem:

Address outstanding usability issues as well as implement some RFEs

Acceptance criteria:

  1. Allow users to see which pods are receiving traffic
    1. In Resources tab of side panel when showing pods
    2. In Pods tab of related resources ( Service, Deployment, Deployment Config )
    3. In Pods list view in context of a namespace OR in All Namespaces, the user should be able to select this column through the managed column component
    4. If a Pod is receiving traffic, note that on the Pod detail page
  2. Update Helm terminology to use Create/Delete Helm Release
    1. CTA should be Create on the Helm Chart side panel
    2. Title of form should be Create Helm Release
    3. CTA of form should be Create
    4. Update the empty state of the Helm Releases page
    5. Update the Uninstall CTA to be Delete
    6. Update the Uninstall Helm Chart flow to be Delete Helm Release

Dependencies (External/Internal)

None

Exploration

Miro board

Notes

Description

As a user, I want to see the overview of the pods receiving traffic

Acceptance Criteria

  1. In the Resources tab of the topology side panel when showing pods
  2. In the Pods tab of related resources ( Service, Deployment, Deployment Config)
  3. In Pods list view in context of a namespace OR in All Namespaces, the user should be able to select this column through the managed column component
  4. If a Pod is receiving traffic, show that on the Pod details page

Additional Details:

Spike

Acceptance Criteria

  1. Update Helm terminology to use Create/Delete Helm Release
  2. CTA should be Create on the Helm Chart side panel
  3. The title of the form should be Create Helm Release
  4. CTA of the form should be Create
  5. Update the empty state of the Helm Releases page
  6. Update the Uninstall CTA to be Delete
  7. Update the Uninstall Helm Chart flow to Delete Helm Release
  8. Please verify if any test case/e2e scenario needs to be updated

Additional Details:

Miro board

Note:
Reach out to UX for the text on the Helm Releases empty state

Here is our tech debt backlog: https://issues.redhat.com/browse/ODC-6711

See included tickets, we are trying to clean up some of our tech debt with this epic for 4.13.

Description

packages/dev-console/src/components/edit-deployment/EditDeployment.tsx is now also used for the creation flow, so the name is confusing.

Both URLs are defined in frontend/packages/dev-console/src/plugin.tsx, but they don't match our common URLs.

For pinned resources, the Create and Edit Flows don't mark the nav item as active.

  {
    type: 'Page/Route',
    properties: {
      exact: true,
      path: ['/edit-deployment/ns/:ns', '/create-deployment/ns/:ns/~new/form'],
      loader: async () =>
        (
          await import(
            './components/edit-deployment/EditDeploymentPage' /* webpackChunkName: "dev-console-edit-deployment" */
          )
        ).default,
    },
  },

Acceptance Criteria

  1. Rename edit-deployment folder to deployment ( or deployments ? )
  2. Rename EditDeployment* components to Deployment* (except EditDeployment.tsx for the moment)
  3. Rename utils/create-deployment-* utils/edit-deployment-* to utils/deployment-*
  4. Change routes from /create-deployment/... /edit-deployment/... to
    /k8s/ns/:ns/deployment/~new/form and
    /k8s/ns/:ns/deployment/:name/form
    /k8s/ns/:ns/deploymentconfigs/~new/form and
    /k8s/ns/:ns/deploymentconfigs/:name/form
  5. Convert the route from packages/dev-console/src/plugin.tsx to packages/dev-console/console-extensions.json

Additional Details:

Problem:

There's no way in the UI to visualize Service Bindings implemented via label selectors (Topology support for the SB Label Selector implementation).

Goal:

Visualize the Service Binding between a service with a label selector and the workloads that have labels matching that selector, shown as a Service Binding Connector.

Use cases:

  1. Users should be able to visualize service bindings that support label selector

Acceptance criteria:

As a user that has implemented Service binding via label selectors, I should be able to

  1. Visualize the connector in the topology view
  2. In Topology, the label selector associated with a SB should be shown in the side panel when a SB is selected
  3. The label selector associated with a SB should be shown in the SB details page
  4. Clicking on the Label Selector associated with the SB (in side panel or details view) should navigate to a list of all the connected resources (need more info re: which resources : D, DC - maybe KSVC? Helm Releases?)
  5. When deleting a Service Binding connector, the user should see a delete confirmation dialog which explains that all connectors will be deleted. (user should have ability to continue with Delete or Cancel out.). This message will be different than deleting the initial implementation of Service Binding resource

Dependencies (External/Internal):

  1. Patrick Knight will be doing the UI implementation; ODC will still provide an epic owner for guidance

Exploration:

Miro board

Notes

Service Binding Specs for K8S

APPSVC-978 SBO Label Selector Support in ODC

Description

As a user, I would like to see the binding connector between the source and target nodes in topology for service-binding implemented via label selectors

Acceptance Criteria

  1. A binding connector should be visualised for all application nodes that have the label specified in the label selector of the service binding CR.
  2. When deleting a Binding connector, the user should see a delete confirmation dialog which explains that all connectors will be deleted.
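A minimal sketch of how the set of connected nodes could be derived from a label selector, assuming simple equality-based `matchLabels` semantics. The function name and the plain dictionaries standing in for workload resources are hypothetical.

```python
from typing import Dict, List


def bound_workloads(
    match_labels: Dict[str, str], workloads: Dict[str, Dict[str, str]]
) -> List[str]:
    """Return names of workloads whose labels satisfy every selector entry.

    `workloads` maps a workload name to its labels. One connector would be
    drawn per returned name. Note that an empty selector matches everything.
    """
    return sorted(
        name
        for name, labels in workloads.items()
        if all(labels.get(key) == value for key, value in match_labels.items())
    )
```

Because a single service binding can match several workloads, deleting it removes every connector in the returned set, which is why the confirmation dialog needs to say that all connectors will be deleted.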

Additional Details:

Miro

Goal:

As a developer, I want the pipeline included in my repo (pipeline-as-code) to be configured automatically when I import my application from Git in dev console, so that I can import my application to OpenShift in one go.

Currently the developer gets an extra BuildConfig or pipeline generated for them, which is not needed if they already have a pipeline in their Git repo. They also have to import their Git repo once more through the pipeline views in order to configure pipelines-as-code for the pipeline that already exists in their repo.

Why is it important?

Reduce the extra steps for onboarding an application from Git repo which already has pipeline-as-code in their repo (e.g. re-importing the application from a different cluster).

Acceptance criteria:

  1. When developer imports an application from Git in Dev Console, if a pipeline exists in the .tekton directory, pipeline-as-code gets configured for that application
    1. No buildconfig should be created for the app
    2. "Add pipeline" checkbox should NOT be displayed in the import form

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, when importing an application via the import from git flow, and the git repo has a .tekton directory, I want to be able to configure PAC for my application.

Acceptance Criteria

  1. Remove Pipeline-as-Code from the import strategy section
  2. If Pipelines operator is installed and PAC is enabled and .tekton dir is detected in the git repo then show Add Pipelines checkbox checked by default.
  3. Below the checkbox show two radio buttons as Configure PAC (checked by default) and Add pipeline from this cluster(should be same as what we have today). 
  4. If Pipelines operator is installed and either PAC is not enabled or .tekton dir is not detected in the git repo then the UI should look like what we have today.
  5. Show application field in general section as today
  6. Show resources section as today
  7. Show advanced options as today
  8. On create an imagestream, workload(depending on the resource that is selected), buildconfig, repository and other resources (service/route) should be created like we do today
  9. On create the user should navigate to the topology view as today
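The conditions in criteria 2 to 4 can be summarised in a small decision function. This is a sketch only: the type, the field names and the layout identifiers are hypothetical, and the case where the Pipelines operator is not installed is assumed to keep the current UI.

```python
from typing import NamedTuple, Optional


class PipelineSection(NamedTuple):
    layout: str                              # "pac-choice" or "current" (hypothetical names)
    add_pipelines_checked: Optional[bool]    # None means: keep today's default
    default_radio: Optional[str]             # radio button selected by default, if any


def pipeline_section(
    pipelines_installed: bool, pac_enabled: bool, tekton_dir_detected: bool
) -> PipelineSection:
    """Decide how the pipeline part of the import form should be rendered."""
    if pipelines_installed and pac_enabled and tekton_dir_detected:
        # Checkbox checked by default, with "Configure PAC" preselected.
        return PipelineSection("pac-choice", True, "Configure PAC")
    # Any other combination keeps the form as it is today.
    return PipelineSection("current", None, None)
```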

Additional Details:

We also need to show some additional details like workload-name and imageURL if the user opts in to configure PAC. UX is not yet finalised for this. 

Radio button labels are also not yet finalised.

Follow up with PM for the above info.

Description

As a user, when importing an application via the import from git flow, and the git repo already has a pipeline in the .tekton directory, I want PAC to be configured automatically for my application.

Acceptance Criteria

  1. When user enters the repo url check for pipeline in .tekton directory of the repo
  2. If a pipeline exists and a repository CR is already available for the repo show a message to the user and disable the create button
  3. If a pipeline exists and a repository CR is not available for the repo update the import strategy
  4. Add the configuration options of the Add git repo form to import from git form under the import strategy section
  5. If a github app is already configured for the cluster show both `Use github app` and `Setup a webhook` options and if not then show only `Setup a webhook` option
  6. Hide application field from general section
  7. Hide resources section
  8. Hide pipelines section
  9. Hide advanced options
  10. When user clicks on create, create repo CR and navigate the user to the repo details page
  11. This feature should only be available when Pipelines operator is installed and PAC is enabled
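The branching in the criteria above can be sketched as one function. The function name, the dictionary keys and the mode identifiers are assumptions for illustration; only the two option labels come from the criteria.

```python
from typing import Dict, List, Union

State = Dict[str, Union[str, bool, List[str]]]


def pac_import_state(
    feature_available: bool,
    pipeline_in_repo: bool,
    repository_cr_exists: bool,
    github_app_configured: bool,
) -> State:
    """Decide what the import form shows once the repo URL has been checked.

    feature_available stands for: Pipelines operator installed and PAC enabled.
    """
    if not feature_available or not pipeline_in_repo:
        return {"mode": "regular-import", "create_enabled": True}
    if repository_cr_exists:
        # A Repository CR already exists: show a message and block creation.
        return {"mode": "already-configured", "create_enabled": False}
    options = ["Setup a webhook"]
    if github_app_configured:
        options.insert(0, "Use github app")
    return {"mode": "configure-pac", "create_enabled": True, "options": options}
```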

Additional Details:

Miro - https://miro.com/app/board/uXjVOoSAPbA=/?moveToWidget=3458764537704937720&cot=14

Exploration is not yet complete.

UX is not yet decided.

This might be a separate form and we might need to add a new card on the add page ?

Problem:

Currently users are experiencing a number of problems when installing some helm charts.  In some cases, installation fails.  In other cases, the UI says there is an error, but eventually the Helm Release is created.  We are seeing similar issues when helm releases are being uninstalled/deleted.

Goal:

Identify the issues with the helm chart installation process and determine possible solutions.

Why is it important?

Users aren't always able to create or delete Helm Releases, due to a timeout.  The error messages are not helpful, and there is no way for the users to proceed.

Acceptance criteria:

  1. Helm chart creation should be asynchronous. If workloads are created:
    1. Navigate directly to the Topology view, create the Helm Release grouping
    2. The Helm Release grouping must have a visual cue to indicate the status of the Helm Release
  2. Helm chart creation should be asynchronous. If no workloads are created:
    1. Navigate directly to the Helm Releases view
    2. Add a column to indicate the status of the Helm Release
  3. Helm deletion should be asynchronous. When deleting a Helm Release, the modal should be dismissed immediately so that users can continue with their tasks.

NOTES
Currently Helm Releases which do not include workloads are not shown in the Topology List/Graph view, but ARE shown in the Helm Releases view.

Dependencies (External/Internal):

  • Helm POC results and the necessary APIs need to be available by Dec 30th in order to be delivered in 4.13

Exploration:

Miro board

Note:

  • Helm is doing a POC; let's evaluate the results before we commit this to 4.13
  • Exploration, Spikes & dependencies need to be complete by end of Milestone 1 (Dec 30th)
  • Some Helm Charts don't actually create any workloads, so we need to investigate how we would handle that with the new flow

Owner: Architect:

Kartikey Mamgain

Story (Required)

As an OpenShift user, I should be able to install/upgrade Helm Releases asynchronously, with the endpoint returning the secret name along with a 201 status code. The install/upgrade action should continue to run in the background.

Background (Required)

The Helm ODC frontend makes a call to the Helm backend and waits for the install call to finish. In most cases this is OK; however, as charts become more complex and contain more dependencies, it is increasingly possible that the install call will not finish before the browser times out the GET request. To solve this problem, we can treat Helm install/upgrade as an asynchronous operation. Instead of returning the release information to the frontend, we would return the secret name, which can then be tracked to obtain the status.

Glossary

<List of new terms and definition used in this story>

Out of scope

Frontend Changes

E2E Changes 

In Scope

Backend code changes

  • Unit Test on actions package and handlers
  • api change on install release and upgrade release endpoint
  • oc helm cli changes

Approach(Required)

We would need to modify the install/upgrade endpoint to return the secret which is created to track the Helm release. As soon as a Helm Release starts being installed/upgraded, a secret gets created. This secret carries labels for the owner (equal to helm), the release name and the release revision. We need to return this secret name to the UI.
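
As a rough illustration of the approach above, the sketch below derives the name and labels of the secret that tracks one release revision. It assumes Helm 3's default secret storage convention (a name of the form `sh.helm.release.v1.<release>.v<revision>` plus `owner`, `name` and `version` labels). The function names are hypothetical and are not taken from the console backend.

```python
def release_secret_name(release: str, revision: int) -> str:
    """Name of the secret Helm 3 creates for one revision of a release
    (assumes the default 'secrets' storage driver)."""
    return f"sh.helm.release.v1.{release}.v{revision}"


def release_secret_labels(release: str, revision: int) -> dict:
    """Labels described in the story: owner=helm, release name, revision."""
    return {"owner": "helm", "name": release, "version": str(revision)}


def label_selector(labels: dict) -> str:
    """Render labels as a Kubernetes label selector string."""
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


if __name__ == "__main__":
    print(release_secret_name("gitlab", 1))
    print(label_selector(release_secret_labels("gitlab", 1)))
```

A client that receives the secret name can then watch that secret (or query by the selector) to follow the release status.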

Demo requirements(Required)

Demo with oc-helm plugin (install, upgrade and list)

Dependencies

None

Edge Case

None

Acceptance Criteria

  • We should be able to install/upgrade complex Helm Charts which take a long time to install. Try to install the https://artifacthub.io/packages/helm/gitlab/gitlab/6.5.4 Helm chart, which takes a few minutes to install.
  • The release should be shown in pending-install state in the list API call using oc-helm.
  • We should be able to demo the above-mentioned scenario of a release being listed as pending-install and then moving to the deployed state. We can use the gitlab Helm Chart for reference.
  • Verify the changes with oc helm plugin and make changes to oc helm plugin to incorporate the changes to response body.

Development: Yes

QE:
Documentation: No

Upstream: Not Applicable

Downstream: Not Applicable

Release Notes Type: <New Feature/Enhancement/Known Issue/Bug fix/Breaking change/Deprecated Functionality/Technology Preview>

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated

Description

As a user, I should not see a timeout error if a helm chart takes a long time to install.

Acceptance Criteria

  1. Use the api `/api/helm/release/async` for installing and upgrading helm charts
  2. When creating a Helm Release:
    1. If the release's chart generates no workloads, navigate directly to the Helm Release details page
    2. Otherwise, navigate the user to the Topology page with the release notes tab (if it exists) opened in the side panel
  3. Provide the Helm Release grouping with its creation status
  4. Disable the "Upgrade" and "Rollback" options in the context menu and in the Actions dropdown on the side panel

Additional Details:

 

Notes:
Talk to PM/UX about how we visualize the Helm Release status in the Topology graph view.

 

Description

As a user, I want to see the correct status on the list page and on the details page when a helm release is installed.

Acceptance Criteria

  1. Show the status of the helm release in the correct format on the helm releases list page and on the details page next to the release name and on the revision history page
  2. Add status information under the details tab
  3. Disable the "Upgrade" and "Rollback" options in the kebab menu in the list page and Actions dropdown on the details page when the installation is pending
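
The third criterion can be sketched as a small status check. This is only an illustration: the action labels are placeholders, and treating all three of Helm's pending statuses (`pending-install`, `pending-upgrade`, `pending-rollback`) as "pending" is an assumption, since the story only mentions a pending installation.

```python
# Helm release statuses that indicate an operation is still in flight.
# Treating all three as "pending" is an assumption made for this sketch.
PENDING_STATUSES = {"pending-install", "pending-upgrade", "pending-rollback"}

# Placeholder action labels; the real console menu entries may differ.
ALL_ACTIONS = ["Upgrade", "Rollback", "Delete"]
BLOCKED_WHILE_PENDING = {"Upgrade", "Rollback"}


def available_actions(status: str) -> list:
    """Return the actions that stay enabled for a release in `status`."""
    if status in PENDING_STATUSES:
        return [a for a in ALL_ACTIONS if a not in BLOCKED_WHILE_PENDING]
    return list(ALL_ACTIONS)


if __name__ == "__main__":
    for status in ("pending-install", "deployed"):
        print(status, available_actions(status))
```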

Additional Details:

https://helm.sh/docs/helm/helm_status/

Owner: Architect:

Kartikey Mamgain

Story (Required)

As an OpenShift user, I should be able to uninstall Helm Releases asynchronously, with the endpoint returning the secret name along with a 201 status code. The uninstall action should continue to run in the background. In the developer sandbox we see the APIs taking a long time, and hence we are getting timeout errors too.

Background (Required)

The Helm ODC frontend makes a call to the Helm backend and waits for the uninstall call to finish. In most cases this is OK; however, as charts become more complex and contain more dependencies, it is increasingly possible that the uninstall call will not finish before the browser times out the DELETE request. To solve this problem, we can treat Helm uninstall as an asynchronous operation. Instead of returning the release information to the frontend, we would return the secret name, which can then be tracked to obtain the status.

Glossary

<List of new terms and definition used in this story>

Out of scope

Frontend Changes

E2E Changes 

In Scope

Backend code changes

  • Unit Test on actions package and handlers
  • api change on uninstall release endpoint.
  • oc helm cli changes

Approach(Required)

We would need to modify the uninstall endpoint to return the secret which is getting created to track the Helm release. This secret has labels as owner equal to helm, release name and the release revision. We need to return this secret name to ui.

We would be adding an endpoint /api/helm/release/async DELETE to the console backend.

The uninstall command would run in a goroutine while we return the secret to the UI. We do have one dependency on the UI: the revision of the release must be sent so that the secret can be fetched. This step can be covered at the time of integration.

We would need to add an option to oc helm cli as uninstall-async.
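
The "start the work in the background, answer immediately" pattern described above can be sketched as follows. The backend itself would use a goroutine; this illustration uses a Python thread instead. The handler name, the response body shape and the stand-in `slow_uninstall` function are all hypothetical, and the secret name assumes Helm 3's `sh.helm.release.v1.<release>.v<revision>` convention.

```python
import threading
import time


def slow_uninstall(release: str, done: threading.Event) -> None:
    """Stand-in for the real uninstall, which may take minutes."""
    time.sleep(0.05)
    done.set()


def handle_async_uninstall(release: str, revision: int, uninstall=slow_uninstall):
    """Kick off the uninstall in the background and respond right away.

    The caller must supply the revision so the tracking secret can be named.
    Returns (status code, response body, completion event).
    """
    done = threading.Event()
    worker = threading.Thread(target=uninstall, args=(release, done), daemon=True)
    worker.start()
    secret_name = f"sh.helm.release.v1.{release}.v{revision}"
    return 201, {"secret_name": secret_name}, done


if __name__ == "__main__":
    status, body, done = handle_async_uninstall("gitlab", 3)
    print(status, body)          # available before the uninstall finishes
    done.wait(timeout=5)         # the work completes in the background
```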

Demo requirements(Required)

Demo with the oc-helm plugin.

Dependencies

None

Edge Case

None

Acceptance Criteria

  • We should be able to uninstall complex Helm Charts which take a long time to uninstall. Try to uninstall the https://artifacthub.io/packages/helm/gitlab/gitlab/6.5.4 Helm chart, which takes over a minute to uninstall.
  • The release should be shown in uninstalling state in the list API call using oc-helm.
  • We should be able to demo the above-mentioned scenario of a release being listed as uninstalling and eventually being deleted from the list releases page. We can use the gitlab Helm Chart for reference.
  • Verify the changes with oc helm plugin and make changes to oc helm plugin to incorporate the changes to response body.

Development: Yes

QE:
Documentation: No

Upstream: Not Applicable

Downstream: Not Applicable

Release Notes Type: <New Feature/Enhancement/Known Issue/Bug fix/Breaking change/Deprecated Functionality/Technology Preview>

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated

Problem

Provide a seamless experience for a developer to provide the code repo where their serverless function is defined, and have it deployed onto OpenShift, and visualized as a serverless function in Topology.

Acceptance Criteria

  1. User can create a serverless function through the Import from Git flow, using the S2I flow.
  2. The Create Serverless flow should use the same import from git flow, with the following exceptions
    1. Do not show the Resource types
    2. Change title 
  3. This method is only available for the shipped runtimes, except for GO
  4. Add a new Serverless functions card to Add page to support this new flow, displayed on an independent card (no title/header/grouping)
  5. In existing Import from Git flow, allow the user to change Resource type if they wish (let's verify we want this) 
  6. Allow users to view and possibly edit the environment variables which are shown in func.yaml
  7. Users are able to visually identify the serverless function in the topology view (function.knative.dev label must be added to KSVC)
  8. Add Telemetry
    1. Serverless functions card
    2. Upon clicking create, identify that a serverless function is being created

Spikes

  1. ODC team will start with a spike & share the results with the larger ODC/Serverless team.  Need to use the func.yaml
  2. ODC team will investigate how to handle the environment variables which are stored in the func.yaml, share the results with the larger ODC/Serverless team, and make sure this is something we would want to support.

MIRO

Questions

  • If func.yaml is in the git repo, annotate it properly ... and don't show Resource Options, it should be Serverless Deployment
  • Need to determine how this would work with customer provided builder images

Description

As a user, I want to import a Git repository with func.yaml and create a Serverless function

Acceptance Criteria

  1. Detect func.yaml in the Git repo when the user enters the Git URL
  2. Read runtime, builder, buildEnvs and run envs value from the func.yaml. func.yaml eg. - https://github.com/vikram-raj/hello-func-node/blob/master/func.yaml
  3. Proceed with this flow if the builder value is s2i
  4. Allow users to view and possibly edit the environment variables which are shown in func.yaml. Add the buildEnvs and run envs to the resource 
  5. If func.yaml is detected, change the strategy to Serverless function, do not show builder images, and provide a section which allows the user to select the runtime version
  6. Do not show the Pipeline and Resources section. Default to Serverless Deployment
  7. Add label function.knative.dev: 'true' to KSVC to visualize the Serverless function in the Topology
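
The detection steps above can be sketched as a function over an already-parsed func.yaml. This is a simplified illustration: it assumes the file has been loaded into a dictionary with flat `runtime`, `builder`, `buildEnvs` and `envs` keys (the real layout may nest these differently), and the returned "plan" structure is hypothetical.

```python
def plan_serverless_import(func_yaml):
    """Decide how to import a repo, given its parsed func.yaml (or None).

    Returns None when the Serverless function flow does not apply,
    i.e. no func.yaml was found or its builder is not s2i.
    """
    if not func_yaml or func_yaml.get("builder") != "s2i":
        return None
    return {
        "strategy": "Serverless function",
        "resource": "Serverless Deployment",   # default, no Resources section
        "runtime": func_yaml.get("runtime"),
        "build_envs": list(func_yaml.get("buildEnvs") or []),
        "run_envs": list(func_yaml.get("envs") or []),
        # Label that lets Topology render the KSVC as a Serverless function.
        "labels": {"function.knative.dev": "true"},
    }


if __name__ == "__main__":
    example = {
        "runtime": "node",
        "builder": "s2i",
        "buildEnvs": [{"name": "BP_NODE_RUN_SCRIPTS", "value": "build"}],
        "envs": [{"name": "GREETING", "value": "hello"}],
    }
    print(plan_serverless_import(example))
```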

Additional Details:

MIRO
Serverless function doc - https://docs.openshift.com/dedicated/serverless/functions/serverless-functions-yaml.html#serverless-functions-func-yaml-environment-variables_serverless-functions-yaml

SPIKE doc https://docs.google.com/document/d/1O0lP0UMIMxJT2ja8t78DahuqdWuUHgl0gGvyj5NgIxE/edit?pli=1#

 

ODF LVMO is moving to the VE team in OCP 4.13 and will no longer be part of the ODF product. The operator will be renamed to Logical Volume Manager Storage.

In order to make the user experience seamless between ODF LVMO 4.12 and Logical Volume Manager Storage in 4.13, it has been decided to change the name of the operator in 4.12 to Logical Volume Manager Storage.

 

Changes required:

  1. CSV Fields (name, description, displayName etc)
  2. Labels: odf-lvm-provisioner
  3. storageclass and volumesnapshotclass names
  4. SCC names
  5. Documentation
  6. image names for odf-lvm-operator, odf-topolvm, odf-lvmo-must-gather

We need to update it in both the BE and the UI

LVMO = logical volume manager operator
LVMS = logical volume manager storage

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As an SRE, I want the hypershift operator to expose a metric when the hosted control plane is ready.

This should allow SRE to tune (or silence) alerts occurring while the hosted control plane is spinning up. 

 

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The Kube APIServer has a sidecar to output audit logs. We need similar sidecars for other APIServers that run on the control plane side. We also need to pass the same audit log policy that we pass to the KAS to these other API servers.

Epic Goal

  • Ensure that expected number of Windows nodes in a cluster exist in a usable state.

Why is this important?

  • As an OpenShift cluster administrator, I expect that the number of usable Windows nodes is always equal to the number I have specified. I do not want to have to keep track of the state of the nodes myself, checking whether they have entered an unusable state.
  • Windows nodes should be resilient and not require manual intervention to fix small issues.

Scenarios

  1. A Windows node has a Kubernetes node binary crash. A controller running on the Node will recognize this and work to return the Windows node to a working state. If this is not possible, an event is generated alerting the cluster administrator to the issue.
  2. A Windows node configured from a Machine enters an unrecoverable state. Remediation of the node can be left to user defined MachineHealthChecks.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Engineering Details

  • This should take the form of a Windows service, replacing WMCB.
  • The expected state of Windows services on a node should be given by a ConfigMap managed by WMCO.
  • This epic will result in much of the work WMCO is doing being moved onto the nodes being configured. This will allow WMCO to scale better as the number of Windows nodes in a cluster rises.

Dependencies (internal and external)

  1. ...

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description

WICD logs should be collected by must-gather runs to facilitate customer issue debugging.

Acceptance Criteria

  • WICD logs are collected by must-gather

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • As an Agent team member, I need a stable CI so that I can test and merge my work quickly and easily.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Review / Investigate what is causing CI to break
  • Internal
    • Create more debug information to help with troubleshooting
    • Break down CI into smaller components to allow addition of failure conditions
  • External
    • Work with others to determine how often tests run and request changes as necessary
    • For items which have been consistently broken, reach out to ART to get tests adjusted to be optional
    •  
  • Prioritize and add current backlog items to CI Coverage in every sprint

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

In case it should be used for publishing artifacts in CI jobs.

Look into whether the following things are leaked:

  • pull secret
  • ssh key
  • potentially values in journal logs

 [WIP]

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

A new metric [1] was introduced in Kubernetes 1.23 for the kube-apiserver latency which doesn't take into account the time spent processing webhooks. This is meant to give a more accurate view of the SLO of the kube-apiserver and as such we should switch to it.

Note that this metric was renamed in 1.26 [2]; the original one was deprecated and will be removed in 1.27.

[1] https://github.com/kubernetes/kubernetes/pull/105890
[2] https://github.com/kubernetes/kubernetes/pull/112679
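
As an illustration of the switch, the sketch below picks a latency metric name by Kubernetes minor version. The metric names are recalled from upstream Kubernetes (`apiserver_request_slo_duration_seconds`, later renamed to `apiserver_request_sli_duration_seconds`) and should be verified against the linked PRs; the helper itself is hypothetical.

```python
def apiserver_latency_metric(kube_minor: int) -> str:
    """Pick the kube-apiserver latency metric for a Kubernetes 1.<minor> release.

    Metric names are assumptions recalled from upstream and should be
    checked against the referenced PRs before use.
    """
    if kube_minor >= 26:
        # Renamed metric; the older name is deprecated from here on.
        return "apiserver_request_sli_duration_seconds"
    if kube_minor >= 23:
        # Excludes time spent processing webhooks.
        return "apiserver_request_slo_duration_seconds"
    # Older releases only expose the metric that includes webhook time.
    return "apiserver_request_duration_seconds"


if __name__ == "__main__":
    for minor in (22, 25, 26):
        print(f"1.{minor}: {apiserver_latency_metric(minor)}")
```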

Story: As an OpenShift admin managing multiple disconnected clusters, I want to be able to disable oc-mirror's pruning behavior so that I can retain images in the registry in the event that related operator releases were accidentally deleted by the maintainers upstream.

Acceptance criteria:

  • a global switch to disable oc-mirror pruning of images that are no longer referenced in the catalog / release stream that is supposed to be mirrored

This epic contains all the Dynamic Plugins related stories for OCP release-4.14 and implementing Core SDK utils.

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

Background

AC:

Placeholder epic to track spontaneous tasks which do not deserve their own epic.

AC:

We have a connectDirectlyToCloudAPIs flag in the konnectivity socks5 proxy to dial cloud providers directly without going through konnectivity.

This introduces another path for exceptions: https://github.com/openshift/hypershift/pull/1722

We should consolidate both by continuing to use connectDirectlyToCloudAPIs until there's a reason not to.

 

DoD:

If we change a NodePool from having .replicas to autoscaler min/max and set a min beyond the current replicas, that might leave the machineDeployment in a state not suitable for autoscaling. This requires the consumer to ensure the min is <= the current replicas, which is poor UX. Ideally, we should be able to automate this.
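
One possible way to automate this, shown purely as a sketch and not as HyperShift's implementation, is to clamp the current replica count into the autoscaler's range before handing control to the autoscaler. The function name is hypothetical.

```python
def replicas_for_autoscaling(current: int, min_nodes: int, max_nodes: int) -> int:
    """Replica count to apply when switching a pool from fixed replicas
    to autoscaling, so that it always starts inside [min_nodes, max_nodes]."""
    if min_nodes > max_nodes:
        raise ValueError("autoscaler min must not exceed max")
    return max(min_nodes, min(current, max_nodes))


if __name__ == "__main__":
    # Pool currently has 2 replicas, autoscaler min is set to 3.
    print(replicas_for_autoscaling(current=2, min_nodes=3, max_nodes=6))
```

With this, a consumer could set a min above the current replicas without first scaling the pool by hand.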

DoD:

This feature is supported by ROSA.

To have an e2e to validate publicAndPrivate <-> Private in the presubmits.

ServicePublishingStrategy entries of type LoadBalancer or Route could specify the same hostname, which will result in one of the services not being published, i.e. no DNS records are created.
context: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1678287502260289
 
DOD:
Validate ServicePublishingStrategy and report conflicting services hostnames
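
A minimal sketch of such a validation is shown below. It works on a simplified, flattened input (a list of dictionaries with hypothetical `service`, `type` and `hostname` keys) rather than the real HostedCluster API types.

```python
from collections import defaultdict


def conflicting_hostnames(services):
    """Map each hostname used by more than one service to those services.

    `services` is a simplified list such as
    {"service": "APIServer", "type": "Route", "hostname": "api.example.com"}.
    Entries without a hostname are ignored.
    """
    by_host = defaultdict(list)
    for entry in services:
        hostname = entry.get("hostname")
        if hostname:
            by_host[hostname].append(entry["service"])
    return {host: names for host, names in by_host.items() if len(names) > 1}


if __name__ == "__main__":
    strategies = [
        {"service": "APIServer", "type": "LoadBalancer", "hostname": "hc.example.com"},
        {"service": "OAuthServer", "type": "Route", "hostname": "hc.example.com"},
        {"service": "Konnectivity", "type": "Route", "hostname": "kn.example.com"},
    ]
    print(conflicting_hostnames(strategies))
```

A non-empty result could then be reported as a validation failure that names the conflicting services.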

Once the HostedCluster and NodePool are paused using the PausedUntil statement, the awsprivatelink controller continues reconciling, although it should stop.

 

How to test this:

  • Deploy a private cluster
  • Put it in pause once deployed
  • Delete the AWSEndPointService and the Service from the HCP namespace
  • Wait for a reconciliation; the expected result is that they are not recreated
  • Unpause it and wait for recreation.
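
The check a controller would need before reconciling can be sketched as follows. The sketch assumes that the PausedUntil value is either the string "true" (paused indefinitely) or an RFC 3339 timestamp (paused until that time); this is an assumption about the field's semantics, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def is_paused(paused_until, now=None) -> bool:
    """Return True if reconciliation should be skipped.

    Assumes `paused_until` is None/empty, a boolean string, or an
    RFC 3339 timestamp such as "2030-01-01T00:00:00Z".
    """
    if not paused_until:
        return False
    value = paused_until.strip()
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    try:
        deadline = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False  # unparseable values are treated as "not paused" here
    if deadline.tzinfo is None:
        deadline = deadline.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now < deadline


def should_reconcile(paused_until, now=None) -> bool:
    """A paused cluster must not have its resources recreated."""
    return not is_paused(paused_until, now)


if __name__ == "__main__":
    print(should_reconcile("true"))
    print(should_reconcile(None))
```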

AWS has a hard limit of 100 OIDC providers globally. 
Currently each HostedCluster created by e2e creates its own OIDC provider, which results in hitting the quota limit frequently and causing the tests to fail as a result.

 
DOD:
Only a single OIDC provider should be created and shared between all e2e HostedClusters. 

DoD:

At the moment if the input etcd kms encryption (key and role) is invalid we fail transparently.

We should check that both key and role are compatible/operational for a given cluster and fail in a condition otherwise

Definition of done

In SaaS, allow users of assisted-installer UI or API, to install any published OCP version out of a supported list of x.y options.
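
The "supported list of x.y options" check can be illustrated with a small helper. This is a sketch only: the function name and the version list are made up for the example, and it does not verify that the version was actually published.

```python
def is_installable(version: str, supported_minors) -> bool:
    """Check that a full version such as "4.12.7" belongs to one of the
    supported x.y streams, e.g. {"4.11", "4.12"}."""
    parts = version.split(".")
    if len(parts) < 3:
        return False  # expect at least x.y.z
    return ".".join(parts[:2]) in supported_minors


if __name__ == "__main__":
    supported = {"4.11", "4.12", "4.13"}  # illustrative list
    print(is_installable("4.12.7", supported))
    print(is_installable("4.9.3", supported))
```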

Feature Origin

The feature probably originates from our own team. It will enhance the workflow we currently follow to allow users to selectively install versions in assisted-installer SaaS.

Until now we had to be contacted by individual users to allow a specific version (usually, it was replaced by us with a newer version). In this case, we would add this version to the relevant configuration file.

Feature usage

It's not possible to quantify the relevant numbers here, because users might be missing certain versions in assisted and just give up the usage of it. In addition, it's not possible to know if users intended to use a certain "old" version, or if it's just an arbitrary decision.

Osher De Paz can we know how many requests we had for "out-of-supported-list"?

Feature availability

It's essential to include this feature in the UI. Otherwise, users will get very confused about the feature parity between API and UI.

Osher De Paz there will always be features that exist in the API and not in the UI. We usually show in the UI features that are more common and we know that users will be interacting with them.

Why is this important?

  • We need a generic way of using a specific OCP version on the cloud and also for other platforms by the user

Scenarios

  1. We will need to add validation that the images for this version exist before installation.
  2.  

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Not sure how we will handle this requirement in a disconnected environment
  2.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: Test Plan
  • QE - Manual execution of the feature - done
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Allow all the components of assisted to work correctly when they are running on a different cluster from the one where our CRs exist.

Why is this important?

  • Support the effort in https://issues.redhat.com/browse/ACM-1369
  • When handling huge numbers of managed clusters, the number of objects in etcd becomes a bottleneck. To solve this, multiple hypershift control planes will be deployed just to store CRs which will define additional (real) clusters to deploy.

Epic Goal

  • Have each host run a service as soon as possible after booting from disk and contact the assisted-service to:
    • Change the installation stage to a new "Booted" stage (between Configuring and Joined).
    • Upload logs once the host installation has progressed or failed.  This can be known either locally or by contacting the assisted-service.  The logs should include journal, network config, and any other useful logs.
  • The service must be cleaned up once it has completed its work (either by cleaning itself or by using some other mechanism - MCO, assisted-controller, etc.)

Why is this important?

  • There are cases where the host pulls ignition but doesn't get to a point where the assisted-controller can run and provide us with more information.  This is especially painful when debugging SNO installations.

Scenarios

  1. SNO
  2. Multi-node
    1. Bootstrap
    2. Non-bootstrap master
    3. Worker

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Currently, we poll events from the assisted service, enrich them, and push them to Elastic in the event scrape service.
    To also support sending events from on-prem environments, we need to remodel the data pipelines towards a push-based model. Since the SaaS environment will benefit from this approach as well, we'll seek a model that is as unified as possible

Why is this important?

  • Support on-prem environments
  • Increase efficiency (we'll stop performing thousands of requests per minute to the SaaS)
  • Enhance resilience (right now if something fails, we have a relatively short time window to fix it before we lose data)

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Make a decision on what design to implement (internal)
  2. Authorization with pull-secret (TBD, is there a ticket for this? Oved Ourfali )
  3. RH Pipelines RHOSAK implementation

Previous Work (Optional):

  1. First analysis
  2. We then discussed the topic extensively: Riccardo Piccoli Igal Tsoiref Michael Levy liat gamliel Oved Ourfali Juan Hernández 
  3. We explored already existing systems that would support our needs, and we found that RH Pipelines almost exactly matches them:
    • Covers auth needed from on prem to the server
    • Accepts HTTP-based payload and files to be uploaded (very handy for bulk upload from on-prem)
    • Lacks routing: limits our ability to scale data processing horizontally
    • Lacks infinite data retention: the original design has kafka infinite retention as key characteristic
  4. We need to evaluate requirements and options we have to implement the system. Another analysis with a few alternatives

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Roadmap

  • Stream events from service to kafka
  • Enable feature flag hiding this feature in staging to gather data
  • Read events and project them to elasticsearch
  • Process on-prem events and re-stream them into the kafka stream
  • Adapt CCX export

We should send events to a kafka stream (to be defined, see MGMT-11245 )

 

We need to make sure messages are below 1MB (kafka limit):

  • split cluster and host events
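
Splitting cluster and host data into separate messages, each checked against the size limit, could look like the sketch below. The message envelope (`type` and `payload` keys), the function name and the exact byte limit are assumptions made for illustration.

```python
import json

# Conservative interpretation of the 1MB limit mentioned above.
MAX_MESSAGE_BYTES = 1_000_000


def split_cluster_and_host_events(cluster_event, host_events, limit=MAX_MESSAGE_BYTES):
    """Encode one message for the cluster and one per host.

    Raises ValueError if any single message still exceeds `limit` bytes.
    """
    messages = [{"type": "cluster", "payload": cluster_event}]
    messages += [{"type": "host", "payload": host} for host in host_events]
    encoded = [json.dumps(message).encode("utf-8") for message in messages]
    for raw in encoded:
        if len(raw) > limit:
            raise ValueError(f"message of {len(raw)} bytes exceeds limit of {limit}")
    return encoded


if __name__ == "__main__":
    out = split_cluster_and_host_events({"id": "c1"}, [{"id": "h1"}, {"id": "h2"}])
    print(len(out), "messages")
```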

 

 

Fire cluster state events, infra env only when needed

Embed component versions in each event, as it wouldn't be trivial to associate it (multiple pods, on-prem, etc)

 

 

Epic Goal

  • Enable ZTP converged flow in 4.12 and 4.12 test coverage

Why is this important?

Scenarios

  1. Installing an OCP cluster using ZTP flow with ACM/MCE running on 4.12 HUB cluster 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. https://issues.redhat.com/browse/METAL-256 

Previous Work (Optional):

  1. https://github.com/openshift/assisted-service/pull/3815
  2. https://github.com/openshift/cluster-baremetal-operator/pull/279

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

In some cases, customers might install clusters without various operators, for example without an openshift console.

In such cases, we'll still be waiting for the console operator to complete, which will never happen, hence we'll eventually fail the installation.

We should consider adding logic to address that.

See context in https://coreos.slack.com/archives/C032HSVS71T/p1663107025002509

 

Currently the assisted installer waits for the console operator to be available before considering the cluster ready. This doesn't work if the console has been explicitly disabled by the user, so we need to change the controller so it doesn't wait in that case.
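
The intended change can be sketched as filtering the list of operators the controller waits for. The operator names and the `console_enabled` flag are placeholders; how the controller actually learns that the console is disabled is not specified here.

```python
def operators_to_wait_for(monitored, console_enabled: bool):
    """Return the operators whose availability gates cluster readiness.

    When the console has been disabled, the "console" operator is skipped
    because it would never become available.
    """
    return [name for name in monitored if name != "console" or console_enabled]


if __name__ == "__main__":
    monitored = ["console", "cvo"]  # illustrative names
    print(operators_to_wait_for(monitored, console_enabled=False))
```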

Epic Goal

Our current flow includes the following way of interaction between assisted-test-infra and assisted-service:

 

This task suggests changing the way we interact with assisted-service through Python by consuming the openapi specification from a live service and generating all relevant models / requests from it.

Why is this important?

There are multiple shortcomings with the current approach:

  • Building the assisted-service image includes a redundant stage: it doesn't advance us in building the service but rather in e2e-testing it. This means more time is invested in building this image on actions that don't involve those artifacts. In addition, more disk space is used when locally testing assisted-service (because of the added image layers)
  • There are two sources of truth when testing customized images: you have whatever is installed on the relevant env, and you have to remember to patch test-infra with the right service image.

Scenarios

  1. e2e tests in CI
    1. Either way will be covered pretty well on the process of changing assisted-test-infra
  2. QE's tests
    1. We should keep both methods available until the migration is complete, so that we can fall back to the old way at any time
    2. Phase 1 API testing / integration env testing - enabling the new client for those jobs
    3. Staging / production envs API testing - same. Enabling the new client when the service is adequate enough to have this ability.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • We will remove client generation only once the last trace of its use has been removed

Dependencies (internal and external)

Work will mainly be done in:

  • assisted-service repo
  • assisted-test-infra repo
  • openshift/release repo (to set "feature-flag" of the new client)
  • QE's repos like kni-assisted-installer-auto (if it doesn't abstract away the internal swagger models), ocp-edge (to set "feature-flag" of the new client)

Previous Work (Optional):

  1. We previously worked to publish the Python client to PyPI. I cannot see how we can match the Python client version with the relevant assisted-service instance, especially in cases of custom / untagged images.

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Enable support of a BF2 DPU as a worker node

Why is this important?

  • To achieve a two-cluster design in which DPUs act as worker nodes of a virtualized control plane

We were talking about 2 options:

  1. Add an API that allows skipping certain validations - this was judged too risky for now
  2. Add a HW configuration that matches the BlueField card - we called it edge-worker
    1. In this case we check if host is DPU and go to edge-worker HW validation
    2. Relevant only for workers
    3. This change will not require any UI changes and should be relatively small
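
The second option amounts to a small branch in profile selection. The following is a minimal sketch under assumptions: the function name, the `is_dpu` flag, and the use of the role name as the default profile are hypothetical and only illustrate the rule that DPU workers get the edge-worker validation.

```python
def select_hw_profile(role, is_dpu):
    """Pick the hardware-validation profile for a host.

    Only workers running on a DPU get the relaxed "edge-worker" profile;
    every other host keeps the profile that matches its role.
    """
    if role == "worker" and is_dpu:
        return "edge-worker"
    return role


print(select_hw_profile("worker", True))
```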

Manage the effort for adding jobs for release-ocm-2.7 on assisted installer

https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng

 

Merge order:

  1. Add temporary image streams for Assisted Installer migration - day before (make sure images were created)
  2. Add Assisted Installer fast forwards for ocm-2.x release <depends on #1> - need approval from test-platform team at https://coreos.slack.com/archives/CBN38N3MW 
  3. Branch-out assisted-installer components for ACM 2.(x-1) - <depends on #1, #2> - At the day of the FF
  4. Prevent merging into release-ocm-2.x - <depends on #3> - At the day of the FF
  5. Update BUNDLE_CHANNELS to ocm-2.x on master - <depends on #3> - At the day of the FF
  6. ClusterServiceVersion for release 2.(x-1) branch references "latest" tag <depends on #5> - After  #5
  7. Update external components to AI 2.x <depends on #3> - After a week, if there are no issues update external branches
  8. Remove unused jobs - after 2 weeks

 

1. Proposed title of this feature request

Delete worker nodes using GitOps / ACM workflow

2. What is the nature and description of the request?

We use siteConfig to deploy a cluster using the GitOps / ACM workflow. We can also use siteConfig to add worker nodes to an existing cluster. However, today we cannot delete a worker node using the GitOps / ACM workflow. We need to manually delete the resources (BMH, nmstateConfig, etc.) and the OpenShift node. We would like to have the node deleted as part of the GitOps workflow.

3. Why does the customer need this? (List the business requirements here)

Worker nodes may need to be replaced for any reason (hardware failures) which may require deletion of a node.

If we are colocating OpenShift and OpenStack control planes on the same infrastructure (using OpenStack director operator to create OpenStack control plane in OCP virtualization), then we also have the use case of assigning baremetal nodes as OpenShift worker nodes or OpenStack compute nodes. Over time we may need to change the role of those baremetal nodes (from worker to compute or from compute to worker). Having the ability to delete worker nodes via GitOps will make it easier to automate that use case.

4. List any affected packages or components.

ACM, GitOps

There is a requirement to handle removal and cleaning of nodes installed into spoke clusters in the ZTP flow (driven by git ops).

The currently proposed solution for this would use the hub cluster BMH to clean the host as it's already configured and can be used for either BM or non-platform spoke clusters.

This removal should be triggered by the deletion of the BMH, but if the BMH is removed we can't also use it to handle deprovisioning the host.

If another finalizer is configured on the BMH, BMO should assume that the host is not ready to be deleted.

Testing steps:

  1. Create and provision a BMH with automatedCleaningMode set to "metadata"
  2. Add the detached annotation: baremetalhost.metal3.io/detached: '{"deleteAction":"delay"}'
  3. Delete the BMH

Deprovisioning should wait until the detached annotation is removed. Previously, the host was deleted before deprovisioning could run.
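
The check behind this behavior can be illustrated with a short sketch. It is written in Python for illustration only and is not the baremetal-operator implementation. The annotation key and value come from the testing steps above, while the function name is hypothetical.

```python
import json

DETACHED_ANNOTATION = "baremetalhost.metal3.io/detached"


def deletion_is_delayed(annotations):
    """Return True if the detached annotation asks to delay deletion."""
    raw = annotations.get(DETACHED_ANNOTATION)
    if not raw:
        return False
    try:
        value = json.loads(raw)
    except ValueError:
        # A value that is not valid JSON carries no deleteAction.
        return False
    return isinstance(value, dict) and value.get("deleteAction") == "delay"


print(deletion_is_delayed({DETACHED_ANNOTATION: '{"deleteAction":"delay"}'}))
```

While this returns True, deletion of the host would be held back. Once the annotation is removed, deprovisioning and deletion can proceed.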

Epic Goal

  • Allow a worker to boot from pointer ignition using custom machine config pool instead of the default worker machine config pool

Why is this important?

  • When setting a custom role, the host reboots to use the custom MCP that is associated with this custom role.  If the host initially boots using the custom MCP instead of the default worker MCP, the extra reboot is avoided.  This extra reboot may take a long time.

Scenarios

  1. Some workers need additional machine configuration in order to fulfill a certain role (for example, storage).

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MGMT-13186 - Spike to check the solution feasibility.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To avoid rebooting a node after a custom role is applied through node labels, we want to pause the MCP and set the right machine config pool in the pointer ignition. Setting a custom role during installation then does not require an additional node reboot.

To avoid the extra reboot when moving to a custom MCP, if the machine config pool name is already set in the host DB record during day 1 installation, the pointer ignition is modified to use the custom MCP.
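
A minimal sketch of such a modification follows. It assumes that the pointer ignition lists its sources under `ignition.config.merge` and that each source URL ends with the pool name. The host name and the `storage` pool in the example are hypothetical.

```python
import copy
from urllib.parse import urlsplit, urlunsplit


def use_custom_pool(pointer_ignition, pool_name):
    """Return a copy of the pointer ignition whose merge sources target pool_name."""
    result = copy.deepcopy(pointer_ignition)
    for entry in result["ignition"]["config"]["merge"]:
        parts = urlsplit(entry["source"])
        # Replace only the last path segment, e.g. /config/worker -> /config/storage.
        prefix = parts.path.rsplit("/", 1)[0]
        entry["source"] = urlunsplit(parts._replace(path=f"{prefix}/{pool_name}"))
    return result


# Hypothetical pointer ignition for a worker.
pointer = {
    "ignition": {
        "version": "3.2.0",
        "config": {"merge": [{"source": "https://api-int.example.com:22623/config/worker"}]},
    }
}
print(use_custom_pool(pointer, "storage")["ignition"]["config"]["merge"][0]["source"])
```

The input is copied rather than changed in place, so the default worker pointer ignition stays available for hosts without a custom pool.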

Epic Goal

Moving forward with changing the way we test assisted-installer. We should change the way assisted-test-infra and subsystem tests on assisted-service are deploying assisted-service.

Why is this important?

There are lots of issues when running minikube with kvm2 driver, most of them are because of the complex setup (downloading the large ISO image, setting up the libvirt VM, defining registry addon, etc.)

Scenarios

  1. e2e tests on assisted-test-infra
  2. subsystem tests on assisted-service repository

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

DoD: running make run on assisted-test-infra repo will install assisted-service on top of a kind cluster, and the same flow should work on CI.

Right now, we do the following to get an e2e testing environment:

  • install arkade, as a simple cli for installing other software
  • install minikube via arkade
  • deploy a minikube cluster with the kvm2 driver, and make sure to install registry addon
  • run minikube tunnel in the background

We should change those actions to support installation and usage of kind, and leverage its ease of use when dealing with images and with service reachability.

Epic Goal

Why is this important?

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

The cluster storage controller operator was recently moved from the guest cluster to the control plane namespace in the mgmt cluster.

As a result, the powervs-block-csi-driver operator is also installed in the control plane namespace. This is the same design followed for aws-ebs-csi-driver.

 

STOR-1038: Reconcile Storage and ClusterCSIDrivers in the guest clusters

 

Goal

This epic has 3 main goals

  1. Improve segment implementation so that we can easily enable additional telemetry pieces (hotjar, etc) for particular cluster types (starting with sandbox, maybe expanding to RHPDS). This will help us better understand where errors and drop-offs occur in our trial and workshop clusters, so that we can (1) help conversion and (2) proactively detect issues before they are "reported" by customers.
  2. Improve telemetry so we can START capturing console usage across the fleet
  3. Additional improvements to segment, to enable proper gathering of user telemetry and analysis

Problem

Currently we have no accurate telemetry of usage of the OpenShift Console across all clusters in the fleet. We should be able to utilize the auth and console telemetry to glean details which will allow us to get a picture of console usage by our customers.

Acceptance criteria

Let's do a spike to validate, and possibly have to update this list after the spike:

Need to verify HOW do we define a cluster Admin -> Listing all namespaces in a cluster? Install operators? Make sure that we consider OSD cluster admins as well (this should be aligned with how we send people to dev perspective in my mind)

Capture additional information via console plugin ( and possibly the auth operator )

  1. Average number of users per cluster
  2. Average number of cluster admin users per cluster
  3. Average number of dev users per cluster
  4. Average # of page views across the fleet
  5. Average # of page views per perspective across the fleet
  6. # of clusters which have disabled the admin perspective for any users
  7. # of clusters which have disabled the dev perspective for any users
  8. # of clusters which have disabled the “any” perspective for any users
  9. # of clusters which have plugin “x” installed
  10. Total number of unique users across the fleet
  11. Total number of cluster admin users across the fleet
  12. Total number of developer users across the fleet

Dependencies (External/Internal):

Understanding how to capture telemetry via the console operator

Exploration:

Note:

We have removed the following ACs for this release:

  1. (p2) Average total active time spent per User in console (per cluster for all users)
    1. per Cluster Admins
    2. per non-Cluster Admins
  2. (p2) Average active time spent in Dev Perspective [implies we can calculate this for admin perspective]
    1. per Cluster Admins
    2. per non-Cluster Admins
  3. (p3) Average # of times they change the perspective (per cluster for all users)

This is a clone of issue OCPBUGS-10956. The following is the description of the original issue:

Description of problem:
With 4.13 we added new metrics to the console (Epic ODC-7171 - Improved telemetry (provide new metrics)) that collect different user and cluster metrics.

The cluster metrics include:

  1. which perspectives are customized (enabled, disabled, only available for a subset of users)
  2. which plugins are installed and enabled

These metrics contain the perspective name or plugin name, which was unbounded. Admins could configure any perspective or plugin name, even if no perspective or plugin with that name is available.

Based on the feedback in https://github.com/openshift/cluster-monitoring-operator/pull/1910 we need to reduce the cardinality and limit the metrics to, for example:

  1. perspectives: admin, dev, acm, other
  2. plugins: redhat, demo, other
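
One way to bound the label values is to map every configured name onto a fixed set before it is exported. The sketch below is illustrative only. The bucket names come from the list above, but the allow-lists that decide which plugin counts as `redhat` or `demo` are hypothetical.

```python
from collections import Counter

KNOWN_PERSPECTIVES = {"admin", "dev", "acm"}
# Hypothetical allow-lists; the real classification of plugins is not given here.
REDHAT_PLUGINS = {"acm", "mce", "logging-view-plugin", "crane-ui-plugin"}
DEMO_PLUGINS = {"console-demo-plugin"}


def perspective_label(name):
    """Map any configured perspective id onto admin, dev, acm or other."""
    return name if name in KNOWN_PERSPECTIVES else "other"


def plugin_label(name):
    """Map any configured plugin name onto redhat, demo or other."""
    if name in REDHAT_PLUGINS:
        return "redhat"
    if name in DEMO_PLUGINS:
        return "demo"
    return "other"


# Arbitrary perspective ids collapse into a bounded set of label values.
counts = Counter(perspective_label(p) for p in ["dev", "dev1", "dev2", "dev3"])
print(dict(counts))
```

However many names an admin configures, the number of distinct label values stays fixed, which is what keeps the metric cardinality low.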

Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
Always

Steps to Reproduce:
On a cluster, you must update the console configuration, configure some perspectives or plugins and check the metrics in Admin > Observe > Metrics:

avg by (name, state) (console_plugins_info)

avg by (name, state) (console_customization_perspectives_info)

On a local machine, you can use this console yaml:

apiVersion: console.openshift.io/v1
kind: ConsoleConfig
plugins: 
  logging-view-plugin: https://logging-view-plugin.logging-view-plugin-namespace.svc.cluster.local:9443/
  crane-ui-plugin: https://crane-ui-plugin.crane-ui-plugin-namespace.svc.cluster.local:9443/
  acm: https://acm.acm-namespace.svc.cluster.local:9443/
  mce: https://mce.mce-namespace.svc.cluster.local:9443/
  my-plugin: https://my-plugin.my-plugin-namespace.svc.cluster.local:9443/
customization: 
  perspectives: 
  - id: admin
    visibility: 
      state: Enabled
  - id: dev
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev1
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev2
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev3
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get

And start the bridge with:

./build-backend.sh
./bin/bridge -config ../config.yaml

After that you can fetch the metrics in a second terminal:

Actual results:

curl -s localhost:9000/metrics | grep ^console_plugins

console_plugins_info{name="acm",state="enabled"} 1
console_plugins_info{name="crane-ui-plugin",state="enabled"} 1
console_plugins_info{name="logging-view-plugin",state="enabled"} 1
console_plugins_info{name="mce",state="enabled"} 1
console_plugins_info{name="my-plugin",state="enabled"} 1

curl -s localhost:9000/metrics | grep ^console_customization

console_customization_perspectives_info{name="dev",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev1",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev2",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev3",state="only-for-developers"} 1

Expected results:
Lower cardinality; that is, results should be grouped somehow.

Additional info:

This is a clone of issue OCPBUGS-12439. The following is the description of the original issue:

As Red Hat, we want to understand the usage of the (dev) console. For that, we want to add new Prometheus metrics (how many users a cluster has, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.

Either the console-operator or cluster-monitoring-operator needs to apply a PrometheusRule to collect the right data and make it available later in Superset DataHat or Tableau.

Description

As RH PM/engineer, we want to understand the usage of the (dev) console. For that, we want to add new Prometheus metrics (how many users a cluster has, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.

Acceptance Criteria

  1. Send metrics (page views and impression events) to the usage endpoint created in ODC-7232 so that we can see how actively the console is used.

Additional Details:

Description

As RH PM/engineer, we want to understand the usage of the (dev) console. For that, we want to add new Prometheus metrics (how many users a cluster has, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.

Acceptance Criteria

  1. Add new prometheus metrics based on the Epic ODC-7171 and the created metrics documentation https://docs.google.com/document/d/1PqbKv_-q2PW8mK3lwGEjpLwO5jdf9TojOchnjUY9YMU/edit#
  2. Add a new endpoint to get events from the frontend so that we can also track page navigations via history.push

Additional Details:

Ability to gather logs from CMA via must gather 

May need to check if CMA operator is present, then gather CMA specific logs and config.

https://github.com/openshift/must-gather/tree/master/collection-scripts

As an OpenShift developer, I would like must-gather to collect info helpful to diagnosing issues with Custom Metrics Autoscaler. Please gather information similar to what one could get from running the following commands:

oc get -A kedacontrollers.keda.sh -o yaml
oc get -A scaledjobs.keda.sh -o yaml
oc get -A scaledobjects.keda.sh -o yaml
oc get -A triggerauthentications.keda.sh -o yaml
oc get csv -n openshift-keda -l operators.coreos.com/openshift-custom-metrics-autoscaler-operator.openshift-keda -o yaml

(these probably don't need to be done since they're probably already in must-gather, but please verify)
oc get deployment -n openshift-keda -o yaml
oc get operatorgroup -n openshift-keda -o yaml
oc get subscription -n openshift-keda -o yaml

Make sure that the script ignores any failures to fetch (as might happen if the CRDs haven't been added to the cluster when the operator is installed).
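
The "ignore failures" requirement can be sketched as below. The must-gather collection scripts are shell scripts, so this Python version only illustrates the control flow. The `gather` function and its injected `run` callable are hypothetical, and the resource names are taken from the commands above.

```python
RESOURCES = [
    "kedacontrollers.keda.sh",
    "scaledjobs.keda.sh",
    "scaledobjects.keda.sh",
    "triggerauthentications.keda.sh",
]


def gather(resources, run):
    """Run `oc get -A <resource> -o yaml` for each resource, ignoring failures.

    `run` is any callable that takes an argument list and returns the command
    output, raising an exception on failure.
    """
    collected, skipped = {}, []
    for resource in resources:
        try:
            collected[resource] = run(["oc", "get", "-A", resource, "-o", "yaml"])
        except Exception:
            # For example, the CRD is not installed; keep going with the rest.
            skipped.append(resource)
    return collected, skipped
```

On a real cluster, `run` could be `lambda args: subprocess.check_output(args, text=True)`. Injecting it keeps the sketch testable without access to `oc`.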

Epic Goal

  • Every SDN feature should have a corresponding e2e test
  • Some fixes should have regression tests

Why is this important?

  • We want to catch new bugs before they merge, rather than only finding out about them when QE finds them
  • e2e tests run in a variety of environments and may catch bugs that QE testing does not

Acceptance Criteria

  • Every SDN feature has an e2e test, ideally as part of openshift-tests, but possibly in some other test suite if it's not possible to test them in a stock environment
  • Release Technical Enablement - N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

slack thread here

Essentially, a readiness probe (a check for /etc/cni/net.d/10-ovn-kubernetes.conf)
fails while the ovnkubenode DBs are still coming up. If the probe failure
message is seen more than 20 times, it can fail CI. The test case that fails is
"events should not repeat pathologically". As of the creation of this bug, it is
failing in less than 1% of all failed OVN jobs. The error message is benign and
expected in this scenario.

some solutions to stop this from failing CI could be:

  • enhance the test code to understand this specific failure and ignore it
  • reduce the polling and/or increase the poll timers so that this message
    would have no chance to occur more than 20 times in a normal scenario
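
The first option could look roughly like the following sketch. The regular expression for the benign message, the names, and the input shape (a mapping from event message to repeat count) are assumptions for illustration; only the threshold of 20 comes from the description above.

```python
import re

# Hypothetical pattern for the benign readiness-probe message.
ALLOWED = [re.compile(r"Readiness probe failed.*10-ovn-kubernetes\.conf")]
MAX_REPEATS = 20


def pathological_events(event_counts, allowed=ALLOWED, limit=MAX_REPEATS):
    """Return messages repeated more than `limit` times, minus allowed ones."""
    return sorted(
        message
        for message, count in event_counts.items()
        if count > limit and not any(p.search(message) for p in allowed)
    )


events = {
    "Readiness probe failed: cat: /etc/cni/net.d/10-ovn-kubernetes.conf: No such file": 25,
    "Back-off restarting failed container": 21,
    "Pulled image": 3,
}
print(pathological_events(events))
```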

example job here

Goal: Support OVN-IPsec on IBM Cloud platform.

Why is this important: IBM Cloud is being added as a new OpenShift supported platform, targeting 4.9/4.10 GA.

Dependencies (internal and external):

Prioritized epics + deliverables (in scope / not in scope):

  • Need to have permission to spin up IBM clusters

Not in scope:

Estimate (XS, S, M, L, XL, XXL):

Previous Work:

Open questions:

Acceptance criteria:

Epic Done Checklist:

  • CI - CI Job & Automated tests: <link to CI Job & automated tests>
  • Release Enablement: <link to Feature Enablement Presentation> 
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • Notes for Done Checklist
    • Adding links to the above checklist with multiple teams contributing; select a meaningful reference for this Epic.
    • Checklist added to each Epic in the description, to be filled out as phases are completed - tracking progress towards “Done” for the Epic.

For IPsec support on IBM, we need to enable NAT-T. For that, we need:

OVN: https://github.com/ovn-org/ovn/commit/d6dd8e49551141159f040406202f8550c18a1846
OVS: https://github.com/openvswitch/ovs/commit/e8515c8cc082964f7611e6f03300e614b9b8eaca

There's a high likelihood that this will not make it into earlier versions of OVS, but it is upstream in OVS 3.0 now. So when we pull that into OCP, the fix will be there.
Check https://coreos.slack.com/archives/C01G7T6SYSD/p1658413488130169 eventually.

The upstream patch for 2.17 is here, but there's a chance that it may be declined: http://patchwork.ozlabs.org/project/openvswitch/patch/20220722101122.19470-1-ak.karis@gmail.com/

I can't find any openvswitch3.0 downstream RPMs (https://brewweb.engineering.redhat.com/brew/search?match=glob&type=package&terms=openvswitch*), so I suppose that we do not have them yet.

We have a consistent complication where developers miss or ignore job failures on presubmits, because they don't trust the jobs which sometimes have overall pass rates under 30%.

We have a systemic problem with flaky tests and jobs. Few pay attention anymore, and even fewer people know how to distinguish serious failures from the noise.

Just fixing the tests and jobs is infeasible. It could perhaps be done piece by piece, but we do not have the time to invest in what would be a massive effort.

Sippy now has presubmit data throughout the history of a PR.

Could Sippy analyze the presubmits for every PR, check test failures against their current pass rate, filter out noise from on-going incidents, and then comment on PRs to let developers know what's really going on?

As an example:

job foo - failure severity: LOW

  • test a failed x times, current pass rate 40%, flake rate 20%

job bar - failure severity: HIGH

  • test b failed 2 times, current pass rate 99%

job zoo - failure severity: UNKNOWN

  • on-going incident: Azure Install Failures (TRT-XXX)
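
The classification in this example can be sketched as a small function. The 95% threshold and the function name are hypothetical and only reproduce the three outcomes shown above; they do not describe how Sippy actually scores failures.

```python
def failure_severity(failures, ongoing_incident=False):
    """Classify a failed job run from (test name, current pass rate %) pairs."""
    if ongoing_incident:
        # Noise from a known incident hides the real signal.
        return "UNKNOWN"
    # Hypothetical threshold: a failure in a test that normally passes is serious.
    if any(pass_rate >= 95 for _name, pass_rate in failures):
        return "HIGH"
    return "LOW"


print(failure_severity([("test a", 40)]))
print(failure_severity([("test b", 99)]))
print(failure_severity([], ongoing_incident=True))
```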

David requests that this be published in the job as a spyglass panel, which provides a historical artifact. We'd likely do both so we know they see comments.

This epic will cover TRT's project to enhance Sippy to categorize the likely severity of test failures in a bad job run, store this as a historical artifact on the job run, and communicate it directly to developers in their PRs via a comment.

i.e. install failures.

May require some very explicit install handling?

Possible simple solution: Lookup a successful job run, see how many tests it ran, make sure we're in range of that.
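
A minimal sketch of that check, assuming a relative tolerance of 10% around the test count of a successful reference run (the tolerance and the names are hypothetical):

```python
def ran_enough_tests(tests_run, reference_tests_run, tolerance=0.1):
    """Return True if a run executed roughly as many tests as a successful run."""
    allowed_difference = reference_tests_run * tolerance
    return abs(tests_run - reference_tests_run) <= allowed_difference


# A run that stopped early, for example after an install failure, falls out of range.
print(ran_enough_tests(950, 1000))
print(ran_enough_tests(300, 1000))
```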

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Continuation of https://issues.redhat.com/browse/WRKLDS-488.

Acceptance criteria

  • Microshift test results are available in Sippy

Tests that need to be refactored for MicroShift (based on comments in https://github.com/openshift/origin/pull/27359)

  • networking
    • when running openshift ipv4 cluster on bare metal
  • olm
    • have imagePullPolicy:IfNotPresent on their deployments
  • operators
    • [sig-arch] Managed cluster should
  • prometheus
    • [sig-instrumentation][Late] OpenShift alerting rules
  • router
    • should not work when configured with a 1024-bit RSA key
    • [sig-network][Feature:Router]
    • should pass the h2spec conformance tests
  • security
    • Ensure supplemental groups propagate to docker should propagate requested groups to the container

The router tests got updated through https://github.com/openshift/origin/pull/27476 (removing template.openshift.io group).

"[sig-api-machinery][Feature:ServerSideApply] Server-Side Apply should work for security.openshift.io/v1, Resource=rangeallocations [apigroup:security.openshift.io] [Suite:openshift/conformance/parallel]"

 

Test fails because rangeallocations is not a valid resource even if API group is present.

sig-cli is failing in two different ways:

  • Missing api resources from api groups.
  • A bug where the loop variables are not captured in closures, rendering random errors on each execution because they get overwritten for past It functions.
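
The second failure mode is a late-binding closure problem. The origin tests are written in Go, so the following Python analogue is only an illustration of the same class of bug and its fix; the function names are hypothetical.

```python
def build_checks_buggy(resources):
    """Each closure reads `resource` when called, after the loop has finished."""
    checks = []
    for resource in resources:
        checks.append(lambda: resource)
    return checks


def build_checks_fixed(resources):
    """Binding the value as a default argument captures it per iteration."""
    checks = []
    for resource in resources:
        checks.append(lambda resource=resource: resource)
    return checks


names = ["pods", "routes", "builds"]
print([check() for check in build_checks_buggy(names)])
print([check() for check in build_checks_fixed(names)])
```

In the buggy version every closure sees the last value of the loop variable, which matches the symptom of earlier test bodies being overwritten by later iterations.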

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

This is a clone of issue OCPBUGS-11304. The following is the description of the original issue:

Description of problem:

Nodes are taking more than 5m0s to stage OSUpdate

https://sippy.dptools.openshift.org/sippy-ng/tests/4.13/analysis?test=%5Bbz-Machine%20Config%20Operator%5D%20Nodes%20should%20reach%20OSUpdateStaged%20in%20a%20timely%20fashion 

Test started failing back on 2/16/2023. First occurrence of the failure https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade/1626326464246845440 

Most recent occurrences across multiple platforms https://search.ci.openshift.org/?search=Nodes+should+reach+OSUpdateStaged+in+a+timely+fashion&maxAge=48h&context=1&type=junit&name=4.13&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

6 nodes took over 5m0s to stage OSUpdate:
node/ip-10-0-216-81.ec2.internal OSUpdateStarted at 2023-02-16T22:24:56Z, did not make it to OSUpdateStaged
node/ip-10-0-174-123.ec2.internal OSUpdateStarted at 2023-02-16T22:13:07Z, did not make it to OSUpdateStaged
node/ip-10-0-144-29.ec2.internal OSUpdateStarted at 2023-02-16T22:12:50Z, did not make it to OSUpdateStaged
node/ip-10-0-179-251.ec2.internal OSUpdateStarted at 2023-02-16T22:15:48Z, did not make it to OSUpdateStaged
node/ip-10-0-180-197.ec2.internal OSUpdateStarted at 2023-02-16T22:19:07Z, did not make it to OSUpdateStaged
node/ip-10-0-213-155.ec2.internal OSUpdateStarted at 2023-02-16T22:19:21Z, did not make it to OSUpdateStaged

Expected results:

 

Additional info:

 

Description of problem:

`Availability requirement` dropdown options 'maxUnavailable' and 'minAvailable' on the PDB creation page are not pseudo translated

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-09-213203

How reproducible:

Always

Steps to Reproduce:

1. Create an example deployment
2. View the console in pseudo translation by adding the ?pseudolocalization=true&lng=en suffix
3. Go to the PodDisruptionBudget creation page by clicking the kebab item 'Add PodDisruptionBudget' on the deployment list page or the 'Add PodDisruptionBudget' action item on the deployment detail page
4. check translations on the PDB creation page

Actual results:

4. `Availability requirement` dropdown options 'maxUnavailable' and 'minAvailable' are not pseudo translated

Expected results:

4. The dropdown items should also be pseudo translated

Additional info:

 

Description of problem:

Agent-tui should show before the installation, but it shows again during the installation, and when it quits again, the installation fails to continue.

Version-Release number of selected component (if applicable):

4.13.0-0.ci-2023-03-14-045458

How reproducible:

always

Steps to Reproduce:

1. Make sure the primary check passes, and boot the agent.x86_64.iso file. The agent-tui shows before the installation

2. Tracking installation by both wait-for output and console output

3. The agent-tui shows again during the installation. After waiting for the agent-tui to quit automatically without any user interruption, the installation quits with failure, and we have the following wait-for output:

DEBUG asset directory: .                           
DEBUG Loading Agent Config...                      
...
DEBUG Agent Rest API never initialized. Bootstrap Kube API never initialized 
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
DEBUG Agent Rest API Initialized                   
INFO Cluster is not ready for install. Check validations 
DEBUG Cluster validation: The pull secret is set.  
WARNING Cluster validation: The cluster has hosts that are not ready to install. 
DEBUG Cluster validation: The cluster has the exact amount of dedicated control plane nodes. 
DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: The Cluster Network CIDR is defined. 
DEBUG Cluster validation: The base domain is defined. 
DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: The Machine Network CIDR is defined. 
DEBUG Cluster validation: The Cluster Machine CIDR is not required: User Managed Networking 
DEBUG Cluster validation: The Cluster Network prefix is valid. 
DEBUG Cluster validation: The cluster has a valid network type 
DEBUG Cluster validation: Same address families for all networks. 
DEBUG Cluster validation: No CIDRS are overlapping. 
DEBUG Cluster validation: No ntp problems found    
DEBUG Cluster validation: The Service Network CIDR is defined. 
DEBUG Cluster validation: cnv is disabled          
DEBUG Cluster validation: lso is disabled          
DEBUG Cluster validation: lvm is disabled          
DEBUG Cluster validation: odf is disabled          
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Valid inventory exists for the host 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient minimum RAM 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient disk capacity 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores for role master 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient RAM for role master 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is unique in cluster 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is allowed 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Speed of installation disk has not yet been measured 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is compatible with cluster platform none 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: VSphere disk.EnableUUID is enabled for this virtual machine 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host agent compatibility checking is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No request to skip formatting of the installation disk 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All disks that have skipped formatting are present in the host inventory 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is connected 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Media device is connected 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No Machine Network CIDR needed: User Managed Networking 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host belongs to all machine network CIDRs 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has connectivity to the majority of hosts in the cluster 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Platform PowerEdge R740 is allowed 
WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host couldn't synchronize with any NTP server 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host clock is synchronized with service 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All required container images were either pulled successfully or no attempt was made to pull them 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Network latency requirement has been satisfied. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Packet loss requirement has been satisfied. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has been configured with at least one default route. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api-int.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the *.apps.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host subnets are not overlapping 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No IP collisions were detected by host 7a9649d8-4167-a1f9-ad5f-385c052e2744 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: cnv is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lso is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lvm is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: odf is disabled 
WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server) 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host NTP is synced 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from insufficient to known (Host is ready to be installed) 
INFO Cluster is ready for install                 
INFO Cluster validation: All hosts in the cluster are ready to install. 
INFO Preparing cluster for installation           
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: New image status registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:b0d518907841eb35adbc05962d4b2e7d45abc90baebc5a82d0398e1113ec04d0. result: success. time: 1.35 seconds; size: 401.45 Megabytes; download rate: 312.54 MBps 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) 
INFO Cluster installation in progress             
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-successful to installing (Installation is in progress) 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Starting installation: bootstrap 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Installing: bootstrap 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:f85a278868035dc0a40a66ea7eaf0877624ef9fde9fc8df1633dc5d6d1ad4e39 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "...  to initialize single run daemon: error initializing rpm-ostree: Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists" 
INFO Cluster has hosts in error                   
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation   

4. During the installation, NetworkManager-wait-online.service ran for about a minute and then failed:
-- Logs begin at Wed 2023-03-15 03:06:29 UTC, end at Wed 2023-03-15 03:27:30 UTC. --
Mar 15 03:18:52 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Starting Network Manager Wait Online...
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Failed to start Network Manager Wait Online.

Expected results:

The TUI should only show once before the installation.

This is a clone of issue OCPBUGS-12913. The following is the description of the original issue:

Description of problem

CI is flaky because the TestRouterCompressionOperation test fails.

Version-Release number of selected component (if applicable)

I have seen these failures on 4.14 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 7.71% of runs (16.58% of failures) across 402 total runs and 24 jobs (46.52% failed)

GCP is most impacted:

pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator (all) - 44 runs, 86% failed, 37% of failures match = 32% impact

Azure and AWS are also impacted:

pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator (all) - 36 runs, 64% failed, 43% of failures match = 28% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 38 runs, 79% failed, 23% of failures match = 18% impact
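
The "impact" figure in the search.ci lines above appears to be the job's failure rate multiplied by the share of failures that match the search. The following is a minimal sketch of that arithmetic, assuming this interpretation; the function name is made up for illustration and is not part of search.ci. Under that assumption it reproduces the 32%, 28% and 18% figures quoted above.

```python
def impact_percent(failed_pct, match_pct):
    """Approximate percentage of all runs that hit the matching failure.

    Assumes impact = (percent of runs that failed) x (percent of those
    failures that match the search), which is an interpretation of the
    search.ci summary line, not its documented formula.
    """
    return failed_pct * match_pct / 100.0


# (percent failed, percent of failures matching), copied from the report above.
jobs = {
    "e2e-gcp-operator": (86, 37),
    "e2e-azure-operator": (64, 43),
    "e2e-aws-operator": (79, 23),
}

for name, (failed, match) in jobs.items():
    print(f"{name}: {impact_percent(failed, match):.0f}% impact")
```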

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=compression+error%3A+expected&maxAge=336h&context=1&type=build-log&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails:

TestAll/serial/TestRouterCompressionOperation 
=== RUN   TestAll/serial/TestRouterCompressionOperation
    router_compression_test.go:209: compression error: expected "gzip", got "" for canary route

Expected results

CI passes, or it fails on a different test.

Description of problem:

The Cluster Settings page shows that an upgrade is available. After the user chooses a target version and clicks "Upgrade", no information about the upgrade status appears, even after a long wait.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

Always

Steps to Reproduce:

1. Log in to the console of a cluster that has an available upgrade, select a target version in the list of available versions, and click "Update". Check the upgrade progress on the Cluster Settings page.
2. Check the upgrade information from the client with "oc adm upgrade".
3.

Actual results:

1. No information or upgrade progress is shown on the page.
2. The client reports that retrieving the target version failed:
$ oc adm upgrade 
Cluster version is 4.12.0-0.nightly-2022-10-25-210451
  ReleaseAccepted=False  
  Reason: RetrievePayload
  Message: Retrieving payload failed version="4.12.0-0.nightly-2022-10-27-053332" image="registry.ci.openshift.org/ocp/release@sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1" failure=The update cannot be verified: unable to verify sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1 against keyrings: verifier-public-key-redhatUpstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.12
Recommended updates:  
  VERSION                            IMAGE
  4.12.0-0.nightly-2022-11-01-135441 registry.ci.openshift.org/ocp/release@sha256:f79d25c821a73496f4664a81a123925236d0c7818fd6122feb953bc64e91f5d0
  4.12.0-0.nightly-2022-10-31-232349 registry.ci.openshift.org/ocp/release@sha256:cb2d157805abc413394fc579776d3f4406b0a2c2ed03047b6f7958e6f3d92622
  4.12.0-0.nightly-2022-10-28-001703 registry.ci.openshift.org/ocp/release@sha256:c914c11492cf78fb819f4b617544cd299c3a12f400e106355be653c0013c2530
  4.12.0-0.nightly-2022-10-27-053332 registry.ci.openshift.org/ocp/release@sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1

Expected results:

1. The console page should also show this kind of message when retrieving the target payload fails, so that the user knows the actual result of the upgrade attempt.
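
The failure reported by "oc adm upgrade" above corresponds to the ReleaseAccepted condition in the ClusterVersion status. As a rough illustration only (the console itself is not written in Python, and the helper name and sample data here are hypothetical), a client could surface that condition from the JSON form of the ClusterVersion object like this:

```python
def release_rejection(cluster_version):
    """Return (reason, message) when ReleaseAccepted is False, otherwise None.

    `cluster_version` is assumed to be the parsed JSON of a ClusterVersion
    object, for example from `oc get clusterversion version -o json`.
    """
    conditions = cluster_version.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "ReleaseAccepted" and cond.get("status") == "False":
            return cond.get("reason"), cond.get("message")
    return None


# Hypothetical, shortened sample shaped like the output in this report.
sample = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {
                "type": "ReleaseAccepted",
                "status": "False",
                "reason": "RetrievePayload",
                "message": "Retrieving payload failed",
            },
        ]
    }
}

rejection = release_rejection(sample)
if rejection is not None:
    reason, message = rejection
    print(f"Update was not accepted ({reason}): {message}")
```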

Additional info:

 

Description of problem:

When the user supplies nmstateConfig in agent-config.yaml, invalid configurations may not be detected.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

every time

Steps to Reproduce:

1. Create an invalid NM config. In this case, an interface was defined with a route but no IP address.
2. The ISO is generated with no errors.
3. At run time, assisted-service detected the invalid configuration, and create-cluster-and-infraenv.service logged the error "failed to validate network yaml for host 0, invalid yaml, error:"
 

Actual results:

Installation failed

Expected results:

The invalid configuration should be detected when the ISO is created.

Additional info:

It looks like the ValidateStaticConfigParams check is ONLY done when the nmstateconfig is provided in nmstateconfig.yaml, not when the file is generated (supplied in agent-config.yaml). https://github.com/openshift/installer/blob/master/pkg/asset/agent/manifests/nmstateconfig.go#L188
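
To illustrate the kind of check that could run at ISO generation time regardless of which file supplies the network configuration, here is a simplified sketch. It is not the installer's ValidateStaticConfigParams logic (the installer is written in Go); the function name is hypothetical, and the dictionary layout assumes the usual nmstate YAML keys (`interfaces`, `routes.config`, `next-hop-interface`). It only flags the specific case from this report: a route whose next-hop interface has no IP configuration.

```python
def interfaces_with_route_but_no_address(network_config):
    """Return sorted names of next-hop interfaces that have no IP configuration."""
    addressed = set()
    for iface in network_config.get("interfaces", []):
        for family in ("ipv4", "ipv6"):
            ip = iface.get(family) or {}
            # An interface counts as addressed if the family is enabled and
            # it has a static address or obtains one dynamically.
            if ip.get("enabled") and (ip.get("address") or ip.get("dhcp") or ip.get("autoconf")):
                addressed.add(iface.get("name"))

    routes = (network_config.get("routes") or {}).get("config") or []
    missing = {
        route["next-hop-interface"]
        for route in routes
        if route.get("next-hop-interface") and route["next-hop-interface"] not in addressed
    }
    return sorted(missing)


# Hypothetical host config: eth0 carries the default route but has no address.
bad_config = {
    "interfaces": [{"name": "eth0", "type": "ethernet", "state": "up"}],
    "routes": {
        "config": [
            {
                "destination": "0.0.0.0/0",
                "next-hop-address": "192.168.111.1",
                "next-hop-interface": "eth0",
            }
        ]
    },
}

print(interfaces_with_route_but_no_address(bad_config))
```

Running the same check on both input paths (nmstateconfig.yaml and the config generated from agent-config.yaml) would let the error surface before the ISO is booted.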

 

 

Description of problem:

Missing i18n key for PAC section in Git import form

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Install Red Hat Pipeline operator

Steps to Reproduce:

1. Navigate to Import from Git form
2. Enter Git URL https://github.com/karthikjeeyar/demo-app 
3. Open the browser console

Actual results:

Missing i18n key error in the console

Expected results:

Should not show missing i18n key error in the console

Additional info:

 

This is a clone of issue OCPBUGS-7973. The following is the description of the original issue:

Description of problem:

After a private cluster is destroyed, the cluster's DNS records are left behind.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-02-26-022418 
4.13.0-0.nightly-2023-02-26-081527 

How reproducible:

always

Steps to Reproduce:

1. Create a private cluster.
2. Destroy the cluster.
3. Check the DNS records:
$ibmcloud dns zones | grep private-ibmcloud.qe.devcluster.openshift.com (base_domain)
3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b   private-ibmcloud.qe.devcluster.openshift.com     PENDING_NETWORK_ADD
$zone_id=3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b
$ibmcloud dns resource-records $zone_id
CNAME:520c532f-ca61-40eb-a04e-1a2569c14a0b   api-int.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com   CNAME   60    10a7a6c7-jp-tok.lb.appdomain.cloud   
CNAME:751cf3ce-06fc-4daf-8a44-bf1a8540dc60   api.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com       CNAME   60    10a7a6c7-jp-tok.lb.appdomain.cloud   
CNAME:dea469e3-01cd-462f-85e3-0c1e6423b107   *.apps.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com    CNAME   120   395ec2b3-jp-tok.lb.appdomain.cloud 

Actual results:

The DNS records of the cluster were left behind.

Expected results:

All DNS records created by the installer are deleted after the cluster is destroyed.

Additional info:

This blocks later creation of private clusters, because the maximum limit of 5 wildcard records is easily reached (a QE account limitation).
The *ingress-operator.log of the failed cluster shows the error: "createOrUpdateDNSRecord: failed to create the dns record: Reached the maximum limit of 5 wildcard records."

Description of problem:

When the UI switches to the Developer console, the topology is always blank in a project that has a large number of components.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always occurs

Steps to Reproduce:

1. Create a project with at least 12 components (Apps, Operators, knative Brokers)
2. Go to the Administrator Viewpoint
3. Switch to Developer Viewpoint/Topology
4. No components displayed
5. Click on 'fit to screen'
6. All components appear

Actual results:

Topology renders with all controls but no components visible (see screenshot 1)

Expected results:

All components should be visible

Additional info:

 

Description of problem:

The cluster-ingress-operator log output is a little noisy when starting the operator's controllers, in part because of the way in which the configurable-route controller configures its watches.

Version-Release number of selected component (if applicable):

4.10+.

How reproducible:

Always.

Steps to Reproduce:

1. Check the ingress-operator logs, and search for "configurable_route_controller": oc -n openshift-ingress-operator logs -c ingress-operator deploy/ingress-operator | grep -e configurable_route_controller

Actual results:

The operator emits log messages like the following on startup:

2022-11-23T08:47:35.646-0600    INFO    operator.init   controller/controller.go:241    Starting EventSource    {"controller": "configurable_route_controller", "source": "&{{%!s(*v1.Role=&{{ } {      0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] [] []} []}) %!s(*cache.multiNamespaceCache=&{map[openshift-config:0xc000712110 openshift-config-managed:0xc000712108 openshift-ingress:0xc0007120f8 openshift-ingress-canary:0xc000712100 openshift-ingress-operator:0xc0007120e8] 0xc000261ea0 0xc00010e190 0xc0007120e0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
2022-11-23T08:47:35.646-0600    INFO    operator.init   controller/controller.go:241    Starting EventSource    {"controller": "configurable_route_controller", "source": "&{{%!s(*v1.RoleBinding=&{{ } {      0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] [] []} [] {  }}) %!s(*cache.multiNamespaceCache=&{map[openshift-config:0xc000712110 openshift-config-managed:0xc000712108 openshift-ingress:0xc0007120f8 openshift-ingress-canary:0xc000712100 openshift-ingress-operator:0xc0007120e8] 0xc000261ea0 0xc00010e190 0xc0007120e0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
2022-11-23T08:47:35.646-0600    INFO    operator.init   controller/controller.go:241    Starting Controller     {"controller": "configurable_route_controller"}

Expected results:

The operator should emit log messages like the following on startup:

2022-11-23T08:48:43.076-0600    INFO    operator.init   controller/controller.go:241    Starting EventSource    {"controller": "configurable_route_controller", "source": "kind source: *v1.Role"}
2022-11-23T08:48:43.078-0600    INFO    operator.init   controller/controller.go:241    Starting EventSource    {"controller": "configurable_route_controller", "source": "kind source: *v1.RoleBinding"}
2022-11-23T08:48:43.078-0600    INFO    operator.init   controller/controller.go:241    Starting Controller     {"controller": "configurable_route_controller"}

Additional info:

The cited noisiness results from two issues. First, the configurable-route controller needlessly uses source.NewKindWithCache() to configure its watches when it would be sufficient and slightly simpler to use source.Kind.

Second, recent versions of controller-runtime have excessively noisy logging for the kindWithCache source type. The configurable-route controller was introduced in OpenShift 4.8, which uses controller-runtime v0.9.0-alpha.1. OpenShift 4.9 has controller-runtime v0.9.0, OpenShift 4.10 has controller-runtime v0.11.0, and OpenShift 4.11 has controller-runtime v0.12.0. A change in controller-runtime v0.11.0 causes the noisiness. Before this change, the output was excessively quiet, for example:

2022-09-28T20:51:40.979Z	INFO	operator.init.controller-runtime.manager.controller.configurable_route_controller	controller/controller.go:221	Starting EventSource	{"source": {}}
2022-09-28T20:51:40.979Z	INFO	operator.init.controller-runtime.manager.controller.configurable_route_controller	controller/controller.go:221	Starting EventSource	{"source": {}}

I have filed an issue upstream to improve the logging for kindWithCache: https://github.com/kubernetes-sigs/controller-runtime/pull/2057

Context:

As we start receiving metrics consistently in OCM environments, and as we create SLO dashboards that can consume data from any data source (prod/stage/CI), we also want to revisit how we send metrics and make sure we are doing it in the most effective way. We currently have some wonky data coming through in prod.

DoD:

At the moment we have a high-frequency reconciliation loop in which we constantly review the overall state of the world by looping over all clusters.

We should review this approach and, where possible, record each metric or event once, directly in the controller's reconcile loop as it happens, rather than repeatedly in a loop.
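
One way to picture the proposed change is a recorder that emits a metric only when a cluster's state actually changes, instead of re-counting every cluster on every pass. This is a minimal sketch under that assumption; the class, method and state names are hypothetical and do not come from the operator's code.

```python
from collections import Counter


class TransitionRecorder:
    """Counts state transitions once, when they are observed."""

    def __init__(self):
        self.last_state = {}          # cluster id -> last observed state
        self.transitions = Counter()  # state -> number of recorded transitions

    def observe(self, cluster_id, state):
        """Record a metric only if the state differs from the last one seen.

        Returns True when a transition was recorded, False when the call
        was a repeat observation of the same state.
        """
        if self.last_state.get(cluster_id) == state:
            return False
        self.last_state[cluster_id] = state
        self.transitions[state] += 1
        return True


recorder = TransitionRecorder()
# Repeated reconciles of the same state do not inflate the counter.
for _ in range(3):
    recorder.observe("cluster-a", "installing")
recorder.observe("cluster-a", "ready")
print(dict(recorder.transitions))
```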

Description of problem:

This e2e test has a few flakes:

[sig-storage][Feature:DisableStorageClass][Serial] should not reconcile the StorageClass when StorageClassState is Unmanaged [Suite:openshift/conformance/serial]

Example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/36373/rehearse-36373-periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-ovn-upi-serial/1625549967886127104

The test was introduced in https://github.com/openshift/origin/pull/27704

From sippy, it seems only the Unmanaged test is flaking (on vSphere and AWS). This test seems particularly racy too. We set SC state to Unmanaged, immediately set AllowVolumeExpansion, and then periodically check the value of AllowVolumeExpansion until the test gives up.

This model works for the Managed and Removed tests, where we expect the operator to reconcile the state eventually. But for unmanaged, we're changing something and expecting the operator NOT to reconcile it. This means we could set SC state to Unmanaged, immediately change AllowVolumeExpansion, and the operator could still revert that change before noticing the new StorageClassState value. But we never attempt to set AllowVolumeExpansion in this test again after the first try.
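
The race described above suggests re-applying the change until it survives several consecutive reads, rather than setting it once. The following is a small simulation of that idea, not the actual origin test (which is written in Go); `FakeStorageClass` and `set_until_stable` are hypothetical names, and the fake models an operator that reverts one change because it acted on a stale StorageClassState.

```python
import time


class FakeStorageClass:
    """Stand-in for a StorageClass whose operator may revert stale changes."""

    def __init__(self, pending_reverts=1):
        self.allow_volume_expansion = False
        self.pending_reverts = pending_reverts

    def set(self, value):
        self.allow_volume_expansion = value

    def get(self):
        # Simulate the operator reconciling before it notices "Unmanaged".
        if self.pending_reverts > 0:
            self.pending_reverts -= 1
            self.allow_volume_expansion = False
        return self.allow_volume_expansion


def set_until_stable(set_value, get_value, desired, attempts=5, settle_checks=3, interval=0.0):
    """Re-apply `desired` until it survives `settle_checks` consecutive reads."""
    for _ in range(attempts):
        set_value(desired)
        stable = True
        for _ in range(settle_checks):
            time.sleep(interval)
            if get_value() != desired:
                stable = False
                break
        if stable:
            return True
    return False


sc = FakeStorageClass(pending_reverts=1)
print(set_until_stable(sc.set, sc.get, True))
```

With a single late revert, the first attempt is undone and the second attempt sticks, which is the case the one-shot test cannot handle.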

Version-Release number of selected component (if applicable):

4.13

How reproducible:

rarely (98% pass rate)

Steps to Reproduce:

1.
2.
3.

Actual results:

the Unmanaged tests occasionally fail

Expected results:

all DisableStorageClass tests pass consistently

Additional info:

4.13 test results in sippy

Description of the problem:

The assisted-service completes the Agent installation, but the agent stage remains Rebooting.

The assisted-service log is full of:

time="2022-11-13T10:07:04Z" level=error msg="Failed updating host b70524e0-3660-4892-be9b-fd8f40a61a30 install progress" func="github.com/openshift/assisted-service/internal/controller/controllers.(*AgentReconciler).UpdateDay2InstallPogress" file="/remote-source/assisted-service/app/internal/controller/controllers/agent_controller.go:641" error="Can't set progress <Done> to host in status <added-to-existing-cluster>" 

DB shows:

installer=# select id, status_info, status from hosts;
                  id                  | status_info |          status           
--------------------------------------+-------------+---------------------------
 1e426d51-af09-442a-a8c4-1f0f4fa5676a | Rebooting   | added-to-existing-cluster
 835f43b3-596e-4f2d-85dd-f385f9113d64 | Rebooting   | added-to-existing-cluster
 b70524e0-3660-4892-be9b-fd8f40a61a30 | Rebooting   | added-to-existing-cluster 

 

How reproducible:

100%

Steps to reproduce:

1. Add workers to day2 cluster

2. watch the agent stage

 

Actual results:

 [root@ocp-edge99 ~]# oc get agent -A
NAMESPACE   NAME                                   CLUSTER   APPROVED   ROLE     STAGE
hyper-0     1e426d51-af09-442a-a8c4-1f0f4fa5676a   hyper-0   true       worker   Rebooting
hyper-0     835f43b3-596e-4f2d-85dd-f385f9113d64   hyper-0   true       worker   Rebooting
hyper-0     b70524e0-3660-4892-be9b-fd8f40a61a30   hyper-0   true       worker   Rebooting

 

Expected results:

The stage should be Done.
The stage should also go through the Configuring and Joined stages.

 

Description of problem:

During installation, the kube-apiserver is not stable, and keepalived-monitor can get stuck while trying to get nodes.

Version-Release number of selected component (if applicable):

4.12.0-ec5

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

At the moment we detect virtualized environments based on vendor name and product.

This has an obvious downside: the list needs to be kept up to date to return a reliable answer.

We should make it more dynamic; see systemd-detect-virt.
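
As a rough sketch of the dynamic approach, the snippet below shells out to systemd-detect-virt and interprets its output, where "none" means no virtualization was detected. This only illustrates the idea and is not the agent's implementation; the function names are hypothetical, and the sketch assumes the binary is available on the host.

```python
import shutil
import subprocess


def parse_detect_virt(output):
    """Map systemd-detect-virt output to (is_virtual, technology)."""
    tech = output.strip()
    if not tech or tech == "none":
        return (False, None)
    return (True, tech)


def detect_virtualization():
    """Run systemd-detect-virt; return None if the tool is not installed."""
    if shutil.which("systemd-detect-virt") is None:
        return None
    # The tool exits non-zero when nothing is detected, so do not raise on
    # a failing exit status; rely on the printed identifier instead.
    result = subprocess.run(
        ["systemd-detect-virt"], capture_output=True, text=True, check=False
    )
    return parse_detect_virt(result.stdout)


print(parse_detect_virt("kvm\n"))
print(parse_detect_virt("none\n"))
```

Keeping the parsing separate from the process call makes the logic testable on machines where the tool is missing.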

 

Kubernetes 1.27 removes the long-deprecated --container-runtime flag; see https://github.com/kubernetes/kubernetes/pull/114017

To ensure the upgrade path from 4.13 to 4.14 isn't affected, we need to backport the changes to both 4.14 and 4.13.

Tracker issue for bootimage bump in 4.13. This issue should block issues which need a bootimage bump to fix.

Description of problem:

Provisioning on ilo4-virtualmedia BMC driver fails with error: "Creating vfat image failed: Unexpected error while running command"

Version-Release number of selected component (if applicable):

4.13 (but will apply to older OpenShift versions too)

How reproducible:

Always

Steps to Reproduce:

1. Configure some nodes with ilo4-virtualmedia://
2. Attempt provisioning
3.

Actual results:

Provisioning fails with an error similar to: Failed to inspect hardware. Reason: unable to start inspection: Validation of image href https://10.1.235.67:6183/ilo/boot-9db13f93-861a-4d27-b20d-2c228559faa2.iso failed, reason: HTTPSConnectionPool(host='10.1.235.67', port=6183): Max retries exceeded with url: /ilo/boot-9db13f93-861a-4d27-b20d-2c228559faa2.iso (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)')))

Expected results:

Provisioning succeeds

Additional info:

This happens after a preceding issue with missing iLO driver configuration has been fixed (https://github.com/metal3-io/ironic-image/pull/402)

Description of the problem:

Currently, the message sent by the service is not correct: the term "master" should be replaced by "control plane", and the required number of control plane nodes should be exactly 3.

How reproducible:

100%

Steps to reproduce:

Create a cluster with 3 hosts

Actual results:

validations_info: "The cluster has a sufficient number of master candidates."

Expected results:

validations_info: "The cluster has the exact amount of dedicated control plane nodes."

This is a clone of issue OCPBUGS-8271. The following is the description of the original issue:

Description of problem:

The kube-controller-manager container cluster-policy-controller shows unusual error logs, such as:
I0214 10:49:34.698154       1 interface.go:71] Couldn't find informer for template.openshift.io/v1, Resource=templateinstances
I0214 10:49:34.698159       1 resource_quota_monitor.go:185] QuotaMonitor unable to use a shared informer for resource "template.openshift.io/v1, Resource=templateinstances": no informer found for template.openshift.io/v1, Resource=templateinstances

Version-Release number of selected component (if applicable):

 

How reproducible:

When the cluster-policy-controller restarts, you will see these logs.

Steps to Reproduce:

1. oc logs kube-controller-manager-master0 -n openshift-kube-controller-manager -c cluster-policy-controller

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

We need to update the operator to be in sync with the Kubernetes API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

One old machine is stuck in Deleting and many cluster operators become degraded when replacing masters on a cluster with the OVN network.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-02-175114

How reproducible:

Always, after several attempts

Steps to Reproduce:

1. Install a cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2023-01-02-175114   True        False         30m     Cluster version is 4.12.0-0.nightly-2023-01-02-175114
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      33m     
baremetal                                  4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
cloud-controller-manager                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      84m     
cloud-credential                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
cluster-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
cluster-autoscaler                         4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
config-operator                            4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
console                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      33m     
control-plane-machine-set                  4.12.0-0.nightly-2023-01-02-175114   True        False         False      79m     
csi-snapshot-controller                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
dns                                        4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
etcd                                       4.12.0-0.nightly-2023-01-02-175114   True        False         False      79m     
image-registry                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
ingress                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
insights                                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      21m     
kube-apiserver                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      77m     
kube-controller-manager                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      77m     
kube-scheduler                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      77m     
kube-storage-version-migrator              4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
machine-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      75m     
machine-approver                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
machine-config                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
marketplace                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
monitoring                                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      72m     
network                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      83m     
node-tuning                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
openshift-apiserver                        4.12.0-0.nightly-2023-01-02-175114   True        False         False      75m     
openshift-controller-manager               4.12.0-0.nightly-2023-01-02-175114   True        False         False      76m     
openshift-samples                          4.12.0-0.nightly-2023-01-02-175114   True        False         False      22m     
operator-lifecycle-manager                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2023-01-02-175114   True        False         False      75m     
platform-operators-aggregated              4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
service-ca                                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
storage                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   85m
huliu-aws4d2-fcks7-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   85m
huliu-aws4d2-fcks7-master-2                  Running   m6i.xlarge   us-east-2   us-east-2a   85m
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running   m6i.xlarge   us-east-2   us-east-2a   80m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running   m6i.xlarge   us-east-2   us-east-2a   80m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running   m6i.xlarge   us-east-2   us-east-2b   80m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3         3       3                       Active   86m

2. Edit the controlplanemachineset and change instanceType to another value to trigger a RollingUpdate
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE          TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-0                  Running        m6i.xlarge   us-east-2   us-east-2a   86m
huliu-aws4d2-fcks7-master-1                  Running        m6i.xlarge   us-east-2   us-east-2b   86m
huliu-aws4d2-fcks7-master-2                  Running        m6i.xlarge   us-east-2   us-east-2a   86m
huliu-aws4d2-fcks7-master-mbgz6-0            Provisioning   m5.xlarge    us-east-2   us-east-2a   5s
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running        m6i.xlarge   us-east-2   us-east-2a   81m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running        m6i.xlarge   us-east-2   us-east-2a   81m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running        m6i.xlarge   us-east-2   us-east-2b   81m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE      TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-0                  Deleting   m6i.xlarge   us-east-2   us-east-2a   92m
huliu-aws4d2-fcks7-master-1                  Running    m6i.xlarge   us-east-2   us-east-2b   92m
huliu-aws4d2-fcks7-master-2                  Running    m6i.xlarge   us-east-2   us-east-2a   92m
huliu-aws4d2-fcks7-master-mbgz6-0            Running    m5.xlarge    us-east-2   us-east-2a   5m36s
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running    m6i.xlarge   us-east-2   us-east-2a   87m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running    m6i.xlarge   us-east-2   us-east-2a   87m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running    m6i.xlarge   us-east-2   us-east-2b   87m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE         TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-1                  Running       m6i.xlarge   us-east-2   us-east-2b   101m
huliu-aws4d2-fcks7-master-2                  Running       m6i.xlarge   us-east-2   us-east-2a   101m
huliu-aws4d2-fcks7-master-mbgz6-0            Running       m5.xlarge    us-east-2   us-east-2a   15m
huliu-aws4d2-fcks7-master-nbt9g-1            Provisioned   m5.xlarge    us-east-2   us-east-2b   3m1s
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running       m6i.xlarge   us-east-2   us-east-2a   96m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running       m6i.xlarge   us-east-2   us-east-2a   96m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running       m6i.xlarge   us-east-2   us-east-2b   96m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE      TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-1                  Deleting   m6i.xlarge   us-east-2   us-east-2b   149m
huliu-aws4d2-fcks7-master-2                  Running    m6i.xlarge   us-east-2   us-east-2a   149m
huliu-aws4d2-fcks7-master-mbgz6-0            Running    m5.xlarge    us-east-2   us-east-2a   62m
huliu-aws4d2-fcks7-master-nbt9g-1            Running    m5.xlarge    us-east-2   us-east-2b   50m
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running    m6i.xlarge   us-east-2   us-east-2a   144m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running    m6i.xlarge   us-east-2   us-east-2a   144m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running    m6i.xlarge   us-east-2   us-east-2b   144m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE      TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-1                  Deleting   m6i.xlarge   us-east-2   us-east-2b   4h12m
huliu-aws4d2-fcks7-master-2                  Running    m6i.xlarge   us-east-2   us-east-2a   4h12m
huliu-aws4d2-fcks7-master-mbgz6-0            Running    m5.xlarge    us-east-2   us-east-2a   166m
huliu-aws4d2-fcks7-master-nbt9g-1            Running    m5.xlarge    us-east-2   us-east-2b   153m
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running    m6i.xlarge   us-east-2   us-east-2a   4h7m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running    m6i.xlarge   us-east-2   us-east-2a   4h7m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running    m6i.xlarge   us-east-2   us-east-2b   4h7m

3. master-1 is stuck in Deleting, many cluster operators (co) become degraded, and many pods cannot reach Running
liuhuali@Lius-MacBook-Pro huali-test % oc get co     
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2023-01-02-175114   True        True          True       9s      APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver (2 containers are waiting in pending apiserver-7b65bbc76b-mxl99 pod)...
baremetal                                  4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
cloud-controller-manager                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h11m   
cloud-credential                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
cluster-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
cluster-autoscaler                         4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
config-operator                            4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
console                                    4.12.0-0.nightly-2023-01-02-175114   False       False         False      150m    RouteHealthAvailable: console route is not admitted
control-plane-machine-set                  4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h7m    Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h9m    CSISnapshotControllerProgressing: Waiting for Deployment to deploy pods...
dns                                        4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
etcd                                       4.12.0-0.nightly-2023-01-02-175114   True        True          True       4h7m    GuardControllerDegraded: Missing operand on node ip-10-0-79-159.us-east-2.compute.internal...
image-registry                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m    
ingress                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m    
insights                                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      3h8m    
kube-apiserver                             4.12.0-0.nightly-2023-01-02-175114   True        True          True       4h5m    GuardControllerDegraded: Missing operand on node ip-10-0-79-159.us-east-2.compute.internal
kube-controller-manager                    4.12.0-0.nightly-2023-01-02-175114   True        False         True       4h5m    GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.19.115:9091: i/o timeout
kube-scheduler                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h5m    
kube-storage-version-migrator              4.12.0-0.nightly-2023-01-02-175114   True        False         False      162m    
machine-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h3m    
machine-approver                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
machine-config                             4.12.0-0.nightly-2023-01-02-175114   False       False         True       139m    Cluster not available for [{operator 4.12.0-0.nightly-2023-01-02-175114}]: error during waitForDeploymentRollout: [timed out waiting for the condition, deployment machine-config-controller is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
monitoring                                 4.12.0-0.nightly-2023-01-02-175114   False       True          True       144m    reconciling Prometheus Operator Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator: got 1 unavailable replicas
network                                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h11m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)...
node-tuning                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h7m    
openshift-apiserver                        4.12.0-0.nightly-2023-01-02-175114   False       True          False      151m    APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
openshift-controller-manager               4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h4m    
openshift-samples                          4.12.0-0.nightly-2023-01-02-175114   True        False         False      3h10m   
operator-lifecycle-manager                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2023-01-02-175114   True        False         False      2m44s   
platform-operators-aggregated              4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m    
service-ca                                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
storage                                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h2m    AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test % 


liuhuali@Lius-MacBook-Pro huali-test % oc get pod --all-namespaces|grep -v Running
NAMESPACE                                          NAME                                                                       READY   STATUS              RESTARTS         AGE
openshift-apiserver                                apiserver-5cbdf985f9-85z4t                                                 0/2     Init:0/1            0                155m
openshift-authentication                           oauth-openshift-5c46d6658b-lkbjj                                           0/1     Pending             0                156m
openshift-cloud-credential-operator                pod-identity-webhook-77bf7c646d-4rtn8                                      0/1     ContainerCreating   0                156m
openshift-cluster-api                              capa-controller-manager-d484bc464-lhqbk                                    0/1     ContainerCreating   0                156m
openshift-cluster-csi-drivers                      aws-ebs-csi-driver-controller-5668745dcb-jc7fm                             0/11    ContainerCreating   0                156m
openshift-cluster-csi-drivers                      aws-ebs-csi-driver-operator-5d6b9fbd77-827vs                               0/1     ContainerCreating   0                156m
openshift-cluster-csi-drivers                      shared-resource-csi-driver-operator-866d897954-z77gz                       0/1     ContainerCreating   0                156m
openshift-cluster-csi-drivers                      shared-resource-csi-driver-webhook-d794748dc-kctkn                         0/1     ContainerCreating   0                156m
openshift-cluster-samples-operator                 cluster-samples-operator-754758b9d7-nbcc9                                  0/2     ContainerCreating   0                156m
openshift-cluster-storage-operator                 csi-snapshot-controller-6d9c448fdd-wdb7n                                   0/1     ContainerCreating   0                156m
openshift-cluster-storage-operator                 csi-snapshot-webhook-6966f555f8-cbdc7                                      0/1     ContainerCreating   0                156m
openshift-console-operator                         console-operator-7d8567876b-nxgpj                                          0/2     ContainerCreating   0                156m
openshift-console                                  console-855f66f4f8-q869k                                                   0/1     ContainerCreating   0                156m
openshift-console                                  downloads-7b645b6b98-7jqfw                                                 0/1     ContainerCreating   0                156m
openshift-controller-manager                       controller-manager-548c7f97fb-bl68p                                        0/1     Pending             0                156m
openshift-etcd                                     installer-13-ip-10-0-76-132.us-east-2.compute.internal                     0/1     ContainerCreating   0                9m39s
openshift-etcd                                     installer-3-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h13m
openshift-etcd                                     installer-4-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h12m
openshift-etcd                                     installer-5-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h7m
openshift-etcd                                     installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h1m
openshift-etcd                                     installer-8-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-10-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                160m
openshift-etcd                                     revision-pruner-10-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                160m
openshift-etcd                                     revision-pruner-11-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                159m
openshift-etcd                                     revision-pruner-11-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                159m
openshift-etcd                                     revision-pruner-11-ip-10-0-79-159.us-east-2.compute.internal               0/1     Completed           0                156m
openshift-etcd                                     revision-pruner-12-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                156m
openshift-etcd                                     revision-pruner-12-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                156m
openshift-etcd                                     revision-pruner-12-ip-10-0-79-159.us-east-2.compute.internal               0/1     Completed           0                156m
openshift-etcd                                     revision-pruner-13-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                155m
openshift-etcd                                     revision-pruner-13-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                155m
openshift-etcd                                     revision-pruner-13-ip-10-0-76-132.us-east-2.compute.internal               0/1     ContainerCreating   0                10m
openshift-etcd                                     revision-pruner-13-ip-10-0-79-159.us-east-2.compute.internal               0/1     Completed           0                155m
openshift-etcd                                     revision-pruner-6-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-etcd                                     revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                3h57m
openshift-etcd                                     revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-9-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                166m
openshift-etcd                                     revision-pruner-9-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                166m
openshift-kube-apiserver                           installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h4m
openshift-kube-apiserver                           installer-7-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                168m
openshift-kube-apiserver                           installer-9-ip-10-0-76-132.us-east-2.compute.internal                      0/1     ContainerCreating   0                9m52s
openshift-kube-apiserver                           revision-pruner-6-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-kube-apiserver                           revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                3h59m
openshift-kube-apiserver                           revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                168m
openshift-kube-apiserver                           revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                168m
openshift-kube-apiserver                           revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                166m
openshift-kube-apiserver                           revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                166m
openshift-kube-apiserver                           revision-pruner-8-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                156m
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-48-21.us-east-2.compute.internal                 0/1     ContainerCreating   0                155m
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                155m
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-76-132.us-east-2.compute.internal                0/1     ContainerCreating   0                9m54s
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                155m
openshift-kube-controller-manager                  installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h11m
openshift-kube-controller-manager                  installer-7-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h7m
openshift-kube-controller-manager                  installer-8-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                169m
openshift-kube-controller-manager                  installer-8-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h4m
openshift-kube-controller-manager                  installer-8-ip-10-0-79-159.us-east-2.compute.internal                      0/1     Completed           0                156m
openshift-kube-controller-manager                  revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h13m
openshift-kube-controller-manager                  revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h10m
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h5m
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-76-132.us-east-2.compute.internal                0/1     ContainerCreating   0                4m36s
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                156m
openshift-kube-scheduler                           installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h11m
openshift-kube-scheduler                           installer-7-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                169m
openshift-kube-scheduler                           installer-7-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h10m
openshift-kube-scheduler                           installer-7-ip-10-0-79-159.us-east-2.compute.internal                      0/1     Completed           0                156m
openshift-kube-scheduler                           revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h13m
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h10m
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-76-132.us-east-2.compute.internal                0/1     ContainerCreating   0                4m36s
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                156m
openshift-machine-config-operator                  machine-config-controller-55b4d497b6-p89lb                                 0/2     ContainerCreating   0                156m
openshift-marketplace                              qe-app-registry-w8gnc                                                      0/1     ContainerCreating   0                148m
openshift-monitoring                               prometheus-operator-776bd79f6d-vz7q5                                       0/2     ContainerCreating   0                156m
openshift-multus                                   multus-admission-controller-5f88d77b65-nzmj5                               0/2     ContainerCreating   0                156m
openshift-oauth-apiserver                          apiserver-7b65bbc76b-mxl99                                                 0/1     Init:0/1            0                154m
openshift-operator-lifecycle-manager               collect-profiles-27879975-fpvzk                                            0/1     Completed           0                3h21m
openshift-operator-lifecycle-manager               collect-profiles-27879990-86rk8                                            0/1     Completed           0                3h6m
openshift-operator-lifecycle-manager               collect-profiles-27880005-bscc4                                            0/1     Completed           0                171m
openshift-operator-lifecycle-manager               collect-profiles-27880170-s8cbj                                            0/1     ContainerCreating   0                4m37s
openshift-operator-lifecycle-manager               packageserver-6f8f8f9d54-4r96h                                             0/1     ContainerCreating   0                156m
openshift-ovn-kubernetes                           ovnkube-master-lr9pk                                                       3/6     CrashLoopBackOff    23 (46s ago)     156m
openshift-route-controller-manager                 route-controller-manager-747bf8684f-5vhwx                                  0/1     Pending             0                156m
liuhuali@Lius-MacBook-Pro huali-test % 

Actual results:

RollingUpdate cannot complete successfully

Expected results:

RollingUpdate should complete successfully

Additional info:

Must gather - https://drive.google.com/file/d/1bvE1XUuZKLBGmq7OTXNVCNcFZkqbarab/view?usp=sharing

Must gather from another cluster that hit the same issue (also using the template ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-techpreview-ci, with OVN network): https://drive.google.com/file/d/1CqAJlqk2wgnEuMo3lLaObk4Nbxi82y_A/view?usp=sharing

Must gather from another cluster that hit the same issue (using the template ipi-on-aws/versioned-installer-private_cluster-sts-usgov-ci, with OVN network):
https://drive.google.com/file/d/1tnKbeqJ18SCAlJkS80Rji3qMu3nvN_O8/view?usp=sharing
 
It seems that clusters installed from the template ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-techpreview-ci with OVN network often hit this issue.

Description of problem:

When running the command `oc-mirror --config config-oci-target.yaml  docker://localhost:5000  --use-oci-feature  --dest-use-http  --dest-skip-tls`, the command exits with code 0 but prints an error such as: unable to parse reference oci://mno/redhat-operator-index:v4.12: lstat /mno: no such file or directory.

Version-Release number of selected component (if applicable):

oc-mirror version 
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202303011628.p0.g2e3885b.assembly.stream-2e3885b", GitCommit:"2e3885b469ee7d895f25833b04fd609955a2a9f6", GitTreeState:"clean", BuildDate:"2023-03-01T16:49:12Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1. Use an ImageSetConfiguration like the following:
cat config-oci-target.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /home/ocmirrortest/0302/60597
mirror:
  operators:
  - catalog: oci:///home/ocmirrortest/noo/redhat-operator-index
    targetCatalog: mno/redhat-operator-index
    targetTag: v4.12
    packages:
    - name: aws-load-balancer-operator
2. Run `oc-mirror --config config-oci-target.yaml  docker://localhost:5000  --use-oci-feature  --dest-use-http  --dest-skip-tls`


Actual results:

1. The command exits with code 0 but prints unexpected logs like:
sha256:95c45fae0ca9e9bee0fa2c13652634e726d8133e4e3009b363fcae6814b3461d localhost:5000/albo/aws-load-balancer-rhel8-operator:95c45f
sha256:ab38b37c14f7f0897e09a18eca4a232a6c102b76e9283e401baed832852290b5 localhost:5000/albo/aws-load-balancer-rhel8-operator:ab38b3
info: Mirroring completed in 43.87s (28.5MB/s)
Rendering catalog image "localhost:5000/mno/redhat-operator-index:v4.12" with file-based catalog 
Writing image mapping to oc-mirror-workspace/results-1677743154/mapping.txt
Writing CatalogSource manifests to oc-mirror-workspace/results-1677743154
Writing ICSP manifests to oc-mirror-workspace/results-1677743154
unable to parse reference oci://mno/redhat-operator-index:v4.12: lstat /mno: no such file or directory

Expected results:

No such error should be logged.

 

This is a clone of issue OCPBUGS-10887. The following is the description of the original issue:

Description of problem:

Following https://bugzilla.redhat.com/show_bug.cgi?id=2102765 and https://issues.redhat.com/browse/OCPBUGS-2140, problems with OpenID Group sync have been resolved.

However, the problem documented in https://bugzilla.redhat.com/show_bug.cgi?id=2102765 still exists: groups that have been removed are still part of the cache in oauth-apiserver, causing a panic in the respective components and login failures for potentially affected users.

So in general, it looks like that oauth-apiserver cache is not properly refreshing or handling the OpenID Groups being synced.

E1201 11:03:14.625799       1 runtime.go:76] Observed a panic: interface conversion: interface {} is nil, not *v1.Group
goroutine 3706798 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:103 +0xb0
panic({0x1aeab00, 0xc001400390})
    runtime/panic.go:838 +0x207
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1.1()
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:80 +0x2a
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1()
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:89 +0x250
panic({0x1aeab00, 0xc001400390})
    runtime/panic.go:838 +0x207
github.com/openshift/library-go/pkg/oauth/usercache.(*GroupCache).GroupsFor(0xc00081bf18?, {0xc000c8ac03?, 0xc001400360?})
    github.com/openshift/library-go@v0.0.0-20211013122800-874db8a3dac9/pkg/oauth/usercache/groups.go:47 +0xe7
github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).processGroups(0xc0002c8880, {0xc0005d4e60, 0xd}, {0xc000c8ac03, 0x7}, 0x1?)
    github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:101 +0xb5
github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).UserFor(0xc0002c8880, {0x20f3c40, 0xc000e18bc0})
    github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:83 +0xf4
github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).login(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0xc0015d8200, 0xc001438140?, {0xc0000e7ce0, 0x150})
    github.com/openshift/oauth-server/pkg/oauth/external/handler.go:209 +0x74f
github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).ServeHTTP(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0x0?)
    github.com/openshift/oauth-server/pkg/oauth/external/handler.go:180 +0x74a
net/http.(*ServeMux).ServeHTTP(0x1c9dda0?, {0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    net/http/server.go:2462 +0x149
github.com/openshift/oauth-server/pkg/server/headers.WithRestoreAuthorizationHeader.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:27 +0x10f
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0xc0005e0280?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authorization.go:64 +0x498
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x2f6cea0?, {0x20eebb0?, 0xc00041b058?}, 0x3?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/maxinflight.go:187 +0x2a4
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0x11?, {0x20eebb0?, 0xc00041b058?}, 0x1aae340?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/impersonation.go:50 +0x21c
net/http.HandlerFunc.ServeHTTP(0xc000d52120?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0xc0015d8100?, {0x20eebb0?, 0xc00041b058?}, 0xc000531930?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1({0x7fae682a40d8?, 0xc00041b048}, 0x9dbbaa?)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:111 +0x549
net/http.HandlerFunc.ServeHTTP(0xc00003def0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfd00?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authentication.go:80 +0x8b9
net/http.HandlerFunc.ServeHTTP(0x20f0f20?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:88 +0x46b
net/http.HandlerFunc.ServeHTTP(0xc0019f5890?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc000848764?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithCORS.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/cors.go:75 +0x10b
net/http.HandlerFunc.ServeHTTP(0xc00149a380?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc0008487d0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1()
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:108 +0xa2
created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:94 +0x2cc

goroutine 3706802 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x19eb780?, 0xc001206e20})
    k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:74 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0016aec60, 0x1, 0x1560f26?})
    k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:48 +0x75
panic({0x19eb780, 0xc001206e20})
    runtime/panic.go:838 +0x207
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc0005047c8, {0x20eecd0?, 0xc0010fae00}, 0xdf8475800?)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:114 +0x452
k8s.io/apiserver/pkg/endpoints/filters.withRequestDeadline.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69d00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_deadline.go:101 +0x494
net/http.HandlerFunc.ServeHTTP(0xc0016af048?, {0x20eecd0?, 0xc0010fae00?}, 0xc0000bc138?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69d00)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/waitgroup.go:59 +0x177
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x7fae705daff0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69c00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit_annotations.go:37 +0x230
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69b00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/warning.go:35 +0x2bb
net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20eecd0?, 0xc0010fae00?}, 0xd?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1({0x20eecd0, 0xc0010fae00}, 0x0?)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/cachecontrol.go:31 +0x126
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/httplog.WithLogging.func1({0x20ef480?, 0xc001c20620}, 0xc000e69a00)
    k8s.io/apiserver@v0.22.2/pkg/server/httplog/httplog.go:103 +0x518
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1({0x20ef480, 0xc001c20620}, 0xc000e69900)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/requestinfo.go:39 +0x316
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3f70?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withRequestReceivedTimestampWithClock.func1({0x20ef480, 0xc001c20620}, 0xc000e69800)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_received_time.go:38 +0x27e
net/http.HandlerFunc.ServeHTTP(0x419e2c?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3e40?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1({0x20ef480?, 0xc001c20620?}, 0xc0004ff600?)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/wrap.go:74 +0xb1
net/http.HandlerFunc.ServeHTTP(0x1c05260?, {0x20ef480?, 0xc001c20620?}, 0x8?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withAuditID.func1({0x20ef480, 0xc001c20620}, 0xc000e69600)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/with_auditid.go:66 +0x40d
net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20ef480?, 0xc001c20620?}, 0xd?)
    net/http/server.go:2084 +0x2f
github.com/openshift/oauth-server/pkg/server/headers.WithPreserveAuthorizationHeader.func1({0x20ef480, 0xc001c20620}, 0xc000e69600)
    github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:16 +0xe8
net/http.HandlerFunc.ServeHTTP(0xc0016af9d0?, {0x20ef480?, 0xc001c20620?}, 0x16?)
    net/http/server.go:2084 +0x2f
github.com/openshift/oauth-server/pkg/server/headers.WithStandardHeaders.func1({0x20ef480, 0xc001c20620}, 0x4d55c0?)
    github.com/openshift/oauth-server/pkg/server/headers/headers.go:30 +0x18f
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20ef480?, 0xc001c20620?}, 0xc0016afac8?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc00098d622?, {0x20ef480?, 0xc001c20620?}, 0xc000401000?)
    k8s.io/apiserver@v0.22.2/pkg/server/handler.go:189 +0x2b
net/http.serverHandler.ServeHTTP({0xc0019f5170?}, {0x20ef480, 0xc001c20620}, 0xc000e69600)
    net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc0002b1720, {0x20f0f58, 0xc0001e8120})
    net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
    net/http/server.go:3071 +0x4db

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.11.13

How reproducible:

- Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.11
2. Configure OpenID Group Sync (as per https://docs.openshift.com/container-platform/4.11/authentication/identity_providers/configuring-oidc-identity-provider.html#identity-provider-oidc-CR_configuring-oidc-identity-provider)
3. Have users with hundreds of groups
4. Log in and, after a while, remove some groups from the user in the IDP and from OpenShift Container Platform
5. Try to log in again and see the panic in oauth-apiserver

Actual results:

The user is unable to log in, and the oauth pods report a panic as shown above.

Expected results:

oauth-apiserver should invalidate the cache quickly to remove potentially invalid references to non-existent groups.
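
The panic above (`interface conversion: interface {} is nil, not *v1.Group`) is consistent with a lookup that returns an empty entry for a group that no longer exists. As a minimal Python sketch of the defensive pattern (not the library-go code; `GroupCache`, its index layout, and `delete_group_stale` are hypothetical names made up for illustration), a lookup can skip stale index entries instead of assuming every entry resolves to a group:

```python
from typing import Dict, List, Set


class GroupCache:
    """Toy user -> groups cache (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._groups: Dict[str, Set[str]] = {}   # group name -> member users
        self._by_user: Dict[str, Set[str]] = {}  # user -> group names (index)

    def add_group(self, name: str, users: List[str]) -> None:
        self._groups[name] = set(users)
        for user in users:
            self._by_user.setdefault(user, set()).add(name)

    def delete_group_stale(self, name: str) -> None:
        # Simulates the suspected failure mode: the group is removed from the
        # store, but the per-user index still points at it.
        self._groups.pop(name, None)

    def groups_for(self, user: str) -> List[str]:
        found = []
        for name in sorted(self._by_user.get(user, set())):
            if name not in self._groups:
                # Stale reference: skip it instead of failing the login.
                continue
            found.append(name)
        return found
```

With this shape, a user whose group was deleted still gets the remaining groups, and an unknown user gets an empty list rather than an error.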

Additional info:

 

Description of problem:

A new machine is stuck in Provisioning after one zone is deleted from the CPMS on GCP. The machine controller reports "The resource 'projects/openshift-qe/global/networks/zhsun-gcp-wn984-network' was not found".

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-12-024338

How reproducible:

Always

Steps to Reproduce:

1. Set up a GCP private cluster. By default the CPMS contains four failureDomains (a, b, c, f), and the 3 masters are in a, b, c:
      failureDomains:
        gcp:
        - zone: us-central1-a
        - zone: us-central1-b
        - zone: us-central1-c
        - zone: us-central1-f
$ oc get machine       
NAME                             PHASE          TYPE            REGION        ZONE            AGE
zhsun-gcp-wn984-master-0         Running        n2-standard-4   us-central1   us-central1-a   33m
zhsun-gcp-wn984-master-1         Running        n2-standard-4   us-central1   us-central1-b   33m
zhsun-gcp-wn984-master-2         Running        n2-standard-4   us-central1   us-central1-c   33m
zhsun-gcp-wn984-worker-a-hlcmd   Running        n2-standard-4   us-central1   us-central1-a   27m
zhsun-gcp-wn984-worker-b-4249t   Running        n2-standard-4   us-central1   us-central1-b   27m
zhsun-gcp-wn984-worker-c-8qcjq   Running        n2-standard-4   us-central1   us-central1-c   27m
2. Delete failureDomain a. The failureDomains now look like this:
      failureDomains:
        gcp:
        - zone: us-central1-b
        - zone: us-central1-c
        - zone: us-central1-f
3. Check machines

Actual results:

New master stuck in Provisioning status. 
$ oc get machine            
NAME                             PHASE          TYPE            REGION        ZONE            AGE
zhsun-gcp-wn984-master-0         Running        n2-standard-4   us-central1   us-central1-a   85m
zhsun-gcp-wn984-master-1         Running        n2-standard-4   us-central1   us-central1-b   85m
zhsun-gcp-wn984-master-2         Running        n2-standard-4   us-central1   us-central1-c   85m
zhsun-gcp-wn984-master-mb7rw-0   Provisioning   n2-standard-4   us-central1   us-central1-f   52m
zhsun-gcp-wn984-worker-a-hlcmd   Running        n2-standard-4   us-central1   us-central1-a   79m
zhsun-gcp-wn984-worker-b-4249t   Running        n2-standard-4   us-central1   us-central1-b   79m
zhsun-gcp-wn984-worker-c-8qcjq   Running        n2-standard-4   us-central1   us-central1-c   79m
 $ oc logs -f machine-api-controllers-6678fc6587-hdl5k -c machine-controller
E0213 09:08:00.059876       1 actuator.go:54] zhsun-gcp-wn984-master-mb7rw-0 error: zhsun-gcp-wn984-master-mb7rw-0: reconciler failed to Update machine: failed to register instance to instance group: failed to ensure that instance group zhsun-gcp-wn984-master-us-central1-f is a proper instance group: failed to register the new instance group named zhsun-gcp-wn984-master-us-central1-f: instanceGroupInsert request failed: googleapi: Error 404: The resource 'projects/openshift-qe/global/networks/zhsun-gcp-wn984-network' was not found, notFound
E0213 09:08:00.059929       1 controller.go:315] zhsun-gcp-wn984-master-mb7rw-0: error updating machine: zhsun-gcp-wn984-master-mb7rw-0: reconciler failed to Update machine: failed to register instance to instance group: failed to ensure that instance group zhsun-gcp-wn984-master-us-central1-f is a proper instance group: failed to register the new instance group named zhsun-gcp-wn984-master-us-central1-f: instanceGroupInsert request failed: googleapi: Error 404: The resource 'projects/openshift-qe/global/networks/zhsun-gcp-wn984-network' was not found, notFound, retrying in 30s seconds
I0213 09:08:00.060001       1 recorder.go:103] events "msg"="zhsun-gcp-wn984-master-mb7rw-0: reconciler failed to Update machine: failed to register instance to instance group: failed to ensure that instance group zhsun-gcp-wn984-master-us-central1-f is a proper instance group: failed to register the new instance group named zhsun-gcp-wn984-master-us-central1-f: instanceGroupInsert request failed: googleapi: Error 404: The resource 'projects/openshift-qe/global/networks/zhsun-gcp-wn984-network' was not found, notFound" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsun-gcp-wn984-master-mb7rw-0","uid":"b973d674-dd26-477d-a68d-6bcedc5f1011","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"36164"} "reason"="FailedUpdate" "type"="Warning"

Expected results:

New master should be Running

Additional info:

 

Description of problem:

update sample imagestreams with latest 4.11 image using specific image tag reference

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

User Story:

As a dev, I want to be able to evaluate alternatives to `golint` which is deprecated and frozen [1]. This alternative should be well supported and in constant development and has to be able to run in CI for presubmit jobs

so that I can achieve

  • find bugs and performance issues, get suggested simplifications, and enforce style rules during development.

Acceptance Criteria:

Linting alternatives are considered and one or more are chosen to be integrated into our development tooling (`openshift/hack/*`).

(optional) Out of Scope:

 

Engineering Details:

 

This does not require a design proposal.
This does not require a feature gate.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27568/pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade/1595382748753694720

"[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]" - this one appears but shows no results, curious why that is but it might be getting ignored in sippy somehow, it's a meta test that encompasses others, we may skip it

But the actual failed test does not appear at all in the risk analysis html?

"disruption_tests: [sig-network-edge] Application behind service load balancer with PDB remains available using reused connections"

Description of the problem:

Regression:
Currently, the message the user sees when the cluster doesn't have the required composition of control plane nodes and workers does not help them solve the issue, and it also uses the obsolete term "master".

How reproducible:

1000%

Steps to reproduce:

1. create a cluster with no hosts

2. watch the message in the "Host discovery" step

3. enjoy

Actual results:

"Clusters must have at most %d dedicated masters. Please check your configuration and add or remove hosts as needed to meet the requirement."

Expected results:

"Clusters must have exactly %d dedicated control plane nodes and optionally additional workers. Please check your configuration and add or remove hosts to meet the above requirement."

Allison Wolfe please advise.

This is a clone of issue OCPBUGS-12951. The following is the description of the original issue:

Description of problem:

With 4.13.0-RC.6, the cluster enters "Cluster status: error" while installing with the agent-based installer.
After the read disk stage, the cluster status turns to "error".

Version-Release number of selected component (if applicable):


How reproducible:

Create an image with the attached install config and agent config files, and boot the node with this image

Steps to Reproduce:

1. Create an image with the attached install config and agent config files, and boot the node with this image

Actual results:

Cluster status: error

Expected results:

Should continue with cluster status: installing 

Additional info:


User Story:

As a developer, I want to:

  • Add a line to the documentation about removing the CPMS manifest before creating the cluster

so that I can

  • Inform users not to create the manifest.

Acceptance Criteria:

Description of criteria:

  • Installer internal docs for Azure and GCP UPI are updated with a remove command.

This does not require a design proposal.
This does not require a feature gate.

Description of problem:

When using an OperatorGroup attached to a service account, AND if there is a secret present in the namespace, the operator installation will fail with the message:
the service account does not have any API secret sa=testx-ns/testx-sa
This issue seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=2094303, which was resolved in 4.11.0; the new element is that the presence of a secret in the namespace causes the issue.
The name of the secret seems significant, suggesting that something depends on the order in which secrets are listed. For example, if the secret in the namespace is called "asecret", the problem does not occur. If it is called "zsecret", the problem always occurs.
"zsecret" is not a "kubernetes.io/service-account-token". The issue raised here relates to Opaque secrets: zsecret is an Opaque secret. The issue may apply to other types of secrets, but specifically, when there is an Opaque secret present in the namespace, the operator install fails as described. I ought to be allowed to have an Opaque secret present in the namespace where I am installing the operator.
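
The name-dependent behavior described above would be consistent with code that picks a secret based on listing order rather than on its type. As a hedged Python sketch (not the OLM implementation; `find_sa_token_secret` and the token secret name in the usage note are hypothetical), selecting the service account token by both annotation and type makes the result independent of how secrets happen to sort:

```python
from typing import Dict, List, Optional

SA_TOKEN_TYPE = "kubernetes.io/service-account-token"
SA_NAME_ANNOTATION = "kubernetes.io/service-account.name"


def find_sa_token_secret(secrets: List[Dict], sa_name: str) -> Optional[Dict]:
    """Return the first secret that is a token for sa_name, or None."""
    for secret in secrets:
        annotations = secret.get("metadata", {}).get("annotations", {})
        if annotations.get(SA_NAME_ANNOTATION) != sa_name:
            continue  # belongs to another service account, or to none
        if secret.get("type") != SA_TOKEN_TYPE:
            continue  # e.g. an Opaque secret such as "zsecret"
        return secret
    return None
```

For example, given an Opaque `zsecret` annotated with `testx-sa` and a token secret (say `testx-sa-token-abcde`, a made-up name), the function returns the token secret regardless of which one is listed first.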

Version-Release number of selected component (if applicable):

4.11.0 & 4.11.1

How reproducible:

100% reproducible

Steps to Reproduce:

1. Create namespace: oc new-project testx-ns
2. oc apply -f api-secret-issue.yaml

Actual results:

 

Expected results:

 

Additional info:

API YAML:

cat api-secret-issue.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: zsecret
  namespace: testx-ns
  annotations:
   kubernetes.io/service-account.name: testx-sa
type: Opaque
stringData:
  mykey: mypass

apiVersion: v1
kind: ServiceAccount
metadata:
  name: testx-sa
  namespace: testx-ns

kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: testx-og
  namespace: testx-ns
spec:
  serviceAccountName: "testx-sa"
  targetNamespaces:
  - testx-ns

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: testx-role
  namespace: testx-ns
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testx-rolebinding
  namespace: testx-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: testx-role
subjects:
- kind: ServiceAccount
  name: testx-sa
  namespace: testx-ns

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: testx-ns
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace

Description of problem:

The ingress operator prints a badly formatted log message during startup, from the handleSingleNode4Dot11Upgrade function.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

100%

Steps to Reproduce:

1. Install 4.10 single node cluster
2. Upgrade to 4.11

Actual results:

Ingress operator prints badly formatted log message

Expected results:

Ingress operator prints correctly formatted log message

Additional info:

 

This is a clone of issue OCPBUGS-7841. The following is the description of the original issue:

Description of problem:

The hypershift_hostedclusters_failure_conditions metric produced by the HyperShift operator does not report a value of 0 for conditions that no longer apply. The result is that if a hostedcluster had a failure condition at a given point, but that condition has gone away, the metric still reports a count for that condition.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create a HostedCluster, watch the hypershift_hostedclusters_failure_conditions metric as failure conditions occur.
2.
3.

Actual results:

A cluster count of 1 with a failure condition is reported even if the failure condition no longer applies.

Expected results:

Once failure conditions no longer apply, 0 clusters with those conditions should be reported.

Additional info:

The metric should report an accurate count for each possible failure condition of all clusters at any given time.
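
One way to get the behavior described above is to start every reporting cycle from a zero count for each known condition, so a condition that has cleared is published as 0 instead of keeping its last value. The following Python sketch illustrates that idea only; it is not the HyperShift operator code, and the condition names `ConditionA` and `ConditionB` are placeholders:

```python
from typing import Dict, Iterable, List


def failure_condition_counts(
    clusters: Iterable[List[str]], known_conditions: List[str]
) -> Dict[str, int]:
    """Count clusters per failure condition, including explicit zeros.

    `clusters` yields, for each cluster, the failure conditions that
    currently apply to it.
    """
    counts = {condition: 0 for condition in known_conditions}
    for active_conditions in clusters:
        for condition in active_conditions:
            if condition in counts:
                counts[condition] += 1
    return counts
```

Publishing the full dictionary on every cycle, zeros included, keeps the reported series in step with the current state of all clusters.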

Description of problem:
When resizing the browser window, the PipelineRun task status bar would overlap the status text that says "Succeeded" in the screenshot.

Actual results:
Status text is overlapped by the task status bar

Expected results:
Status text breaks to a newline or gets shortened by "..."

Description of the problem:

Proliant Gen 11 always reports the serial number "PCA_number.ACC", causing all hosts to register with the same UUID.

How reproducible:

100%

Steps to reproduce:

1. Boot two Proliant Gen 11 hosts

2. See that both hosts are updating a single host entry in the service

Actual results:

All hosts with this hardware are assigned the same UUID

Expected results:

Each host should have a unique UUID
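
As an illustration of one possible approach (an assumption, not necessarily the fix that was shipped), a host identifier can fall back to another stable attribute when the serial number is a known placeholder. In this Python sketch, `host_id`, the `PLACEHOLDER_SERIALS` set, and the choice of MAC addresses as the fallback are all hypothetical; only the placeholder value itself comes from the report above:

```python
import uuid
from typing import Iterable

# Hypothetical deny-list; the single entry is the value from the bug report.
PLACEHOLDER_SERIALS = {"PCA_number.ACC"}


def host_id(serial: str, mac_addresses: Iterable[str]) -> uuid.UUID:
    """Derive a deterministic host ID, ignoring placeholder serial numbers."""
    serial = serial.strip()
    if serial and serial not in PLACEHOLDER_SERIALS:
        seed = f"serial:{serial}"
    else:
        # Fall back to the sorted, lower-cased MAC addresses of the host.
        macs = sorted(mac.lower() for mac in mac_addresses)
        if not macs:
            raise ValueError("no usable serial number or MAC address")
        seed = "mac:" + ",".join(macs)
    return uuid.uuid5(uuid.NAMESPACE_OID, seed)
```

Two hosts that both report `PCA_number.ACC` then get different IDs as long as their MAC addresses differ, while a host with a real serial number keeps an ID that does not depend on its network interfaces.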

Description of problem:

Altering the ImageURL or ExtraKernelParams values in a PreprovisioningImage CR should cause the host to boot using the new image or parameters, but currently the host doesn't respond at all to changes in those fields.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-11-225449

How reproducible:

Always

Steps to Reproduce:

1. Create a BMH
2. Set the preprovisioning image URL
3. Allow host to boot
4. Change image URL or extra kernel params

Actual results:

Host does not reboot

Expected results:

Host reboots using the newly provided image or parameters
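
The expected behavior amounts to comparing what the host was booted with against what the PreprovisioningImage currently advertises. The Python sketch below shows only that comparison; `BootSource` and `needs_reboot` are hypothetical names, and this is not the baremetal-operator logic:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootSource:
    """The inputs that determine what a host boots (illustrative only)."""

    image_url: str
    extra_kernel_params: str


def needs_reboot(booted: BootSource, desired: BootSource) -> bool:
    # Any difference in the image URL or the kernel parameters means the
    # running host no longer matches the requested configuration.
    return booted != desired
```

A controller following this pattern would record the `BootSource` used at boot time and trigger a reboot whenever `needs_reboot` returns true for the latest status.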

Additional info:
BMH:

- apiVersion: metal3.io/v1alpha1
  kind: BareMetalHost
  metadata:
    annotations:
      inspect.metal3.io: disabled
    creationTimestamp: "2023-01-13T16:06:12Z"
    finalizers:
    - baremetalhost.metal3.io
    generation: 4
    labels:
      infraenvs.agent-install.openshift.io: myinfraenv
    name: ostest-extraworker-0
    namespace: assisted-installer
    resourceVersion: "61077"
    uid: 444d7246-3d0a-4188-a8c4-f407ee4f741f
  spec:
    automatedCleaningMode: disabled
    bmc:
      address: redfish+http://192.168.111.1:8000/redfish/v1/Systems/6f45ba9f-251a-46f7-a7a8-10c6ca9231dd
      credentialsName: ostest-extraworker-0-bmc-secret
    bootMACAddress: 00:b2:71:b8:14:4f
    customDeploy:
      method: start_assisted_install
    online: true
  status:
    errorCount: 0
    errorMessage: ""
    goodCredentials:
      credentials:
        name: ostest-extraworker-0-bmc-secret
        namespace: assisted-installer
      credentialsVersion: "44478"
    hardwareProfile: unknown
    lastUpdated: "2023-01-13T16:06:22Z"
    operationHistory:
      deprovision:
        end: null
        start: null
      inspect:
        end: null
        start: null
      provision:
        end: null
        start: "2023-01-13T16:06:22Z"
      register:
        end: "2023-01-13T16:06:22Z"
        start: "2023-01-13T16:06:12Z"
    operationalStatus: OK
    poweredOn: false
    provisioning:
      ID: b5e8c1a9-8061-420b-8c32-bb29a8b35a0b
      bootMode: UEFI
      image:
        url: ""
      raid:
        hardwareRAIDVolumes: null
        softwareRAIDVolumes: []
      rootDeviceHints:
        deviceName: /dev/sda
      state: provisioning
    triedCredentials:
      credentials:
        name: ostest-extraworker-0-bmc-secret
        namespace: assisted-installer
      credentialsVersion: "44478"
 

Preprovisioning Image (with changes)

- apiVersion: metal3.io/v1alpha1
  kind: PreprovisioningImage
  metadata:
    creationTimestamp: "2023-01-13T16:06:22Z"
    generation: 1
    labels:
      infraenvs.agent-install.openshift.io: myinfraenv
    name: ostest-extraworker-0
    namespace: assisted-installer
    ownerReferences:
    - apiVersion: metal3.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: BareMetalHost
      name: ostest-extraworker-0
      uid: 444d7246-3d0a-4188-a8c4-f407ee4f741f
    resourceVersion: "56838"
    uid: 37f4da76-0d1c-4e05-b618-2f0ab9d5c974
  spec:
    acceptFormats:
    - initrd
    architecture: x86_64
  status:
    architecture: x86_64
    conditions:
    - lastTransitionTime: "2023-01-13T16:34:26Z"
      message: Image has been created
      observedGeneration: 1
      reason: ImageCreated
      status: "True"
      type: Ready
    - lastTransitionTime: "2023-01-13T16:06:24Z"
      message: Image has been created
      observedGeneration: 1
      reason: ImageCreated
      status: "False"
      type: Error
    extraKernelParams: coreos.live.rootfs_url=https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/boot-artifacts/rootfs?arch=x86_64&version=4.12
      rd.break=initqueue
    format: initrd
    imageUrl: https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/images/79ef3924-ee94-42c6-96c3-2d784283120d/pxe-initrd?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI3OWVmMzkyNC1lZTk0LTQyYzYtOTZjMy0yZDc4NDI4MzEyMGQifQ.YazOZS01NoI7g_eVhLmRNmM6wKVVaZJdWbxuePia46Fo0GMLYtSOp1JTvtcStoT51g7VkSnTf8LBJ0zmbGu3HQ&arch=x86_64&version=4.12
    kernelUrl: https://assisted-image-service-assisted-installer.apps.ostest.test.metalkube.org/boot-artifacts/kernel?arch=x86_64&version=4.12
    networkData: {}

This was found while testing ZTP so in this case the assisted-service controllers are altering the preprovisioning image in response to changes made in the assisted-specific CRs, but I don't think this issue is ZTP specific.
 

Description of the problem:

Installing an IPv6 SNO cluster with DHCP networking is failing because the node contains the wrong IP address after joining the cluster.

How reproducible:

2/2

 

OCP Version:
4.12.0-rc.7

 

Operator Version:
registry-proxy.engineering.redhat.com/rh-osbs/multicluster-engine-mce-operator-bundle:v2.2.0-252

 

Steps to reproduce:

1. Install an SNO cluster with IPv6 DHCP networking

Actual results:

Installation fails because certificate is not valid for the IP address

 

Expected results:

SNO node has the correct IP address and installation completes successfully

Discovered in the must gather kubelet_service.log from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1586093220087992320

It appears the guard pod names are too long, and being truncated down to where they will collide with those from the other masters.

From kubelet logs in this run:

❯ grep openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste kubelet_service.log
Oct 28 23:58:55.693391 ci-op-3hj6pnwf-4f6ab-lv57z-master-1 kubenswrapper[1657]: E1028 23:58:55.693346    1657 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-1" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:03.735726 ci-op-3hj6pnwf-4f6ab-lv57z-master-0 kubenswrapper[1670]: E1028 23:59:03.735671    1670 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-0" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:11.168082 ci-op-3hj6pnwf-4f6ab-lv57z-master-2 kubenswrapper[1667]: E1028 23:59:11.168041    1667 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-2" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"

This also looks to be happening for openshift-kube-scheduler-guard, kube-controller-manager-guard, possibly others.

Looks like they should be truncated further to make room for random suffixes in https://github.com/openshift/library-go/blame/bd9b0e19121022561dcd1d9823407cd58b2265d0/pkg/operator/staticpod/controller/guard/guard_controller.go#L97-L98

Unsure of the implications here, it looks a little scary.
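
The collision can be reproduced outside the kubelet. The sketch below is not the library-go code: truncate only imitates the 63-byte cut reported in the log, and truncateWithSuffix is a hypothetical alternative that reserves room for a short hash of the full name so that the shortened names stay distinct.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hostnameMaxLen mirrors the hostnameMaxLen=63 value in the kubelet log.
const hostnameMaxLen = 63

// truncate cuts name to at most max bytes, imitating what the log reports.
func truncate(name string, max int) string {
	if len(name) <= max {
		return name
	}
	return name[:max]
}

// truncateWithSuffix is a hypothetical alternative: it cuts the name shorter
// and appends a 5-character hash of the full name to keep results distinct.
func truncateWithSuffix(name string, max int) string {
	if len(name) <= max {
		return name
	}
	sum := sha256.Sum256([]byte(name))
	suffix := "-" + hex.EncodeToString(sum[:])[:5]
	return name[:max-len(suffix)] + suffix
}

func main() {
	prefix := "openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-"
	for _, n := range []string{"0", "1", "2"} {
		name := prefix + n
		fmt.Println(truncate(name, hostnameMaxLen))
		fmt.Println(truncateWithSuffix(name, hostnameMaxLen))
	}
}
```

The plain truncation prints the same hostname for all three masters, because the distinguishing digit sits past byte 63.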

Description of the problem:
In some scenarios, if a host inventory is nil, it is possible for the noIPCollisionsInNetwork host validation to crash with a nil pointer dereference panic.
 
How reproducible:
Although it is unclear exactly how this is reproduced (most likely a timing issue), it is understood that the validation is missing a nil check that would prevent this problem.

Actual results:
There are intermittent crashes of the assisted-service when this validation panics
 

Expected results:
The validation should not panic.
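
A minimal sketch of the missing guard follows. The Host and Inventory types and the noIPCollisions function are simplified, hypothetical stand-ins and not the assisted-service code; the point is only that the validation returns early when the inventory is nil instead of dereferencing it.

```go
package main

import (
	"errors"
	"fmt"
)

// Inventory and Host are hypothetical, reduced models for illustration.
type Inventory struct {
	IPs []string
}

type Host struct {
	Name      string
	Inventory *Inventory // may still be nil when the validation runs
}

var errNoInventory = errors.New("host inventory not yet available")

// noIPCollisions reports whether none of the host's IPs are already in use.
// A nil host or nil inventory yields an error rather than a panic.
func noIPCollisions(h *Host, used map[string]bool) (bool, error) {
	if h == nil || h.Inventory == nil {
		return false, errNoInventory
	}
	for _, ip := range h.Inventory.IPs {
		if used[ip] {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	used := map[string]bool{"192.168.122.10": true}

	_, err := noIPCollisions(&Host{Name: "pending"}, used)
	fmt.Println("pending host:", err)

	ready := &Host{Name: "ready", Inventory: &Inventory{IPs: []string{"192.168.122.11"}}}
	ok, err := noIPCollisions(ready, used)
	fmt.Println("ready host:", ok, err)
}
```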

This is a clone of issue OCPBUGS-8305. The following is the description of the original issue:

Description of problem:

machine-config-operator will fail on clusters deployed with IPI on Power Virtual Server with the following error:

Cluster not available for []: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: spec.infra.status.platformStatus.powervs.resourceGroup: Invalid value: "": spec.infra.status.platformStatus.powervs.resourceGroup in body should match '^[a-zA-Z0-9-_ 

Version-Release number of selected component (if applicable):

4.14 and 4.13

How reproducible:

100%

Steps to Reproduce:

1. Deploy with openshift-installer to Power VS
2. Wait for masters to start deploying
3. Error will appear for the machine-config CO

Actual results:

MCO fails

Expected results:

MCO should come up

Additional info:

Fix has been identified

Description of problem:

In https://issues.redhat.com/browse/OCPBUGSM-46450, the VIP was added to noProxy for StackCloud but it should also be added for all national clouds.

Version-Release number of selected component (if applicable):

4.10.20

How reproducible:

always

Steps to Reproduce:

1. Set up a proxy
2. Deploy a cluster in a national cloud using the proxy
3.

Actual results:

Installation fails

Expected results:

 

Additional info:

The inconsistency was discovered when testing the cluster-network-operator changes https://issues.redhat.com/browse/OCPBUGS-5559

Description of problem:

When attempting to load ISO to the remote server, the InsertMedia request fails with `Base.1.5.PropertyMissing`. The system is Mt.Jade Server / GIGABYTE G242-P36. BMC is provided by Megarac.

Version-Release number of selected component (if applicable):

OCP 4.12

How reproducible:

Always

Steps to Reproduce:

1. Create a BMH against such server
2. Create InfraEnv and attempt provisioning

Actual results:

Image provisioning failed: Deploy step deploy.deploy failed with BadRequestError: HTTP POST https://192.168.53.149/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia returned code 400. Base.1.5.PropertyMissing: The property TransferProtocolType is a required property and must be included in the request. Extended information: [{'@odata.type': '#Message.v1_0_8.Message', 'Message': 'The property TransferProtocolType is a required property and must be included in the request.', 'MessageArgs': ['TransferProtocolType'], 'MessageId': 'Base.1.5.PropertyMissing', 'RelatedProperties': ['#/TransferProtocolType'], 'Resolution': 'Ensure that the property is in the request body and has a valid value and resubmit the request if the operation failed.', 'Severity': 'Warning'}].

Expected results:

Image provisioning to work

Additional info:

The following patch attempted to fix the problem: https://opendev.org/openstack/sushy/commit/ecf1bcc80bd14a1836d015c3dbdb4fd88f2bbd75

but the response code checked by the logic in the patch above is `Base.1.5.ActionParameterMissing`, which doesn't quite address the response code I'm getting, which is `Base.1.5.PropertyMissing`.
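
To illustrate the gap, here is a sketch of a check that accepts either message ID. It is written in Go for illustration only (sushy itself is Python), and it assumes the BMC returns the standard Redfish error body with an `@Message.ExtendedInfo` array; the missingParam helper is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// redfishError models only the parts of a Redfish error body used here.
// The JSON layout is an assumption based on the standard Redfish format.
type redfishError struct {
	Error struct {
		ExtendedInfo []struct {
			MessageID   string   `json:"MessageId"`
			MessageArgs []string `json:"MessageArgs"`
		} `json:"@Message.ExtendedInfo"`
	} `json:"error"`
}

// missingParam returns the name of the parameter the BMC reports as missing,
// whether it used ActionParameterMissing or PropertyMissing to say so.
func missingParam(body []byte) (string, bool) {
	var e redfishError
	if err := json.Unmarshal(body, &e); err != nil {
		return "", false
	}
	for _, m := range e.Error.ExtendedInfo {
		missing := strings.HasSuffix(m.MessageID, ".ActionParameterMissing") ||
			strings.HasSuffix(m.MessageID, ".PropertyMissing")
		if missing && len(m.MessageArgs) > 0 {
			return m.MessageArgs[0], true
		}
	}
	return "", false
}

func main() {
	body := []byte(`{"error":{"@Message.ExtendedInfo":[{"MessageId":"Base.1.5.PropertyMissing","MessageArgs":["TransferProtocolType"]}]}}`)
	name, ok := missingParam(body)
	fmt.Println(name, ok)
}
```

A caller could use the returned name to decide whether to retry InsertMedia with TransferProtocolType included.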

 

 

 

Description of problem:

Create Serverless Function form breaks if Pipeline Operator is not installed

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. Set up Serverless; the Pipeline Operator should not be installed
2. Navigate to Create Serverless Function form
3. Enter Git URL https://github.com/vikram-raj/hello-func-node 

Actual results:

Page breaks

Expected results:

Page should not break and the user can create a Serverless Function

Additional info:

 

Description of problem:

`create a project` link is enabled for users who do not have permission to create a project. This issue surfaces itself in the developer sandbox.

Version-Release number of selected component (if applicable):

4.11.5

How reproducible:

 

Steps to Reproduce:

1. log into dev sandbox, or a cluster where the user does not have permission to create a project
2. go directly to URL /topology/all-namespaces

Actual results:

`create a project` link is enabled. Upon clicking the link and submitting the form, the project fails to be created, as expected.

Expected results:

`create a project` link should only be available to users with the correct permissions.

Additional info:

The project list pages are not directly available to the user in the UI through the project selector. The user must go directly to the URL.

It's possible to encounter this situation when a user logs in with multiple accounts and returns to a previous url.

 

I think something is wrong with the alerts refactor, or perhaps my sync to 4.12.

Failed: suite=[openshift-tests], [sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]

Passed 1 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success

We're not getting the passes - from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.12-micro-release-openshift-release-analysis-aggregator/1592021681235300352, the successful runs don't show any record of the test at all. We need to record successes and failures for aggregation to work right.

Description of problem:

The function desiredIngressClass doesn't specify ingressClass.spec.parameters.scope, while the ingressClass API object specifies "Cluster" by default.

This causes unneeded updates to all IngressClasses when the CIO starts. The CIO will fight with the API default any time an update triggers a change in an IngressClass.

Reference: https://github.com/kubernetes/api/blob/master/networking/v1/types.go#L640 
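
The update loop can be sketched with a reduced model. ParametersRef below is a hypothetical stand-in for the parameters reference in k8s.io/api/networking/v1, and withDefaults only imitates the API defaulting of scope to "Cluster"; the operator's real comparison logic may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// ParametersRef is a simplified, hypothetical stand-in for the IngressClass
// parameters reference.
type ParametersRef struct {
	Kind  string
	Name  string
	Scope *string
}

func strPtr(s string) *string { return &s }

// withDefaults imitates API defaulting: an unset scope becomes "Cluster".
func withDefaults(p ParametersRef) ParametersRef {
	if p.Scope == nil {
		p.Scope = strPtr("Cluster")
	}
	return p
}

// needsUpdate reports whether the stored object differs from the desired one.
func needsUpdate(current, desired ParametersRef) bool {
	return !reflect.DeepEqual(current, desired)
}

func main() {
	desired := ParametersRef{Kind: "IngressController", Name: "default"}
	stored := withDefaults(desired)

	// Scope omitted from desired: the stored object always looks different.
	fmt.Println("scope omitted, needs update:", needsUpdate(stored, desired))

	// Scope set explicitly: desired and stored agree, so no update is issued.
	desired.Scope = strPtr("Cluster")
	fmt.Println("scope set, needs update:", needsUpdate(stored, desired))
}
```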

Version-Release number of selected component (if applicable):

4.8+

How reproducible:

 

Steps to Reproduce:

We really need https://issues.redhat.com/browse/OCPBUGS-6700 to be fixed before we can identify these spurious updates. But when it is fixed:

# Delete CIO
oc delete pod -n openshift-ingress-operator  $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}')

# Wait a minute for it to start back up
# Should be NO updates to IngressClasses
oc logs -n openshift-ingress-operator $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}') -c ingress-operator | grep "updated IngressClass"

# Instead, we see this every time CIO starts up
2023-01-26T20:57:15.281Z    INFO    operator.ingressclass_controller    ingressclass/ingressclass.go:63    updated IngressClass    {"name": "openshift-default",  

Actual results:

2023-01-26T20:57:15.281Z    INFO    operator.ingressclass_controller    ingressclass/ingressclass.go:63    updated IngressClass    {"name": "openshift-default", ...

Expected results:

No update to ingress upon CIO restart

Additional info:

 

We want to label hosts in the controller as quickly as possible.

The labeling code needs to run in parallel with all the other goroutines and set the label as soon as the host is part of the cluster.

While matching k8s and AI host objects we need to ensure that even if hostname differs we will still match the host by ip, same as in other parts.
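
A sketch of that matching order, hostname first and IP as a fallback, is shown here. The Node and InventoryHost types and the findNode helper are hypothetical simplifications, not the controller's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// Node and InventoryHost are hypothetical, reduced views of the k8s node
// and the AI host objects.
type Node struct {
	Name string
	IPs  []string
}

type InventoryHost struct {
	Hostname string
	IPs      []string
}

// findNode matches by hostname first and falls back to any shared IP.
func findNode(h InventoryHost, nodes []Node) (Node, bool) {
	for _, n := range nodes {
		if strings.EqualFold(n.Name, h.Hostname) {
			return n, true
		}
	}
	hostIPs := make(map[string]bool, len(h.IPs))
	for _, ip := range h.IPs {
		hostIPs[ip] = true
	}
	for _, n := range nodes {
		for _, ip := range n.IPs {
			if hostIPs[ip] {
				return n, true
			}
		}
	}
	return Node{}, false
}

func main() {
	nodes := []Node{{Name: "master-0.example.com", IPs: []string{"192.168.122.116"}}}
	host := InventoryHost{Hostname: "localhost", IPs: []string{"192.168.122.116"}}
	n, ok := findNode(host, nodes)
	fmt.Println(n.Name, ok)
}
```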

This is a clone of issue OCPBUGS-8349. The following is the description of the original issue:

Description of problem:

On a freshly installed cluster, the control-plane-machineset-operator begins rolling a new master node, but the machine remains in a Provisioned state and never joins as a node.

Its status is:
Drain operation currently blocked by: [{Name:EtcdQuorumOperator Owner:clusteroperator/etcd}]

The cluster is left in this state until an admin manually removes the stuck master node, at which point a new master machine is provisioned and successfully joins the cluster.

Version-Release number of selected component (if applicable):

4.12.4

How reproducible:

Observed at least 4 times over the last week, but unsure on how to reproduce.

Actual results:

A master node remains in a stuck Provisioned state and requires manual deletion to unstick the control plane machine set process.

Expected results:

No manual interaction should be necessary.

Additional info:

 

Description of problem:

cluster-version-operator pod crashloop during the bootstrap process might be leading to a longer bootstrap process causing the installer to timeout and fail.

The cluster-version-operator pod is continuously restarting due to a go panic. The bootstrap process fails due to the timeout although it completes the process correctly after more time, once the cluster-version-operator pod runs correctly.

$ oc -n openshift-cluster-version logs -p cluster-version-operator-754498df8b-5gll8
I0919 10:25:05.790124       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4                                                                                                                    
F0919 10:25:05.791580       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused                                                        
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc00017d5e0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc000089140, 0x1, ...})                                                                                                                   
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc00012ac80?, {0x1b96f60?, 0x6?, 0x6?})
        /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc00012ac80, {0xc0002fea20, 0x6, 0x6})
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
        /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-18-234318

How reproducible:

Most of the time, with any network type and installation type (IPI, UPI and proxy).

Steps to Reproduce:

1. Install OCP 4.12 IPI
   $ openshift-install create cluster
2. Wait until bootstrap is completed

Actual results:

[...]
level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
NAMESPACE                                          NAME                                                         READY   STATUS             RESTARTS        AGE 
openshift-cluster-version                          cluster-version-operator-754498df8b-5gll8                    0/1     CrashLoopBackOff   7 (3m21s ago)   24m 
openshift-image-registry                           image-registry-94fd8b75c-djbxb                               0/1     Pending            0               6m44s 
openshift-image-registry                           image-registry-94fd8b75c-ft66c                               0/1     Pending            0               6m44s 
openshift-ingress                                  router-default-64fbb749b4-cmqgw                              0/1     Pending            0               13m   
openshift-ingress                                  router-default-64fbb749b4-mhtqx                              0/1     Pending            0               13m   
openshift-monitoring                               prometheus-operator-admission-webhook-6d8cb95cf7-6jn5q       0/1     Pending            0               14m 
openshift-monitoring                               prometheus-operator-admission-webhook-6d8cb95cf7-r6nnk       0/1     Pending            0               14m 
openshift-network-diagnostics                      network-check-source-8758bd6fc-vzf5k                         0/1     Pending            0               18m 
openshift-operator-lifecycle-manager               collect-profiles-27726375-hlq89                              0/1     Pending            0               21m 
$ oc -n openshift-cluster-version describe pod cluster-version-operator-754498df8b-5gll8
Name:                 cluster-version-operator-754498df8b-5gll8
Namespace:            openshift-cluster-version                                                            
Priority:             2000000000              
Priority Class Name:  system-cluster-critical                                                       
Node:                 ostest-4gtwr-master-1/10.196.0.68
Start Time:           Mon, 19 Sep 2022 10:17:41 +0000                       
Labels:               k8s-app=cluster-version-operator
                      pod-template-hash=754498df8b
Annotations:          openshift.io/scc: hostaccess 
Status:               Running                      
IP:                   10.196.0.68
IPs:                 
  IP:           10.196.0.68
Controlled By:  ReplicaSet/cluster-version-operator-754498df8b
Containers:        
  cluster-version-operator:
    Container ID:  cri-o://1e2879600c89baabaca68c1d4d0a563d4b664c507f0617988cbf9ea7437f0b27
    Image:         registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69                                                                                                             
    Image ID:      registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
    Port:          <none>                                                                                                                                                                                                                    
    Host Port:     <none>                                                                                                                                                                                                                    
    Args:                                                     
      start                                                                                                                                                                                                                                  
      --release-image=registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69                                                                                                          
      --enable-auto-update=false                                                                                                                                                                                                             
      --listen=0.0.0.0:9099                                                  
      --serving-cert-file=/etc/tls/serving-cert/tls.crt
      --serving-key-file=/etc/tls/serving-cert/tls.key                                                                                                                                                                                       
      --v=2             
    State:       Waiting 
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   I0919 10:33:07.798614       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4
F0919 10:33:07.800115       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc000433ea0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc0002d6630, 0x1, ...})
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc0003b4f00?, {0x1b96f60?, 0x6?, 0x6?})
  /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc0003b4f00, {0xc000311980, 0x6, 0x6})
  /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
  /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
  /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
  /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
      Exit Code:    255
      Started:      Mon, 19 Sep 2022 10:33:07 +0000
      Finished:     Mon, 19 Sep 2022 10:33:07 +0000
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  127.0.0.1
      NODE_NAME:                 (v1:spec.nodeName)
      CLUSTER_PROFILE:          self-managed-high-availability
    Mounts:
      /etc/cvo/updatepayloads from etc-cvo-updatepayloads (ro)
      /etc/ssl/certs from etc-ssl-certs (ro)
      /etc/tls/service-ca from service-ca (ro)
      /etc/tls/serving-cert from serving-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  etc-ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs
    HostPathType:
  etc-cvo-updatepayloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-version-operator-serving-cert
    Optional:    false
  service-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      openshift-service-ca.crt
    Optional:  false
  kube-api-access:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3600
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  25m                   default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  21m                   default-scheduler  0/2 nodes are available: 2 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
  Normal   Scheduled         19m                   default-scheduler  Successfully assigned openshift-cluster-version/cluster-version-operator-754498df8b-5gll8 to ostest-4gtwr-master-1 by ostest-4gtwr-bootstrap
  Warning  FailedMount       17m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[service-ca kube-api-access etc-ssl-certs etc-cvo-updatepayloads serving-cert]: timed out waiting for the condition
  Warning  FailedMount       17m (x9 over 19m)     kubelet            MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
  Normal   Pulling           15m                   kubelet            Pulling image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69"
  Normal   Pulled            15m                   kubelet            Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" in 7.481824271s
  Normal   Started           14m (x3 over 15m)     kubelet            Started container cluster-version-operator
  Normal   Created           14m (x4 over 15m)     kubelet            Created container cluster-version-operator
  Normal   Pulled            14m (x3 over 15m)     kubelet            Container image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" already present on machine
  Warning  BackOff           4m22s (x52 over 15m)  kubelet            Back-off restarting failed container
  
  

Expected results:

No panic?

Additional info:

Seen in most of OCP on OSP QE CI jobs.

Attached must-gather-install.tar.gz

Description of problem:

$ RELEASE_VERSION=4.10.37
$ RELEASE_FOLDER=/tmp/release
$ oc image extract quay.io/openshift-release-dev/ocp-release:${RELEASE_VERSION}-x86_64 --path /:${RELEASE_FOLDER} --confirm
error: unable to extract layer sha256:213de71dc0c6c48e5d312a10913b966b471faa62153ba2bfcaaf5133101245f5 from quay.io/openshift-release-dev/ocp-release:4.10.37-x86_64: platform and architecture is not supported

Version-Release number of selected component (if applicable):

oc 4.11.13. Likely other versions too.

How reproducible:

100%

Steps to Reproduce:

1. Get a Darwin box.
2. Run the earlier commands.

Actual results:

platform and architecture is not supported

Expected results:

Successful extraction.

Additional info:

  • xattr extraction uses Lsetxattr.
  • On unsupported systems (apparently including Darwin), that can return ErrNotSupportedPlatform, which is the platform and architecture is not supported string.
  • A later LUtimesNano call has an explicit err != system.ErrNotSupportedPlatform guard. We probably want a similar guard around Lsetxattr.
  • And to make debugging easier, we probably want more error-wrapping so it's easier to see that it was xattr that were the issue.

Pinning down the xattr-ness, this issue also crops up with the 4.11.13 oc attempting to extract 4.11.9, but not when extracting 4.11.8. Checking that third layer:

$ oc image info -o json quay.io/openshift-release-dev/ocp-release:4.11.9-x86_64 | jq -c '.layers[]'
{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","size":79524639,"digest":"sha256:97da74cc6d8fa5d1634eb1760fd1da5c6048619c264c23e62d75f3bf6b8ef5c4"}
{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","size":1438,"digest":"sha256:d8190195889efb5333eeec18af9b6c82313edd4db62989bd3a357caca4f13f0e"}
{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","size":7273592,"digest":"sha256:a8765b0d2a13463f645e86802a0db82527462d13010d93ed87f01355e1dd3c56"}
{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","size":11913477,"digest":"sha256:371af7f94164b983de702832af58d083abd01f9ffc3b8255e33a5744fb6762b6"}
{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","size":25451521,"digest":"sha256:e834da016fb2cc8f360260f04309273aa5cd530944a3b24280874dbf128e64e9"}
{"mediaType":"application/vnd.docker.image.rootfs.diff.tar.gzip","size":877877,"digest":"sha256:ad7878eb40a1ff52b2be26dcdc9bca1850fdfc71cdea20bf29aef1d4931fdc1b"}
$ mkdir /tmp/release
$ oc --v=8 image extract 'quay.io/openshift-release-dev/ocp-release:4.11.9-x86_64[2]' --path /:/tmp/release
...
I1110 22:17:06.995981    9432 round_trippers.go:463] GET https://cdn02.quay.io/sha256/a8/a8765b0d2a13463f645e86802a0db82527462d13010d93ed87f01355e1dd3c56?...
...
$ curl -s 'https://cdn02.quay.io/sha256/a8/a8765b0d2a13463f645e86802a0db82527462d13010d93ed87f01355e1dd3c56?...' >layer.tar.gz
$ sha256sum layer.tar.gz 
a8765b0d2a13463f645e86802a0db82527462d13010d93ed87f01355e1dd3c56  layer.tar.gz
$ cat main.go 
package main

import (
        "archive/tar"
        "fmt"
        "io"
        "log"
        "os"
)

func main() {
        tr := tar.NewReader(os.Stdin)
        for {
                hdr, err := tr.Next()
                if err == io.EOF {
                        break // End of archive
                }
                if err != nil {
                        log.Fatal(err)
                }
                if len(hdr.Xattrs) > 0 {
                        fmt.Printf("%s: %v\n", hdr.Name, hdr.Xattrs)
                }
        }
}
$ zcat layer.tar.gz | go run main.go
etc/: map[user.overlay.impure:y]
etc/dnf/: map[user.overlay.impure:y]
root/: map[user.overlay.impure:y]
root/buildinfo/: map[user.overlay.impure:y]
usr/: map[user.overlay.impure:y]
usr/share/: map[user.overlay.impure:y]
usr/share/licenses/: map[user.overlay.impure:y]
usr/share/zoneinfo/: map[user.overlay.impure:y]
usr/share/zoneinfo/America/: map[user.overlay.impure:y]
usr/share/zoneinfo/posix/: map[user.overlay.impure:y]
usr/share/zoneinfo/posix/America/: map[user.overlay.impure:y]
usr/share/zoneinfo/right/: map[user.overlay.impure:y]
usr/share/zoneinfo/right/America/: map[user.overlay.impure:y]
var/: map[user.overlay.impure:y]
var/cache/dnf/rhel-8-appstream-rpms-aarch64-f4aa537d908d9fc4/repodata/cf0ed8b0-5d9d-4ce4-a514-d0919c8f9d90: map[user.Librepo.checksum.mtime:1665052295 user.Librepo.checksum.sha256:f31afeb2083e829831737e1976118486f181824396bb1dfa345ce837cfcd01db]
var/cache/dnf/rhel-8-appstream-rpms-ppc64le-e6d00b111ed689e4/repodata/8805d47b-954c-4b0c-bd87-9cd232cae31e: map[user.Librepo.checksum.mtime:1665052278 user.Librepo.checksum.sha256:c207283703ed0689b7df803d2c7567d4f04bebc1056175403d6de005c7afa831]
var/cache/dnf/rhel-8-appstream-rpms-s390x-440fe0b9951ab5ff/repodata/757936f1-08fa-4581-921a-3533f5f1ba22: map[user.Librepo.checksum.mtime:1665052286 user.Librepo.checksum.sha256:29e73b7588342db7b5cc8734d3261d922569bbe123492d9cb5e71ef41a7f5199]
var/cache/dnf/rhel-8-appstream-rpms-x86_64-2eb21a16222429e8/repodata/c281b795-db8a-40e4-bb51-f5823d0a0d4e: map[user.Librepo.checksum.mtime:1665052267 user.Librepo.checksum.sha256:4cdbec83cf275d0dff9ed9c139759937d51cc7be757f3703c9e41f3544489920]
var/cache/dnf/rhel-8-baseos-rpms-aarch64-cab14abc0aecab14/repodata/a9473675-b6c9-457e-8af7-d9991b7d671c: map[user.Librepo.checksum.mtime:1665052335 user.Librepo.checksum.sha256:f31afeb2083e829831737e1976118486f181824396bb1dfa345ce837cfcd01db]
var/cache/dnf/rhel-8-baseos-rpms-ppc64le-a9d8a6aec787f57a/repodata/3100b901-7431-460f-8bc8-1407f9939106: map[user.Librepo.checksum.mtime:1665052315 user.Librepo.checksum.sha256:c207283703ed0689b7df803d2c7567d4f04bebc1056175403d6de005c7afa831]
var/cache/dnf/rhel-8-baseos-rpms-s390x-0afe0eabb7bfab81/repodata/e32dfd89-51f1-418b-a2c7-da8f2a144e9f: map[user.Librepo.checksum.mtime:1665052326 user.Librepo.checksum.sha256:29e73b7588342db7b5cc8734d3261d922569bbe123492d9cb5e71ef41a7f5199]
var/cache/dnf/rhel-8-baseos-rpms-x86_64-fc3934b38ea47b54/repodata/86fa7d79-ca60-420a-a4a2-7899b9b9a9e9: map[user.Librepo.checksum.mtime:1665052304 user.Librepo.checksum.sha256:4cdbec83cf275d0dff9ed9c139759937d51cc7be757f3703c9e41f3544489920]
var/lib/: map[user.overlay.impure:y]
var/lib/rhsm/: map[user.overlay.impure:y]
var/lib/rpm/: map[user.overlay.impure:y]
var/log/: map[user.overlay.impure:y]

But looking at the third layer in 4.11.8, there were no xattr.

Description of problem:

Business Automation operands fail to load in the uninstall operator modal, which shows the alert message "Cannot load Operands. There was an error loading operands for this operator. Operands will need to be deleted manually...".

"Delete all operand instances for this operator__checkbox" is not shown so the test fails. 

https://search.ci.openshift.org/?search=Testing+uninstall+of+Business+Automation+Operator&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-8258. The following is the description of the original issue:

Invoking 'create cluster-manifests' fails when imageContentSources is missing in install-config yaml:

$ openshift-install agent create cluster-manifests
INFO Consuming Install Config from target directory
FATAL failed to write asset (Mirror Registries Config) to disk: failed to write file: open .: is a directory

install-config.yaml:

apiVersion: v1alpha1
metadata:
  name: appliance
rendezvousIP: 192.168.122.116
hosts:
  - hostname: sno
    installerArgs: '["--save-partlabel", "agent*", "--save-partlabel", "rhcos-*"]'
    interfaces:
     - name: enp1s0
       macAddress: 52:54:00:e7:05:72
    networkConfig:
      interfaces:
        - name: enp1s0
          type: ethernet
          state: up
          mac-address: 52:54:00:e7:05:72
          ipv4:
            enabled: true
            dhcp: true 
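
The message "open .: is a directory" shows that the write targeted a directory path rather than a file. One plausible reading, stated here as an assumption and not confirmed by the report, is that an asset with nothing to write ended up with an empty file name. The following is a minimal Python sketch of that failure mode and one possible guard. It is not the installer's code (the installer is written in Go); `write_asset_files`, `demo`, and the `example/registries.conf` name are hypothetical and used for illustration only.

```python
import os
import tempfile


def write_asset_files(directory, files):
    """Write each (relative_name, data) pair under directory.

    Entries with an empty name are skipped: joining a directory with an
    empty name yields the directory itself, and opening a directory for
    writing raises an OSError.
    """
    written = []
    for name, data in files:
        if not name:
            # Hypothetical guard: an asset with no file has nothing to write.
            continue
        path = os.path.join(directory, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as handle:
            handle.write(data)
        written.append(path)
    return written


def demo():
    with tempfile.TemporaryDirectory() as target:
        # Without the guard, an empty name points at the directory itself.
        try:
            with open(os.path.join(target, ""), "w"):
                pass
            unguarded_failed = False
        except OSError:
            unguarded_failed = True
        written = write_asset_files(
            target, [("", ""), ("example/registries.conf", "# illustrative\n")]
        )
        return unguarded_failed, [os.path.relpath(p, target) for p in written]
```

Calling `demo()` exercises both paths: the unguarded open of an empty name fails, while the guarded writer skips that entry and writes only the named file.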

(originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=1983200)

test:
[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial]

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-etcd%5C%5D%5C%5BFeature%3ADisasterRecovery%5C%5D%5C%5BDisruptive%5C%5D+%5C%5BFeature%3AEtcdRecovery%5C%5D+Cluster+should+restore+itself+after+quorum+loss+%5C%5BSerial%5C%5D

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1413625606435770368
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1415075413717159936

Some brief triage from Thomas Jungblut on:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.11/1568747321334697984

It seems the last guard pod doesn't come up. The etcd operator installs it properly, and the revision installer does not report any errors; it just doesn't progress to the latest revision. At first glance this doesn't look like an issue with etcd itself, but it needs a closer look.

Description of problem:

During a cluster destroy on AWS, the installer does a get-all-roles and deletes roles based on a tag. If a customer is using AWS SEA, which denies any role from doing a get-all-roles in the AWS account, the installer fails.

Instead of erroring out, the installer should gracefully handle being denied get-all-roles and move on, so that a denying SCP does not get in the way of a successful cluster destroy on AWS.
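
The requested behavior can be sketched as follows. This is a minimal Python illustration, not the installer's implementation (the installer is written in Go and calls the AWS API directly). The `AccessDenied` class, `find_owned_roles`, the `get_tags` callable, and the example ARNs are all hypothetical stand-ins. The sketch also assumes that a role the installer is not permitted to read is not one it needs to delete, which is an assumption of this example and not a statement from the report.

```python
class AccessDenied(Exception):
    """Stand-in for a provider 'AccessDenied' (HTTP 403) error."""


def find_owned_roles(role_arns, get_tags, owned_tag):
    """Return (owned, skipped) lists of role ARNs.

    get_tags is any callable mapping a role ARN to a dict of tags; it may
    raise AccessDenied. Denied roles are recorded and skipped instead of
    aborting the whole search.
    """
    key, value = owned_tag
    owned, skipped = [], []
    for arn in role_arns:
        try:
            tags = get_tags(arn)
        except AccessDenied:
            # Tolerate the denial: remember the role and keep going.
            skipped.append(arn)
            continue
        if tags.get(key) == value:
            owned.append(arn)
    return owned, skipped


def demo():
    tags_by_arn = {
        "arn:aws:iam::111111111111:role/example-cluster-worker": {
            "kubernetes.io/cluster/example": "owned"
        },
        "arn:aws:iam::111111111111:role/unrelated": {},
    }
    denied = "arn:aws:iam::111111111111:role/protected-by-scp"

    def fake_get_tags(arn):
        # Simulates an SCP that explicitly denies reading one role.
        if arn == denied:
            raise AccessDenied(arn)
        return tags_by_arn[arn]

    return find_owned_roles(
        list(tags_by_arn) + [denied],
        fake_get_tags,
        ("kubernetes.io/cluster/example", "owned"),
    )
```

Returning the skipped roles separately lets the caller log them at a low severity while the destroy continues with the roles it could inspect.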

Version-Release number of selected component (if applicable):

[ec2-user@ip-172-16-32-144 ~]$ rosa version
1.2.6

How reproducible:

1. Deploy ROSA STS, private with PrivateLink with AWS SEA
2. rosa delete cluster --debug
3. Watch the debug logs of the installer to see it try to get-all-roles
4. The installer fails when the SCP from AWS SEA denies the get-all-roles task

Steps to Reproduce:

Steps listed above.

Actual results:

time="2022-09-01T00:10:40Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=zp56pxql
time="2022-09-01T00:10:40Z" level=error msg="error provisioning cluster" error="exit status 4" installID=zp56pxql
time="2022-09-01T00:10:40Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=zp56pxql


time="2022-09-01T00:12:47Z" level=info msg="copied /installconfig/install-config.yaml to /output/install-config.yaml" installID=55h2cvl5
time="2022-09-01T00:12:47Z" level=info msg="cleaning up resources from previous provision attempt" installID=55h2cvl5
time="2022-09-01T00:12:47Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:48Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:48Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:12:49Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6b4b5144-2f4e-4fde-ba1a-04ed239b84c2" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6152e9c2-9c1c-478b-a5e3-11ff2508684e" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8636f0ff-e984-4f02-870e-52170ab4e7bb" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2385a980-dc9b-480f-955a-62ac1aaa6718" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 02ccef62-14e7-4310-b254-a0731995bd45" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: eca2081d-abd7-4c9b-b531-27ca8758f933" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6bda17e9-83e5-4688-86a0-2f84c77db759" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 827afa4a-8bb9-4e1e-af69-d5e8d125003a" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8dcd0480-6f9e-49cb-a0dd-0c5f76107696" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5095aed7-45de-4ca0-8c41-9db9e78ca5a6" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 04f7d0e0-4139-4f74-8f67-8d8a8a41d6b9" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 115f9514-b78b-42d1-b008-dc3181b61d33" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 68da4d93-a93e-410a-b3af-961122fe8df0" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 012221ea-2121-4b04-91f2-26c31c8458b1" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e6c9328d-a4b9-4e69-8194-a68ed7af6c73" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 214ca7fb-d153-4d0d-9f9c-21b073c5bd35" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: 63b54e82-e2f6-48d4-bd0f-d2663bbc58bf" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: d24982b6-df65-4ba2-a3c0-5ac8d23947e1" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: e2c5737a-5014-4eb5-9150-1dd1939137c0" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7793fa7c-4c8d-4f9f-8f23-d393b85be97c" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: bef2c5ab-ef59-4be6-bf1a-2d89fddb90f1" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: ff04eb1b-9cf6-4fff-a503-d9292ff17ccd" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: 85e05de8-ba16-4366-bc86-721da651d770" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for IAM users" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:57Z" level=info msg=Disassociated id=i-03d7570547d32071d installID=55h2cvl5 name=rosa-mv9dx3-xls7g-master-profile role=ROSA-ControlPlane-Role
time="2022-09-01T00:12:57Z" level=info msg=Deleted InstanceProfileName=rosa-mv9dx3-xls7g-master-profile arn="arn:aws:iam::646284873784:instance-profile/rosa-mv9dx3-xls7g-master-profile" id=i-03d7570547d32071d installID=55h2cvl5
time="2022-09-01T00:12:57Z" level=debug msg=Terminating id=i-03d7570547d32071d installID=55h2cvl5
time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-08bee3857e5265ba4 installID=55h2cvl5
time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-00df6e7b34aa65c9b installID=55h2cvl5
time="2022-09-01T00:13:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:49Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-sint/2e99b98b94304d80 installID=55h2cvl5
time="2022-09-01T00:17:49Z" level=info msg=Deleted id=eni-0e4ee5cf8f9a8fdd2 installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="Revoked ingress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="Revoked egress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="DependencyViolation: resource sg-03265ad2fae661b8c has a dependent object\n\tstatus code: 400, request id: f7c35709-a23d-49fd-ac6a-f092661f6966" arn="arn:aws:ec2:ca-central-1:646284873784:security-group/sg-03265ad2fae661b8c" installID=55h2cvl5
time="2022-09-01T00:17:51Z" level=info msg=Deleted id=eni-0e592a2768c157360 installID=55h2cvl5
time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"rosa-mv9dx3.0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=info msg=Deleted id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=debug msg="Revoked ingress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=debug msg="Revoked egress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-aint/635162452c08e059 installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=eni-049f0174866d87270 installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="no deletions from us-east-1, removing client" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 06b804ae-160c-4fa7-92de-fd69adc07db2" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2a5dd4ad-9c3e-40ee-b478-73c79671d744" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e61daee8-6d2c-4707-b4c9-c4fdd6b5091c" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1b743447-a778-4f9e-8b48-5923fd5c14ce" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da8c8a42-8e79-48e5-b548-c604cb10d6f4" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d7840e4-a1b4-4ea2-bb83-9ee55882de54" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7f2e04ed-8c49-42e4-b35e-563093a57e5b" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: cd2b4962-e610-4cc4-92bc-827fe7a49b48" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: be005a09-f62c-4894-8c82-70c375d379a9" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 541d92f4-33ce-4a50-93d8-dcfd2306eeb0" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6dd81743-94c4-479a-b945-ffb1af763007" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a269f47b-97bc-4609-b124-d1ef5d997a91" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 33c3c0a5-e5c9-4125-9400-aafb363c683c" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 32e87471-6d21-42a7-bfd8-d5323856f94d" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: b2cc6745-0217-44fe-a48b-44e56e889c9e" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 09f81582-6685-4dc9-99f0-ed33565ab4f4" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: cea9116c-2b54-4caa-9776-83559d27b8f8" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 430d7750-c538-42a5-84b5-52bc77ce2d56" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 279038e4-f3c9-4700-b590-9a90f9b8d3a2" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5e2f40ae-3dc7-4773-a5cd-40bf9aa36c03" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 92a27a7b-14f5-455b-aa39-3c995806b83e" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0da4f66c-c6b1-453c-a8c8-dc0399b24bb9" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: f2c94beb-a222-4bad-abe1-8de5786f5e59" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="search for IAM users" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=info msg=Deleted id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="no deletions from ca-central-1, removing client" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0e8e0bea-b512-469b-a996-8722a0f7fa25" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 288456a2-0cd5-46f1-a5d2-6b4006a5dc0e" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 321df940-70fc-45e7-8c56-59fe5b89e84f" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 45bebf36-8bf9-4c78-a80f-c6a5e98b2187" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: eea00ae2-1a72-43f9-9459-a1c003194137" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0ef5a102-b764-4e17-999f-d820ebc1ec12" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 107d0ccf-94e7-41c4-96cd-450b66a84101" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da9bd868-8384-4072-9fb4-e6a66e94d2a1" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 74fbf44c-d02d-4072-b038-fa456246b6a8" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 365116d6-1467-49c3-8f58-1bc005aa251f" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 20f91de5-cfeb-45e0-bb46-7b66d62cc749" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 924fa288-f1b9-49b8-b549-a930f6f771ce" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 4beb233d-40d6-4016-872a-8757af8f98ee" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 77951f62-e0b4-4a9b-a20c-ea40d6432e84" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 13ad38c8-89dc-461d-9763-870eec3a6ba1" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a8fe199d-12fb-4141-a944-c7c5516daf25" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: b487c62f-5ac5-4fa0-b835-f70838b1d178" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 97bfcb55-ae1f-4859-9c12-03de09607f79" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1094f6-714e-4042-9134-75f4c6d9d0df" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1db477-ee6a-4d03-8b57-52b335b2bbe6" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1fc32d09-588b-4d80-ad62-748f7fb55efd" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d906cc2-eaaa-439b-97e0-503615ce5d43" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: ee6a5647-20b1-4880-932b-bfd70b945077" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a424891e-48ab-4ad4-9150-9ef1076dcb9c" installID=55h2cvl5

The "not authorized" errors repeat probably 50+ times.

Expected results:

For these errors not to show up during install.

Additional info:

Again, this is only due to ROSA being installed in an AWS SEA environment - https://github.com/aws-samples/aws-secure-environment-accelerator.
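
To gauge how often the denial repeats, the log excerpt above can be summarized per IAM role. The following is a minimal sketch, not part of the installer: the function name and the abbreviated sample lines are assumptions made for illustration, and the pattern only targets the "get tags ... AccessDenied" message shape seen in this report.

```python
import re
from collections import Counter

# Pattern for the "get tags ... AccessDenied" messages seen in the log excerpt.
ROLE_RE = re.compile(
    r"get tags for arn:aws:iam::\d+:role/(?P<role>[^:\s]+): AccessDenied"
)


def summarize_access_denied(lines):
    """Count AccessDenied 'get tags' messages per IAM role name."""
    counts = Counter()
    for line in lines:
        match = ROLE_RE.search(line)
        if match:
            counts[match.group("role")] += 1
    return counts


# Abbreviated sample lines (the real messages are much longer).
sample = [
    'level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: (abbreviated)"',
    'level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: (abbreviated)"',
    'level=debug msg="search for IAM users"',
    'level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: (abbreviated)"',
]

for role, count in summarize_access_denied(sample).most_common():
    print(f"{count:3d}  {role}")
```

Run against a full installer log, a summary like this makes it easier to see whether the same set of roles is re-scanned on every pass.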

Description of problem:

There is a typo in WTO for Milliseconds: https://github.com/openshift/console/blob/master/frontend/packages/console-app/src/components/cloud-shell/setup/TimeoutSection.tsx#L13

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. For local dev setup, follow https://docs.google.com/document/d/173PFbaTMWHf8PAhXmWomAaSbxmVwO2Dpal_qBpqxLe8/edit#
2. Install the WTO operator
3. Click the terminal icon in the top-right header

Actual results:

The unit for Milliseconds is shown as "Miliseconds" (typo).

Expected results:

The unit for Milliseconds should be shown as "Milliseconds".

Additional info:

https://github.com/openshift/console/pull/12329#discussion_r1064751880

When we create an HCP, the Root CA in the HCP namespaces has the certificate and key named as

  • ca.key
  • ca.crt

But cert manager expects them to be named as

  • tls.key
  • tls.crt

Done criteria: The Root CA should have the certificate and key named as the cert manager expects.
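
The renaming described above can be pictured as a simple key mapping over the secret's data. This is only an illustrative sketch: the function and constant names are hypothetical, and it assumes the conventional Kubernetes TLS secret key names tls.crt and tls.key. It is not the HyperShift implementation.

```python
# Hypothetical mapping from the current Root CA key names to the
# conventional Kubernetes TLS secret key names (an assumption here).
KEY_RENAMES = {"ca.crt": "tls.crt", "ca.key": "tls.key"}


def rename_ca_keys(secret_data):
    """Return a copy of a secret's data with the CA entries renamed."""
    return {KEY_RENAMES.get(key, key): value for key, value in secret_data.items()}


# Placeholder values stand in for the real PEM-encoded contents.
root_ca = {"ca.crt": "<certificate PEM>", "ca.key": "<private key PEM>"}
print(sorted(rename_ca_keys(root_ca)))
```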

 

Using ZTP 4.10 and ACM 2.4

With a cluster already installed, if I try to add extra workers, ArgoCD cannot correctly sync the AgentClusterInstall object:

The newly generated AgentClusterInstall object tries to modify the number of nodes:

And this is immutable (at least on ACM 2.5):
"Failed sync attempt to e3c95c6341b5760889894459b014ea79873cf621: one or more objects failed to apply, reason: admission webhook "agentclusterinstallvalidators.admission.agentinstall.openshift.io" denied the request: Attempted to change AgentClusterInstall.Spec which is immutable after install started, except for ClusterMetadata fields. Unsupported change: ProvisionRequirements.WorkerAgents: (0 => 2)"

Description of the problem:
The Assisted Service GitHub operator doc needs updating in the mirror registry configuration section, because the oc client bundled with assisted service now supports ICSP:

Registries defined in the registries.conf file should use "mirror-by-digest-only = false" mode.

This should be set back to true to match normal ICSP behavior.

The mirror registry configuration changes the discovery image's ignition config, with ca-bundle.crt written out to /etc/pki/ca-trust/source/anchors/domain.crt and with registries.conf written out to /etc/containers/registries.conf. The configuration also changes the install-config.yaml file used to install a new cluster, with the contents of ca-bundle.crt added to additionalTrustBundle and with the registries defined in registries.conf added to imageContentSources as mirrors.

A note should be added to this section that the assisted service pod converts registries.conf into an ICSP file and then includes it with the --icsp flag in a few oc adm commands run against the release image (run from within the assisted service pod itself).

The section should also state that the ca-bundle.crt and registries.conf keys can be added individually or together.
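
As an illustration of the conversion mentioned above, the sketch below maps an already-parsed registries.conf (a dict with "registry" entries, each holding "location" and a list of "mirror" tables) to an ImageContentSourcePolicy-shaped dict. The function name, the policy name, and the sample registry hosts are assumptions made for this example; the assisted service performs its own conversion, and this is not that code.

```python
def registries_to_icsp(registries_conf, name="mirror-registries"):
    """Build an ImageContentSourcePolicy-shaped dict from parsed registries.conf data."""
    digest_mirrors = []
    for registry in registries_conf.get("registry", []):
        mirrors = [mirror["location"] for mirror in registry.get("mirror", [])]
        if not mirrors:
            # A registry without mirrors contributes nothing to the policy.
            continue
        digest_mirrors.append({"source": registry["location"], "mirrors": mirrors})
    return {
        "apiVersion": "operator.openshift.io/v1alpha1",
        "kind": "ImageContentSourcePolicy",
        "metadata": {"name": name},
        "spec": {"repositoryDigestMirrors": digest_mirrors},
    }


# Hypothetical parsed registries.conf content (hosts are placeholders).
parsed = {
    "registry": [
        {
            "location": "quay.io/example",
            "mirror-by-digest-only": True,
            "mirror": [{"location": "mirror.example.com:5000/example"}],
        },
        {"location": "registry.example.org/unmirrored"},
    ]
}

icsp = registries_to_icsp(parsed)
print(icsp["spec"]["repositoryDigestMirrors"])
```

Taking the parsed structure as input keeps the sketch independent of any particular TOML parser.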

Description of problem:

'Status' column sorting on the Console plugins page doesn't work

Version-Release number of selected component (if applicable):

cluster-bot cluster 

How reproducible:

Always

Steps to Reproduce:

1. Create some plugins that fall into the Failed, Pending and Loaded statuses
2. Check 'Status' column sorting on the Console plugins page /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins

Actual results:

'Status' sorting doesn't work

Expected results:

'Status' column sorting should work

Additional info:

 

Description of problem:

This bug was found while verifying OCPBUGS-2873. After upgrading from 4.12.0-0.nightly-2022-12-04-160656 to 4.13.0-0.nightly-2022-12-04-194803, the upgrade finished applying 4.13.0-0.nightly-2022-12-04-194803 at "2022-12-05T03:03:56Z" (UTC) and all targets were UP. However, since Dec 5, 2022, 20:47 (UTC), the TargetDown alert for kubelet has been firing; see http://pastebin.test.redhat.com/1084049. All kubelet 10250 targets are down with "server returned HTTP status 401 Unauthorized". The kubelet targets include:

10250/metrics/cadvisor
10250/metrics
10250/metrics/probes

Version-Release number of selected component (if applicable):

Upgrade from 4.12.0-0.nightly-2022-12-04-160656 to 4.13.0-0.nightly-2022-12-04-194803.
It affects only 4.13.

How reproducible:

Not sure whether this is a regression from https://github.com/openshift/cluster-monitoring-operator/pull/1827 or an issue related to the upgrade.

Steps to Reproduce:

1. Upgrade from 4.12.0-0.nightly-2022-12-04-160656 to 4.13.0-0.nightly-2022-12-04-194803 and check the status of all targets

Actual results:

All kubelet targets are down.

Expected results:

The kubelet targets should not be down.

Additional info:

This bug affects the admin UI, since some graphs use metrics exposed by kubelet. The kubelet ServiceMonitor is shown below:

# oc -n openshift-monitoring get servicemonitor kubelet -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-12-06T02:37:25Z"
  generation: 1
  labels:
    app.kubernetes.io/name: kubelet
    app.kubernetes.io/part-of: openshift-monitoring
    k8s-app: kubelet
  name: kubelet
  namespace: openshift-monitoring
  resourceVersion: "18888"
  uid: 85835270-7ceb-4db9-a51b-f645db0f7329
spec:
  endpoints:
  - bearerTokenSecret:
      key: ""
    honorLabels: true
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: kubelet_(pod_worker_latency_microseconds|pod_start_latency_microseconds|cgroup_manager_latency_microseconds|pod_worker_start_latency_microseconds|pleg_relist_latency_microseconds|pleg_relist_interval_microseconds|runtime_operations|runtime_operations_latency_microseconds|runtime_operations_errors|eviction_stats_age_microseconds|device_plugin_registration_count|device_plugin_alloc_latency_microseconds|network_plugin_operations_latency_microseconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)
      sourceLabels:
      - __name__
    - action: drop
      regex: apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs|longrunning_gauge|registered_watchers)
      sourceLabels:
      - __name__
    - action: drop
      regex: kubelet_docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)
      sourceLabels:
      - __name__
    - action: drop
      regex: reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|object_counts|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)
      sourceLabels:
      - __name__
    - action: drop
      regex: transformation_(transformation_latencies_microseconds|failures_total)
      sourceLabels:
      - __name__
    - action: drop
      regex: (admission_quota_controller_adds|admission_quota_controller_depth|admission_quota_controller_longest_running_processor_microseconds|admission_quota_controller_queue_latency|admission_quota_controller_unfinished_work_seconds|admission_quota_controller_work_duration|APIServiceOpenAPIAggregationControllerQueue1_adds|APIServiceOpenAPIAggregationControllerQueue1_depth|APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds|APIServiceOpenAPIAggregationControllerQueue1_queue_latency|APIServiceOpenAPIAggregationControllerQueue1_retries|APIServiceOpenAPIAggregationControllerQueue1_unfinished_work_seconds|APIServiceOpenAPIAggregationControllerQueue1_work_duration|APIServiceRegistrationController_adds|APIServiceRegistrationController_depth|APIServiceRegistrationController_longest_running_processor_microseconds|APIServiceRegistrationController_queue_latency|APIServiceRegistrationController_retries|APIServiceRegistrationController_unfinished_work_seconds|APIServiceRegistrationController_work_duration|autoregister_adds|autoregister_depth|autoregister_longest_running_processor_microseconds|autoregister_queue_latency|autoregister_retries|autoregister_unfinished_work_seconds|autoregister_work_duration|AvailableConditionController_adds|AvailableConditionController_depth|AvailableConditionController_longest_running_processor_microseconds|AvailableConditionController_queue_latency|AvailableConditionController_retries|AvailableConditionController_unfinished_work_seconds|AvailableConditionController_work_duration|crd_autoregistration_controller_adds|crd_autoregistration_controller_depth|crd_autoregistration_controller_longest_running_processor_microseconds|crd_autoregistration_controller_queue_latency|crd_autoregistration_controller_retries|crd_autoregistration_controller_unfinished_work_seconds|crd_autoregistration_controller_work_duration|crdEstablishing_adds|crdEstablishing_depth|crdEstablishing_longest_running_processor_microseconds|crdEstablishing_queue_latency|crdEstablishing_retries|crdEstablishing_unfinished_work_seconds|crdEstablishing_work_duration|crd_finalizer_adds|crd_finalizer_depth|crd_finalizer_longest_running_processor_microseconds|crd_finalizer_queue_latency|crd_finalizer_retries|crd_finalizer_unfinished_work_seconds|crd_finalizer_work_duration|crd_naming_condition_controller_adds|crd_naming_condition_controller_depth|crd_naming_condition_controller_longest_running_processor_microseconds|crd_naming_condition_controller_queue_latency|crd_naming_condition_controller_retries|crd_naming_condition_controller_unfinished_work_seconds|crd_naming_condition_controller_work_duration|crd_openapi_controller_adds|crd_openapi_controller_depth|crd_openapi_controller_longest_running_processor_microseconds|crd_openapi_controller_queue_latency|crd_openapi_controller_retries|crd_openapi_controller_unfinished_work_seconds|crd_openapi_controller_work_duration|DiscoveryController_adds|DiscoveryController_depth|DiscoveryController_longest_running_processor_microseconds|DiscoveryController_queue_latency|DiscoveryController_retries|DiscoveryController_unfinished_work_seconds|DiscoveryController_work_duration|kubeproxy_sync_proxy_rules_latency_microseconds|non_structural_schema_condition_controller_adds|non_structural_schema_condition_controller_depth|non_structural_schema_condition_controller_longest_running_processor_microseconds|non_structural_schema_condition_controller_queue_latency|non_structural_schema_condition_controller_retries|non_structural_schema_condition_controller_unfinished_work_seconds|non_structural_schema_condition_controller_work_duration|rest_client_request_latency_seconds|storage_operation_errors_total|storage_operation_status_count)
      sourceLabels:
      - __name__
    port: https-metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    scrapeTimeout: 30s
    tlsConfig:
      ca: {}
      caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
      cert: {}
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  - bearerTokenSecret:
      key: ""
    honorLabels: true
    honorTimestamps: false
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
      sourceLabels:
      - __name__
    - action: drop
      regex: (container_spec_.*|container_file_descriptors|container_sockets|container_threads_max|container_threads|container_start_time_seconds|container_last_seen);;
      sourceLabels:
      - __name__
      - pod
      - namespace
    - action: drop
      regex: (container_blkio_device_usage_total);.+
      sourceLabels:
      - __name__
      - container
    - action: drop
      regex: container_memory_failures_total
      sourceLabels:
      - __name__
    - action: replace
      regex: container_fs_usage_bytes
      replacement: "true"
      sourceLabels:
      - __name__
      targetLabel: __tmp_keep_metric
    - action: drop
      regex: ;(container_fs_.*);.+
      sourceLabels:
      - __tmp_keep_metric
      - __name__
      - container
    - action: labeldrop
      regex: __tmp_keep_metric
    path: /metrics/cadvisor
    port: https-metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    scrapeTimeout: 30s
    tlsConfig:
      ca: {}
      caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
      cert: {}
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  - bearerTokenSecret:
      key: ""
    honorLabels: true
    interval: 30s
    path: /metrics/probes
    port: https-metrics
    relabelings:
    - action: replace
      sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    scrapeTimeout: 30s
    tlsConfig:
      ca: {}
      caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
      cert: {}
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  - bearerTokenSecret:
      key: ""
    interval: 30s
    port: https-metrics
    relabelings:
    - action: replace
      regex: (.+)(?::\d+)
      replacement: $1:9537
      sourceLabels:
      - __address__
      targetLabel: __address__
    - action: replace
      replacement: crio
      sourceLabels:
      - endpoint
      targetLabel: endpoint
    - action: replace
      replacement: crio
      targetLabel: job
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kubelet
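
The last endpoint in the ServiceMonitor above rewrites the scrape address so that the CRI-O metrics port is used instead of the kubelet port. A minimal Python sketch of that one relabeling follows. It assumes that Prometheus anchors relabel regexes to the whole value, and the helper name rewrite_address is illustrative only, not part of any Prometheus API.

```python
import re

# Regex and target port copied from the relabeling rule above.
ADDRESS_RE = re.compile(r"(.+)(?::\d+)")
CRIO_PORT = "9537"


def rewrite_address(address: str) -> str:
    """Mimic the 'replace' relabeling applied to __address__."""
    # fullmatch stands in for the implicit anchoring of relabel regexes.
    match = ADDRESS_RE.fullmatch(address)
    if match is None:
        # A non-matching value is left untouched by a replace action.
        return address
    return f"{match.group(1)}:{CRIO_PORT}"
```

An address without a port does not match the regex and therefore keeps its original value.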

 

Currently, Dev Sandbox clusters send the clusterType "OSD" instead of "DEVSANDBOX" because the configuration annotations of the console config are automatically overridden by some SyncSets.

Open Dev Sandbox, open the browser console, and inspect window.SERVER_FLAGS.telemetry.

Description of the problem:

In Staging (BE v2.12.4), the aforementioned error message appears while the agent upgrades after host discovery.

In addition, hosts move to pending-input status, which implies that the user should provide input, although the status should perhaps be not ready instead.

How reproducible:

100%

Steps to reproduce:

1. Create a cluster and discover hosts. Stay on the discovery page (no network set)

2. Upgrade the agent

3. Error messages about failing to register hosts appear

4. Hosts are in pending-input status

Actual results:

 

Expected results:

 

Description of problem:

Part of epic: https://issues.redhat.com/browse/RHSTOR-3545
UI story: https://issues.redhat.com/browse/RHSTOR-4211
UI task: https://issues.redhat.com/browse/RHSTOR-4329

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-8468. The following is the description of the original issue:

Description of problem:

RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861), but aws-sdk-go needs to be bumped to recognize those regions

Version-Release number of selected component (if applicable):

master/4.14

How reproducible:

always

Steps to Reproduce:

1. openshift-install create install-config
2. Try to select ap-south-2 as a region
3.

Actual results:

New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.

Expected results:

Installer supports and displays the new regions in the Survey

Additional info:

See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23

 

This is a clone of issue OCPBUGS-10343. The following is the description of the original issue:

Description of problem:

When deploying hosts using ironic's agent both the ironic service address and inspector address are required.

The ironic service is proxied such that it can be accessed at a consistent endpoint regardless of where the pod is running. This is not the case for the inspection service.

This means that if the inspection service moves after we find the address, provisioning will fail.

In particular this non-matching behavior is frustrating when using the CBO [GetIronicIP function|https://github.com/openshift/cluster-baremetal-operator/blob/6f0a255fdcc7c0e5c04166cb9200be4cee44f4b7/provisioning/utils.go#L95-L127] as one return value is usable forever but the other needs to somehow be re-queried every time the pod moves.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Relatively

Steps to Reproduce:

1. Retrieve the inspector IP from GetIronicIP
2. Reschedule the inspector service pod
3. Provision a host

Actual results:

Ironic python agent raises an exception

Expected results:

Host provisions

Additional info:

This was found while deploying clusters using ZTP

In this scenario specifically an image containing the ironic inspector IP is valid for an extended period of time. The same image can be used for multiple hosts and possibly multiple different spoke clusters.

Our controller shouldn't be expected to watch the ironic pod to ensure we update the image whenever it moves. The best we can do is re-query the inspector IP whenever a user makes changes to the image, but that may still not be often enough.

Description of problem:

We've observed a split brain case for keepalived unicast, where two worker nodes were fighting for the ingress VIP. 
One of these nodes failed to register itself with the cluster, so it was missing from the output of the node list. That, in turn, caused it to be missing from the unicast_peer list in keepalived. This one node believed it was the master (it was not receiving VRRP from the other nodes), while the other nodes were constantly re-electing a master.

This behavior was observed in a QE-deployed cluster on PSI. It caused constant VIP flapping and a huge load on OVN.

Version-Release number of selected component (if applicable):


How reproducible:

Not sure. We don't know why the worker node failed to register with the cluster (the cluster is gone now) or what QE was testing at the time.

Steps to Reproduce:

1.
2.
3.

Actual results:

The cluster was unhealthy due to the constant Ingress VIP failover. It was also putting a huge load on PSI cloud.

Expected results:

A flapping VIP can be very expensive for the underlying infrastructure. In no way should we allow OCP to bring the underlying infra down.

A node should not be able to claim the VIP when using keepalived in unicast mode unless it has correctly registered with the cluster and appears in the node list.

Additional info:


Description of problem:

When we configure a MC using an osImage that cannot be pulled, the machine config daemon pod spams logs saying that the node is set to "Degraded" state, but the node is not set to "Degraded" state.

Only after a long time, about 20 to 30 minutes, does the node eventually become degraded.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-26-111919

How reproducible:

Always

Steps to Reproduce:

1. Create a MC using an osImage that cannot be pulled

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2022-09-27T12:48:13Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  name: not-pullable-image-tc54054-w75j1k67
  resourceVersion: "374500"
  uid: 7f828fbc-8da3-4f16-89e2-34e39ff830b3
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files: []
    systemd:
      units: []
  osImageURL: quay.io/openshifttest/tc54054fakeimage:latest


2. Check the logs in the machine config daemon pod, you can see this message being spammed, saying that the daemon is marking the node with "Degraded" status.

E0927 14:31:22.858546    1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found
E0927 14:34:10.698564    1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found
E0927 14:36:58.557340    1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found


Actual results:

The node is not marked as degraded as it should be. Only after a long time, 20 minutes or so, does the node become degraded.

Expected results:

When the podman pull command fails and the machine config daemon sets the node state as "Degraded", the node should actually be marked as "Degraded".

Additional info:

 

Description of problem:

Create a new host and cluster folder, qe-cluster, under the datacenter, and move the cluster workloads into that folder.

$ govc find -type r
/OCP-DC/host/qe-cluster/workloads

Use the install-config.yaml file below to create a single-zone cluster.

apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: 
    vsphere:
      cpus: 4
      memoryMB: 8192
      osDisk:
        diskSizeGB: 60
      zones:
        - us-east-1
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere: 
      cpus: 4
      memoryMB: 16384 
      osDisk:
        diskSizeGB: 60
      zones:
        - us-east-1
  replicas: 3
metadata:
  name: jima-permission
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.19.46.0/24
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    apiVIP: 10.19.46.99
    cluster: qe-cluster/workloads
    datacenter: OCP-DC
    defaultDatastore: my-nfs
    ingressVIP: 10.19.46.98
    network: "VM Network"
    username: administrator@vsphere.local
    password: xxx
    vCenter: xxx
    vcenters:
    - server: xxx
      user: administrator@vsphere.local
      password: xxx
      datacenters:
      - OCP-DC
    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      topology:
        datacenter: OCP-DC
        computeCluster: /OCP-DC/host/qe-cluster/workloads
        networks:
        - "VM Network"
        datastore: my-nfs
      server: xxx
pullSecret: xxx 

The installer fails with this error:

$ ./openshift-install create cluster --dir ipi5 --log-level debug
DEBUG   Generating Platform Provisioning Check...  
DEBUG   Fetching Common Manifests...               
DEBUG   Reusing previously-fetched Common Manifests 
DEBUG Generating Terraform Variables...            
FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get vSphere network ID: could not find vSphere cluster at /OCP-DC/host//OCP-DC/host/qe-cluster/workloads: cluster '/OCP-DC/host//OCP-DC/host/qe-cluster/workloads' not found 
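
The doubled prefix in the error suggests that /OCP-DC/host/ was prepended to a computeCluster value that was already an absolute inventory path. The Python sketch below illustrates one way such a value could be normalized. It is an assumption for illustration, not the installer's actual fix, and compute_cluster_path is a hypothetical helper.

```python
def compute_cluster_path(datacenter: str, compute_cluster: str) -> str:
    """Return an absolute inventory path for a vSphere compute cluster."""
    if compute_cluster.startswith("/"):
        # Already absolute, as in failureDomains[].topology.computeCluster.
        return compute_cluster
    # Relative form, as in platform.vsphere.cluster ("qe-cluster/workloads").
    return f"/{datacenter}/host/{compute_cluster}"
```

With this rule, both forms used in the install-config above resolve to /OCP-DC/host/qe-cluster/workloads.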
 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

always

Steps to Reproduce:

1. Create a new host/cluster folder under the datacenter, and move the vSphere cluster into that folder
2. Prepare an install-config with zone configuration
3. Deploy the cluster

Actual results:

Cluster creation fails

Expected results:

Cluster creation succeeds

Additional info:

 

Description of problem:

The Event Source option is visible even when neither Knative-Eventing nor Knative-Serving has been created.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Install the OpenShift Serverless Operator and don't create Knative-Eventing and Knative-Serving.
2. Go to the Add page. The Event Source option is visible under the Eventing card.

Actual results:

Event Source option is visible under Eventing card.

Expected results:

The Event Source option should not be visible until Knative-Eventing is created.

Additional info:

 

This is a clone of issue OCPBUGS-8483. The following is the description of the original issue:

Description of problem:

We merged a change into origin to modify a test so that `/readyz` would be used as the health check path. It turns out this makes things worse because we want to use kube-proxy's health probe endpoint to monitor the node health, and kube-proxy only exposes `/healthz` which is the default path anyway.

We should remove the annotation added to change the path and go back to the defaults.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-10829. The following is the description of the original issue:

Description of problem:

In a 7-day reliability test, kube-apiserver's memory usage keeps increasing. The maximum is over 3 GB.
In our 4.12 test result, kube-apiserver's memory usage was stable at around 1.7 GB and did not keep increasing.
I'll redo the test on a 4.12.0 build to see if I can reproduce this issue.

I'll run a test longer than 7 days to see how high the memory usage can grow.

About Reliability Test
https://github.com/openshift/svt/tree/master/reliability-v2

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-14-053612

How reproducible:

Always

Steps to Reproduce:

1. Install an AWS cluster with m5.xlarge type
2. Run reliability test for 7 days
Reliability Test Configuration example:
https://github.com/openshift/svt/tree/master/reliability-v2#groups-and-tasks-1
Config used in this test:
admin: 1 user
dev-test: 15 users
dev-prod: 1 user 
3. Use dittybopper dashboard to monitor the kube-apiserver's memory usage

Actual results:

kube-apiserver's memory usage keeps increasing. The maximum is over 3 GB

Expected results:

kube-apiserver's memory usage should not keep increasing

Additional info:

Screenshots are uploaded to shared folder OCPBUGS-10829 - Google Drive

413-kube-apiserver-memory.png
413-api-performance-last2d.png - test was stopped on [2023-03-24 04:21:10 UTC]
412-kube-apiserver-memory.png
must-gather.local.525817950490593011.tar.gz - 4.13 cluster's must gather

Description of problem:

OpenShift UPI on vSphere reports the following error: No-op: Unable to retrieve machine from node "/worker7.ocpdev.openshiftop.hbcbsnj.local": expecting one machine for node worker7.ocpdev.openshiftop.hbcbsnj.local, got: []

Version-Release number of selected component (if applicable):

Openshift 4.8 and 4.10

How reproducible:

 

Steps to Reproduce:

1. Install a fresh cluster on vSphere through UPI
2. The cluster is in a healthy state, but the following error is logged for all nodes:
~~~
2022-09-26T17:10:44.371671808Z E0926 17:10:44.371179       1 machinehealthcheck_controller.go:569] No-op: Unable to retrieve machine from node "/worker7.ocpdev.xxx.hbcbsnj.local": expecting one machine for node worker7.ocpdev.xxx.hbcbsnj.local, got: []
2022-09-26T17:10:44.371671808Z E0926 17:10:44.371217       1 machinehealthcheck_controller.go:569] No-op: Unable to retrieve machine from node "/worker7.ocpdev.xxx.hbcbsnj.local": expecting one machine for node worker7.ocpdev.xxx.hbcbsnj.local, got: []
2022-09-26T17:10:44.751996262Z E0926 17:10:44.751950       1 machinehealthcheck_controller.go:569] No-op: Unable to retrieve machine from node "/worker5.ocpdev.xxx.hbcbsnj.local": expecting one machine for node worker5.ocpdev.xxx.hbcbsnj.local, got: []
2022-09-26T17:10:44.751996262Z E0926 17:10:44.751989       1 machinehealthcheck_controller.go:569] No-op: Unable to retrieve machine from node "/worker5.ocpdev.xxx.hbcbsnj.local": expecting one machine for node worker5.ocpdev.xxx.hbcbsnj.local, got: []
~~~ 

Actual results: Receiving the mentioned alerts

Expected results:

We should not receive such alerts, since no machines or machinesets exist

Additional info:

 

 

Description of problem:

Install a fully private cluster on Azure with 4.12.0-0.nightly-2022-11-10-033725. The storage account (sa) for the CoreOS image has public access.

$ az storage account list -g jima-azure-11a-f58lp-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
clusterptkpx    True
imageregistryjimaazrsgcc    False

With the same profile on 4.11.0-0.nightly-2022-11-10-202051, the sa for the CoreOS image is not publicly accessible.

$ az storage account list -g jima-azure-11c-kf9hw-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
clusterr8wv9    False
imageregistryjimaaz9btdx    False 
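
The comparison above can also be scripted. The following Python sketch, a rough illustration rather than part of any test suite, parses the two-column tsv output of the az command shown above and returns the accounts that allow public blob access. The function name public_accounts is hypothetical, and the sample strings are copied from the outputs in this report.

```python
from typing import List


def public_accounts(tsv_output: str) -> List[str]:
    """Return storage account names whose allowBlobPublicAccess is True."""
    names = []
    for line in tsv_output.splitlines():
        fields = line.split()
        # Expect exactly two columns: name and allowBlobPublicAccess.
        if len(fields) == 2 and fields[1] == "True":
            names.append(fields[0])
    return names


# Outputs reported above for the 4.12 and 4.11 clusters.
OUTPUT_4_12 = "clusterptkpx\tTrue\nimageregistryjimaazrsgcc\tFalse\n"
OUTPUT_4_11 = "clusterr8wv9\tFalse\nimageregistryjimaaz9btdx\tFalse\n"
```

For a fully private cluster, the expectation in this report is that the returned list is empty.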

Checked that terraform-provider-azurerm version is different between 4.11 and 4.12.

4.11: v2.98.0

4.12: v3.19.1

In terraform-provider-azurerm v2.98.0, the property allow_blob_public_access manages sa public access, and its default value is false.

In terraform-provider-azurerm v3.19.1, the property allow_blob_public_access is renamed to allow_nested_items_to_be_public, and its default value is true.

https://github.com/hashicorp/terraform-provider-azurerm/blob/main/CHANGELOG.md#300-march-24-2022

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-10-033725

How reproducible:

always on 4.12

Steps to Reproduce:

1. Install a fully private cluster on Azure with a 4.12 payload
2. 
3.

Actual results:

sa for coreos image is publicly accessible

Expected results:

sa for coreos image should not be publicly accessible

Additional info:

only happened on 4.12

 

 

Description of problem:

When the Insights operator is marked as disabled, the "Available" operator condition is updated every 2 minutes. This is not desired and gives the impression that the operator is restarted every 2 minutes.

Version-Release number of selected component (if applicable):

 

How reproducible:

No extra steps needed, just watch "oc get co insights --watch"

Steps to Reproduce:

1.
2.
3.

Actual results:

The Available condition's transition time is updated every 2 minutes

Expected results:

The Available condition is updated only when its status changes
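
The expected behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not the operator's code: conditions are modeled as plain dictionaries using the usual type, status, reason, and lastTransitionTime fields, and set_condition is a hypothetical helper.

```python
from datetime import datetime, timezone


def set_condition(conditions, cond_type, status, reason, now=None):
    """Update a condition in place; bump lastTransitionTime only on a status change."""
    now = now or datetime.now(timezone.utc)
    for cond in conditions:
        if cond["type"] == cond_type:
            if cond["status"] != status:
                # A real transition: record the new status and its time.
                cond["status"] = status
                cond["lastTransitionTime"] = now
            # The reason may be refreshed without counting as a transition.
            cond["reason"] = reason
            return cond
    # First time this condition type is seen.
    cond = {
        "type": cond_type,
        "status": status,
        "reason": reason,
        "lastTransitionTime": now,
    }
    conditions.append(cond)
    return cond
```

Calling the helper every 2 minutes with an unchanged status leaves lastTransitionTime as it was, so "oc get co insights --watch" would not show a new transition.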

Additional info:

 

Description of problem:

While looking into OCPBUGS-5505, I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check while others do not. 4.11 has an ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the upgrade, and sometimes they do not. This is determined only by a polling race condition: the check is executed once every 10 minutes, and we cancel the polling after the upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and in others we are not and only check while still on the base version.

Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Jan  6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...

Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'

Version-Release number of selected component (if applicable):

4.11+ openshift-tests

How reproducible:

nondeterministic, wild guess is ~30% of upgrade jobs

Steps to Reproduce:

1. Inspect the E2E test log of an upgrade job and compare the time of the update ("Completed upgrade") with the time of the last check ("Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified") done by the admin ack test

Actual results:

Jan 23 00:47:43.842: INFO: Admin Ack verified
Jan 23 00:57:43.836: INFO: Admin Ack verified
Jan 23 01:07:43.839: INFO: Admin Ack verified
Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8

No check done after the upgrade

Expected results:

Jan 23 00:57:37.894: INFO: Admin Ack verified
Jan 23 01:07:37.894: INFO: Admin Ack verified
Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d
Jan 23 01:17:57.937: INFO: Admin Ack verified

One or more checks done after upgrade
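
The race can be checked mechanically from the log timestamps. The Python sketch below compares check times with the completion time and shows one possible way to guarantee a post-upgrade check by running a final check after polling stops. Both helpers are hypothetical and are not the openshift-tests implementation; the year is an assumption needed only to parse the stamps.

```python
from datetime import datetime

# The e2e log stamps carry no year; 2023 is assumed purely for parsing.
ASSUMED_YEAR = "2023"
LOG_TIME_FORMAT = "%Y %b %d %H:%M:%S.%f"


def parse_time(stamp: str) -> datetime:
    """Parse a stamp such as 'Jan 23 01:17:33.474'."""
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", LOG_TIME_FORMAT)


def checked_after_upgrade(check_times, completed) -> bool:
    """True if at least one admin ack check ran after the upgrade completed."""
    done = parse_time(completed)
    return any(parse_time(stamp) > done for stamp in check_times)


def poll_with_final_check(check, upgrade_done, wait):
    """Poll while the upgrade runs, then run one last check unconditionally."""
    while not upgrade_done():
        check()
        wait()
    # This check no longer depends on where the poll interval happens to fall.
    check()
```

Applied to the two logs above, checked_after_upgrade reports no post-upgrade check for the actual results and one for the expected results.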

This is a clone of issue OCPBUGS-5461. The following is the description of the original issue:

Description of problem:

When installing a 3 master + 2 worker BM IPv6 cluster with proxy, worker BMHs are failing inspection with the message: "Could not contact ironic-inspector for version discovery: Unable to find a version discovery document". This causes the installation to fail due to nodes with worker role never joining the cluster. However, when installing with no workers, the issue does not reproduce and the cluster installs successfully.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-04-203333

How reproducible:

100%

Steps to Reproduce:

1. Attempt to install an IPv6 cluster with 3 masters + 2 workers and proxy with baremetal installer

Actual results:

Installation never completes because a number of pods are in Pending status

Expected results:

Workers join the cluster and installation succeeds 

Additional info:

$ oc get events
LAST SEEN   TYPE     REASON              OBJECT                               MESSAGE
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-1   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-0   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-0   Hardware inspection started
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-1   Hardware inspection started

In next_best_guess.go, we have logic that will fall back to nurp combos if we don't have one for the current job.

Some important things to remember:

  • right now data doesn't go into the query_results.json file in origin if we do not have the required 100 runs to do a semi-accurate P99.
  • data also doesn't go into the query_results.json file in origin for older releases, we only have 4.13 data in master of origin now.

We recently hit this with gcp ovn micro upgrades, which we don't really have data for, falling back to minor upgrades, and showing big differences for LB backend.

How often are we falling back across CI? How often are we failing if we fall back?
search.ci might help us see this.

Does the list of fallbacks look safe, should some be removed?

Investigate what happens when we move origin master from 4.13 to 4.14. This would be handled in the ci-tools repo when we generate the PR to update this file, and Egli did add some logic for this case. I cannot immediately remember what it does though, fallbacks could be important during this time.

Description of problem:

The SDNPodNotReady annotation is "SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.", but the kube_pod_status_ready metric has no "node" label

# oc -n openshift-sdn get prometheusrules networking-rules -oyaml | grep SDNPodNotReady -C12
...
    - alert: SDNPodNotReady
      annotations:
        message: SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready.
      expr: |
        kube_pod_status_ready{namespace='openshift-sdn', condition='true'} == 0
      for: 10m
      labels:
        severity: warning

see:

# token=`oc create token prometheus-k8s -n openshift-monitoring` 
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?" --data-urlencode "query=kube_pod_status_ready" | jq
...
      {
        "metric": {
          "__name__": "kube_pod_status_ready",
          "condition": "false",
          "container": "kube-rbac-proxy-main",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "openshift-apiserver",
          "pod": "apiserver-8668766666-6rftj",
          "prometheus": "openshift-monitoring/k8s",
          "service": "kube-state-metrics",
          "uid": "687a5a01-62f6-448b-a706-909fc7bc6872"
        },
        "value": [
          1669025915.468,
          "0"
        ]

 

https://github.com/kubernetes/kube-state-metrics/blob/master/docs/pod-metrics.md

also shows that there is no node label for kube_pod_status_ready
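
A mismatch like this can be caught with a simple comparison of the labels referenced in the annotation against the labels present on a sample series. The Python sketch below is one hedged way to do it. The helper missing_labels is hypothetical, the message is copied from the rule above, and the label set is taken from the query result shown earlier.

```python
import re

# Matches template references such as {{ $labels.pod }}.
LABEL_REF = re.compile(r"\$labels\.([A-Za-z_][A-Za-z0-9_]*)")


def missing_labels(template: str, metric_labels) -> set:
    """Return labels referenced in the template but absent from the series."""
    referenced = set(LABEL_REF.findall(template))
    return referenced - set(metric_labels)


# Annotation from the SDNPodNotReady rule above.
MESSAGE = "SDN pod {{ $labels.pod }} on node {{ $labels.node }} is not ready."

# Label names from the kube_pod_status_ready sample in the query result above.
SAMPLE_LABELS = {
    "__name__", "condition", "container", "endpoint", "job",
    "namespace", "pod", "prometheus", "service", "uid",
}
```

For the rule and sample above, the only missing label is node.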

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-19-191518

How reproducible:

always

Steps to Reproduce:

1. see the description
2.
3.

Description of problem:

On OpenShift on OpenStack CI, we are running conformance serial tests, and the test [sig-cli] oc status can show correct status after switching between projects [apigroup:project.openshift.io][apigroup:image.openshift.io][Serial] [Suite:openshift/conformance/serial] is leaking some resources in error status on the cluster, which provoke failures in our automation:

(shiftstack) [stack@undercloud-0 ~]$ KUBECONFIG=~/.kube/config ./openshift-tests run-test "[sig-cli] oc status can show correct status after switching between projects [apigroup:project.openshift.io][apigroup:image.openshift.io][Serial] [Suite:openshift/conformance/serial]"
Dec 12 09:42:09.333: INFO: Enabling in-tree volume drivers                                                                                                                                   
[BeforeEach] TOP-LEVEL                                                                                                                                                                        
  github.com/openshift/origin/test/extended/util/framework.go:1486                                                                                                                           
[BeforeEach] TOP-LEVEL                                                                                                                                                                        
  github.com/openshift/origin/test/extended/util/framework.go:1486                                                                                                                           
[BeforeEach] TOP-LEVEL                                                                                                                                                                       
  github.com/openshift/origin/test/extended/util/test.go:58                                                                                                                                   
[BeforeEach] [sig-cli] oc status                                                                                                                                                             
  github.com/openshift/origin/test/extended/util/client.go:160                                                                                                                                
STEP: Creating a kubernetes client 12/12/22 09:42:10.19                                                                                                                                      
[BeforeEach] [sig-cli] oc status                                                                                                                                                              
  github.com/openshift/origin/test/extended/util/client.go:134                                                                                                                               
Dec 12 09:42:10.609: INFO: configPath is now "/tmp/configfile1329447316"                                                                                                                      
Dec 12 09:42:10.609: INFO: The user is now "e2e-test-oc-status-qkkkd-user"                                                                                                                    
Dec 12 09:42:10.609: INFO: Creating project "e2e-test-oc-status-qkkkd"                                                                                                                       
Dec 12 09:42:10.780: INFO: Waiting on permissions in project "e2e-test-oc-status-qkkkd" ...                                                                                                  
Dec 12 09:42:10.854: INFO: Waiting for ServiceAccount "default" to be provisioned...                                                                                                         
Dec 12 09:42:10.960: INFO: Waiting for ServiceAccount "deployer" to be provisioned...                                                                                                        
Dec 12 09:42:11.069: INFO: Waiting for ServiceAccount "builder" to be provisioned...                                                                                                         
Dec 12 09:42:11.177: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...                                                                                               
Dec 12 09:42:11.185: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...                                                                                              
Dec 12 09:42:11.196: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...                                                                                                   
Dec 12 09:42:11.465: INFO: Project "e2e-test-oc-status-qkkkd" has been fully provisioned.                                                                                                    
[It] can show correct status after switching between projects [apigroup:project.openshift.io][apigroup:image.openshift.io][Serial] [Suite:openshift/conformance/serial]                      
  github.com/openshift/origin/test/extended/cli/status.go:31                                                                                                                                 
Dec 12 09:42:11.466: INFO: Running 'oc --namespace=e2e-test-oc-status-qkkkd --kubeconfig=/tmp/configfile1329447316 status --all-namespaces'                                                  
Dec 12 09:42:11.615: INFO: Running 'oc --namespace=e2e-test-oc-status-qkkkd --kubeconfig=/tmp/configfile1329447316 status -A'                                                                
STEP: create a new project 12/12/22 09:42:11.811                                                                                                                                             
Dec 12 09:42:11.812: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 new-project e2e-test-oc-status-qkkkd-project-bar --display-name=my project --description=test project'         
Now using project "e2e-test-oc-status-qkkkd-project-bar" on server "https://api.ostest.shiftstack.com:6443".                                                                                 
                                                                                                                                                                                             
    kubectl create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname                                                                          
Dec 12 09:42:12.197: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 project'                                                                                                       
STEP: make sure `oc status` does not use "no projects" message if there is a project created 12/12/22 09:42:12.291
Dec 12 09:42:12.291: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 status'
STEP: create a second project 12/12/22 09:42:12.459
Dec 12 09:42:12.459: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 new-project e2e-test-oc-status-qkkkd-project-bar-2 --display-name=my project 2 --description=test project 2'
Now using project "e2e-test-oc-status-qkkkd-project-bar-2" on server "https://api.ostest.shiftstack.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname
Dec 12 09:42:12.702: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 project'
STEP: delete the current project `e2e-test-oc-status-qkkkd-project-bar-2` and make sure `oc status` does not return the "no projects" message since `e2e-test-oc-status-qkkkd-project-bar` still exists 12/12/22 09:42:12.805
Dec 12 09:42:12.806: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 delete project e2e-test-oc-status-qkkkd-project-bar-2'
Dec 12 09:42:13.060: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 project e2e-test-oc-status-qkkkd-project-bar'
Now using project "e2e-test-oc-status-qkkkd-project-bar" on server "https://api.ostest.shiftstack.com:6443".
Dec 12 09:42:13.180: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 delete project e2e-test-oc-status-qkkkd-project-bar'
project.project.openshift.io "e2e-test-oc-status-qkkkd-project-bar" deleted
Dec 12 09:42:19.659: INFO: Running 'oc --namespace=e2e-test-oc-status-qkkkd --kubeconfig=/tmp/configfile1329447316 get projects'
Dec 12 09:42:19.797: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 new-project e2e-test-oc-status-qkkkd-project-status --display-name=my project --description=test project'
Now using project "e2e-test-oc-status-qkkkd-project-status" on server "https://api.ostest.shiftstack.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:


                                                                                                                                                                                             
    kubectl create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname
STEP: verify jobs are showing in status 12/12/22 09:42:20.152
Dec 12 09:42:20.152: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 create job pi --image=image-registry.openshift-image-registry.svc:5000/openshift/tools:latest -- perl -Mbignum=bpi -wle 'print bpi(2000)''
Warning: would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "pi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "pi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "pi" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "pi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
job.batch/pi created
Dec 12 09:42:20.285: INFO: Running 'oc --kubeconfig=/tmp/configfile1329447316 status'
[AfterEach] [sig-cli] oc status
  github.com/openshift/origin/test/extended/util/client.go:158
Dec 12 09:42:20.483: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-oc-status-qkkkd-user}, err: <nil>
Dec 12 09:42:20.497: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-oc-status-qkkkd}, err: <nil>
Dec 12 09:42:20.511: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~0QodLJemrdDv5q4xiYIvaUD4-E0JK4s6JN3BXnEkLto}, err: <nil>
[AfterEach] [sig-cli] oc status
  github.com/openshift/origin/test/extended/util/client.go:159
STEP: Destroying namespace "e2e-test-oc-status-qkkkd" for this suite. 12/12/22 09:42:20.511

A namespace with Error pods remains on the cluster:

$ oc get pods -n e2e-test-oc-status-qkkkd-project-status
NAME       READY   STATUS   RESTARTS   AGE
pi-85cpc   0/1     Error    0          8s
pi-t9qlf   0/1     Error    0          13s
pi-xd5dd   0/1     Error    0          3s

$ oc logs -n e2e-test-oc-status-qkkkd-project-status pi-85cpc
Can't locate bignum.pm in @INC (you may need to install the bignum module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5).
BEGIN failed--compilation aborted.
                                                                                                                                              

Version-Release number of selected component (if applicable):

$ git branch
* release-4.12
$ git log | head
commit 7c39a7d52e43c54a0cef2cf83900e48c9ab73009
Author: OpenShift Merge Robot <openshift-merge-robot@users.noreply.github.com>
Date:   Mon Dec 5 13:28:17 2022 -0500

    Merge pull request #27588 from vrutkovs/4.12-bump-k8s
    
    OCPBUGS-2927: [release-4.12] Bump kubernetes to latest release-4.12

How reproducible:

Always

Steps to Reproduce:

1. Run the conformance test case:

[sig-cli] oc status can show correct status after switching between projects [apigroup:project.openshift.io][apigroup:image.openshift.io][Serial] [Suite:openshift/conformance/serial]

2. Check the namespace generated and the remaining pods in error.
3.

Actual results:

The test passes but leaks resources, which causes the post-checks in our automation to fail.

Expected results:

Test passing without leaks

Additional info:

N/A

Description of problem:

Catalog pod restarting frequently after one stack trace daily.

~~~
$ omc logs catalog-operator-f7477865d-x6frl -p
2023-01-04T13:05:15.175952229Z time="2023-01-04T13:05:15Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
2023-01-04T13:05:15.175952229Z fatal error: concurrent map read and map write
2023-01-04T13:05:15.178587884Z
2023-01-04T13:05:15.178674833Z goroutine 669 [running]:
2023-01-04T13:05:15.179284556Z runtime.throw({0x1efdc12, 0xc000580000})
2023-01-04T13:05:15.179458107Z 	/usr/lib/golang/src/runtime/panic.go:1198 +0x71 fp=0xc00559d098 sp=0xc00559d068 pc=0x43bcd1
2023-01-04T13:05:15.179707701Z runtime.mapaccess1_faststr(0x7f39283dd878, 0x10, {0xc000894c40, 0xf})
2023-01-04T13:05:15.179932520Z 	/usr/lib/golang/src/runtime/map_faststr.go:21 +0x3a5 fp=0xc00559d100 sp=0xc00559d098 pc=0x418ca5
2023-01-04T13:05:15.180181245Z github.com/operator-framework/operator-lifecycle-manager/pkg/metrics.UpdateSubsSyncCounterStorage(0xc00545cfc0)
~~~
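
The `concurrent map read and map write` fatal error in the trace above is what the Go runtime raises when a plain map is accessed from several goroutines without synchronization. A minimal sketch of the usual remedy, guarding the map with a `sync.RWMutex`, follows. The `syncCounterStore` type and its methods are hypothetical names for illustration and are not the operator-lifecycle-manager code or its actual fix.

```go
package main

import (
	"fmt"
	"sync"
)

// syncCounterStore is a hypothetical stand-in for a per-subscription
// counter cache that several goroutines read and write.
type syncCounterStore struct {
	mu     sync.RWMutex
	counts map[string]int
}

func newSyncCounterStore() *syncCounterStore {
	return &syncCounterStore{counts: make(map[string]int)}
}

// Inc takes the write lock before mutating the map.
func (s *syncCounterStore) Inc(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[name]++
}

// Get takes the read lock, so concurrent readers do not block each other.
func (s *syncCounterStore) Get(name string) int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.counts[name]
}

// incConcurrently increments and reads the same key from n goroutines.
func incConcurrently(s *syncCounterStore, name string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.Inc(name)
			_ = s.Get(name)
		}()
	}
	wg.Wait()
}

func main() {
	store := newSyncCounterStore()
	incConcurrently(store, "example-subscription", 100)
	fmt.Println(store.Get("example-subscription"))
}
```

Without the lock, the same access pattern can abort the process exactly as shown in the log, because this runtime check cannot be recovered from.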

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Slack discussion: https://redhat-internal.slack.com/archives/C3VS0LV41/p1673120541153639

MG link - https://attachments.access.redhat.com/hydra/rest/cases/03396604/attachments/25f23643-2447-442b-ba26-4338b679b8cc?usePresignedUrl=true

 

Context:

We currently convey cloud creds issues in ValidOIDCConfiguration and ValidAWSIdentityProvider conditions.

The HO relies on those conditions (https://github.com/openshift/hypershift/blob/9e4127055dd7be9cfe4fc8427c39cee27a86efcd/hypershift-operator/controllers/hostedcluster/internal/platform/aws/aws.go#L293) to decide whether forced deletion should be applied, potentially leaving resources behind in the cloud intentionally (e.g. use case: OIDC creds were broken out of band).

The CPO relies on those to wait for deletion of guest cluster resources https://github.com/openshift/hypershift/blob/8596f7f131169a19c6a67dc6ce078c50467de648/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L284-L299

DoD:

When any of the cases above results in the "move kube deletion forward skipping cloud resource deletion" path, we should send a metric so that consumers / SREs are aware of it and can use it to notify customers in conjunction with https://issues.redhat.com/browse/SDA-8613

 

Description of problem:

During an EUS-to-EUS upgrade (4.10.38 -> 4.11.13 -> 4.12.0-rc.0), after the control-plane nodes were upgraded to 4.12 successfully, the worker pool was unpaused to update the worker nodes. The worker nodes failed to update, and the worker pool became degraded:
```
# ./oc get node
NAME                                                   STATUS                     ROLES    AGE     VERSION
jliu410-6hmkz-master-0.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-master-1.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-master-2.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-worker-a-xdwvv.c.openshift-qe.internal   Ready,SchedulingDisabled   worker   4h31m   v1.23.12+6b34f32
jliu410-6hmkz-worker-b-9hnb8.c.openshift-qe.internal   Ready                      worker   4h31m   v1.23.12+6b34f32
jliu410-6hmkz-worker-c-bdv4f.c.openshift-qe.internal   Ready                      worker   4h31m   v1.23.12+6b34f32
...
# ./oc get co machine-config
machine-config   4.12.0-rc.0   True        False         True       3h41m   Failed to resync 4.12.0-rc.0 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)]
...
# ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-b81233204496767f2fe32fbb6cb088e1   True      False      False      3              3                   3                     0                      4h10m
worker   rendered-worker-a2caae543a144d94c17a27e56038d4c4   False     True       True       3              0                   0                     1                      4h10m
...
# ./oc describe mcp worker
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2022-11-14T07:19:42Z
    Message:               Node jliu410-6hmkz-worker-a-xdwvv.c.openshift-qe.internal is reporting: "Error checking type of update image: error running skopeo inspect --no-tags --retry-times 5 --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c01b0ae9870dbee5609c52b4d649334ce6854fff1237f1521929d151f6876daa: exit status 1\ntime=\"2022-11-14T07:42:47Z\" level=fatal msg=\"unknown flag: --no-tags\"\n"
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
...
# ./oc logs machine-config-daemon-mg2zn
E1114 08:11:27.115577  192836 writer.go:200] Marking Degraded due to: Error checking type of update image: error running skopeo inspect --no-tags --retry-times 5 --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c01b0ae9870dbee5609c52b4d649334ce6854fff1237f1521929d151f6876daa: exit status 1
time="2022-11-14T08:11:25Z" level=fatal msg="unknown flag: --no-tags"
```

Version-Release number of selected component (if applicable):

4.12.0-rc.0

How reproducible:

 

Steps to Reproduce:

1. Perform an EUS upgrade along the path 4.10.38 -> 4.11.13 -> 4.12.0-rc.0 with the worker pool paused
2. After the master pool upgrade succeeds, unpause the worker pool
3.

Actual results:

Worker pool upgrade fails

Expected results:

Worker pool upgrade succeeds

Additional info:

 

Description of the problem:

fio's filename can be a colon-separated list of devices to test. This however breaks some paths with colons, so these need to be escaped.
See fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-filename

So when we pass paths like 

/dev/disk/by-path/pci-0000:06:0000.0 

fio would write bytes to

/dev/disk/by-path/pci-0000 

which may be an entirely different disk. Instead we need to escape colons:

 /dev/disk/by-path/pci-0000\:06\:0000.0 
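
The escaping described above can be sketched as a small helper. The function name `escapeFioFilename` is a hypothetical choice for illustration, not the name used in the actual fix, and the sketch assumes the input path has not been escaped already.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeFioFilename prefixes every colon with a backslash so that fio treats
// the path as one file instead of a colon-separated list of files.
// Assumption: the input is a raw, unescaped path.
func escapeFioFilename(path string) string {
	return strings.ReplaceAll(path, ":", `\:`)
}

func main() {
	fmt.Println(escapeFioFilename("/dev/disk/by-path/pci-0000:06:0000.0"))
}
```

Paths without colons, such as `/dev/sda`, pass through unchanged, so the helper can be applied to every device path before it is handed to fio.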

How reproducible:

Always in OKD SNO; not sure how easy it is to reproduce on RHCOS

Steps to reproduce:

1. Boot from a very small disk (e.g. 6GB) or use an OKD type of deployment, as it overlays a 6GB RAM disk

Actual results:

disk-speed-check fails with "no space on disk" as `fio` creates /dev/disk/by-path/pci-0000 in container

Expected results:

fio test passes

This is a clone of issue OCPBUGS-10207. The following is the description of the original issue:

Description of problem:

When the releaseImage is a digest, for example quay.io/openshift-release-dev/ocp-release@sha256:bbf1f27e5942a2f7a0f298606029d10600ba0462a09ab654f006ce14d314cb2c, a spurious warning is output when running
openshift-install agent create image

It's not calculating the releaseImage properly (see the '@sha256' suffix below), so it causes this spurious message:
WARNING The ImageContentSources configuration in install-config.yaml should have at-least one source field matching the releaseImage value quay.io/openshift-release-dev/ocp-release@sha256 

This can cause confusion for users.
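
The `@sha256` suffix in the warning is consistent with cutting the reference at its last colon, which only works for tagged images. As an illustrative sketch, a helper that strips a digest first and a tag second could look like the following. `repositoryOf` is a hypothetical name and this is not the installer's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// repositoryOf returns an image reference without its digest or tag.
func repositoryOf(image string) string {
	// A digest reference has the form repo@sha256:<hex>, so cut at '@' first.
	if i := strings.Index(image, "@"); i != -1 {
		return image[:i]
	}
	// Otherwise strip a tag, but only when the last colon comes after the
	// last slash; an earlier colon belongs to a registry port.
	lastColon := strings.LastIndex(image, ":")
	lastSlash := strings.LastIndex(image, "/")
	if lastColon > lastSlash {
		return image[:lastColon]
	}
	return image
}

func main() {
	fmt.Println(repositoryOf("quay.io/openshift-release-dev/ocp-release@sha256:bbf1f27e5942a2f7a0f298606029d10600ba0462a09ab654f006ce14d314cb2c"))
}
```

With the digest handled separately, the value compared against the ImageContentSources entries would be the bare repository rather than a string ending in `@sha256`.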

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Every time a release image with a digest is used

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-11479. The following is the description of the original issue:

Description of problem:

There is an error when creating the image:
FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-06-060829

How reproducible:

always

Steps to Reproduce:

1. Prepare the agent-config.yaml and install-config.yaml files

2. Run 'bin/openshift-install agent create image --log-level debug'

3. The following output contains errors:
DEBUG extracting /usr/bin/agent-tui to /home/core/.cache/agent/files_cache, oc image extract --path /usr/bin/agent-tui:/home/core/.cache/agent/files_cache --confirm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c11d31d47db4afb03e4a4c8c40e7933981a2e3a7ef9805a1413c441f492b869b 
DEBUG Fetching image from OCP release (oc adm release info --image-for=agent-installer-node-agent --insecure=true registry.ci.openshift.org/ocp/release@sha256:83caa0a8f2633f6f724c4feb517576181d3f76b8b76438ff752204e8c7152bac) 
DEBUG extracting /usr/lib64/libnmstate.so.1.3.3 to /home/core/.cache/agent/files_cache, oc image extract --path /usr/lib64/libnmstate.so.1.3.3:/home/core/.cache/agent/files_cache --confirm quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c11d31d47db4afb03e4a4c8c40e7933981a2e3a7ef9805a1413c441f492b869b 
DEBUG File /usr/lib64/libnmstate.so.1.3.3 was not found, err stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory 
ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors 
FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": stat /home/core/.cache/agent/files_cache/libnmstate.so.1.3.3: no such file or directory  

Actual results:

Image generation fails.

Expected results:

The image should be generated successfully.

Additional info:

 

Description of the problem:

The default OpenShift version of the subsystem tests is determined at package load time with this dynamic code:

if reply, err := userBMClient.Versions.V2ListSupportedOpenshiftVersions(context.Background(),
    &versions.V2ListSupportedOpenshiftVersionsParams{}); err == nil {
    // Takes whichever version the range yields first; the order is not guaranteed.
    for openshiftVersion = range reply.GetPayload() {
        break
    }
}
Since this code is not deterministic (the API does not guarantee to return the same version) we get some flakiness in tests:
Examples: 
1. This commit had to set some logic so cluster works with default network type both in 4.12 and prior
2. This thread in CI forum https://app.slack.com/client/T027F3GAJ/C014N2VLTQE/thread/C014N2VLTQE-1666164530.003809 about similar issue in progress bar test

Required Solution:
Make the subsystem tests run with a known, unified version
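
One way to make the selection deterministic is sketched below. It assumes the supported versions arrive as a map keyed by version string. The `pickVersion` helper, the pinned value, and the sorted-key fallback are all hypothetical and only illustrate the idea; they are not the change that was merged.

```go
package main

import (
	"fmt"
	"sort"
)

// pickVersion returns the pinned version when the service supports it.
// Otherwise it falls back to the lexicographically smallest key, which is
// arbitrary but stable across runs, unlike map iteration order.
func pickVersion(supported map[string]struct{}, pinned string) (string, error) {
	if _, ok := supported[pinned]; ok {
		return pinned, nil
	}
	if len(supported) == 0 {
		return "", fmt.Errorf("no supported versions returned")
	}
	keys := make([]string, 0, len(supported))
	for k := range supported {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys[0], nil
}

func main() {
	supported := map[string]struct{}{"4.10": {}, "4.11": {}, "4.12": {}}
	v, err := pickVersion(supported, "4.12")
	fmt.Println(v, err)
}
```

Pinning the version in one place also means a test that depends on version-specific defaults, such as the network type, is exercised against the same version on every run.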

 

 

 

This is a clone of issue OCPBUGS-11930. The following is the description of the original issue:

Description of problem:

VPC endpoint service cannot be cleaned up by HyperShift operator when the OIDC provider of the customer cluster has been deleted.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Sometimes

Steps to Reproduce:

1. Create a HyperShift hosted cluster
2. Delete the HyperShift cluster's OIDC provider in AWS
3. Delete the HyperShift hosted cluster

Actual results:

Cluster is stuck deleting

Expected results:

Cluster deletes

Additional info:

The hypershift operator is stuck trying to delete the AWS endpoint service but it can't be deleted because it gets an error that there are active connections.

This is a clone of issue OCPBUGS-8082. The following is the description of the original issue:

Description of problem:

Currently, some of the ServiceAccounts are lost during gathering. This task fixes that problem.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4998. The following is the description of the original issue:

If the cluster enters the installing-pending-user-action state in assisted-service, it will not recover absent user action.
One way to reproduce this is to have the wrong boot order set in the host, so that it reboots into the agent ISO again instead of the installed CoreOS on disk. (I managed this in dev-scripts by setting a root device hint that pointed to a secondary disk, and only creating that disk once the VM was up. This does not add the new disk to the boot order list, and even if you set it manually it does not take effect until after a full shutdown of the VM - the soft reboot doesn't count.)

Currently we report:

cluster has stopped installing... working to recover installation

in a loop. This is not accurate (unlike in e.g. the install-failed state) - it cannot be recovered automatically.

Also we should only report this, or any other, status once when the status changes, and not continuously in a loop.
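
Reporting a status only on change can be sketched with a small reporter that remembers the last status it emitted. The `statusReporter` type and the message text are hypothetical and only illustrate the idea; the status names other than installing-pending-user-action are placeholders.

```go
package main

import "fmt"

// statusReporter prints a message only when the status differs from the
// last one it reported.
type statusReporter struct {
	last string
	seen bool
}

// Report prints and returns true on a status transition, and stays silent
// (returning false) when the status is unchanged.
func (r *statusReporter) Report(status, message string) bool {
	if r.seen && r.last == status {
		return false
	}
	r.last, r.seen = status, true
	fmt.Printf("%s: %s\n", status, message)
	return true
}

func main() {
	r := &statusReporter{}
	// Simulate a polling loop that observes the same status repeatedly.
	for _, s := range []string{
		"installing",
		"installing-pending-user-action",
		"installing-pending-user-action",
		"installing-pending-user-action",
	} {
		r.Report(s, "placeholder message")
	}
}
```

In this sketch the polling loop can keep running at its normal interval while the log shows one line per transition.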

This is a clone of issue OCPBUGS-11434. The following is the description of the original issue:

Description of problem:

node-exporter profiling shows that ~16% of CPU time is spent fetching details about btrfs mounts. The RHEL kernel doesn't have btrfs, so it's safe to disable this exporter.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

It is very easy to accidentally use the traditional openshift-install wait-for <x>-complete commands instead of the equivalent openshift-install agent wait-for <x>-complete command. This will work in some stages of the install, but show much less information or fail altogether in other stages of the install.
If we can detect from the asset store that this was an agent-based install, we should issue a warning if the user uses the old command.

Description of problem:

Test located at github.com/openshift/origin/test/extended/apiserver/api_requests.go:449 is failing '[It] operators should not create watch channels very often [Suite:openshift/conformance/parallel]':

"Operator \"console-operator\" produces more watch requests than expected: watchrequestcount=115, upperbound=112, ratio=1.0267857142857142"


Version-Release number of selected component (if applicable):

4.13

How reproducible:

Found in 0.07% of runs (0.52% of failures) across 36474 total runs and 3989 jobs (13.68% failed) in 400ms

https://search.ci.openshift.org/?search=console-operator%5C%5C%22+produces+more+watch+requests+than+expected&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:

Console operator exceeds watch request limit

Expected results:

Console operator doesn't exceed watch request limit

Additional info:

 

Description of problem:

"[sig-network-edge] DNS should answer queries using the local DNS endpoint [Suite:openshift/conformance/parallel]" appears to be failing pretty consistently for job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.13-informing#periodic-ci-openshift-multiarch-master[…]htly-4.13-ocp-e2e-aws-ovn-heterogeneous

The above test fails when it is running on an arm64 node.

This test helps determine whether install and upgrade work, and it currently appears to be failing permanently.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

exec format error

Expected results:

 

Additional info:

We can add a node affinity so that it runs only on linux/amd64 nodes until we have complete manifest list support in the app.ci registry, which would resolve the exec format error.

test=[sig-network-edge] DNS should answer queries using the local DNS endpoint [Suite:openshift/conformance/parallel]

This is a clone of issue OCPBUGS-160. The following is the description of the original issue:

Description of problem:

The NS autolabeler should adjust the PSS namespace labels such that a previously permitted workload (based on the SCCs it has access to) can still run.

The autolabeler requires the RoleBinding's .subjects[].namespace to be set when .subjects[].kind is ServiceAccount even though this is not required by the RBAC system to successfully bind the SA to a Role
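
The behaviour the autolabeler would need can be sketched with simplified types: when a ServiceAccount subject omits its namespace, fall back to the namespace of the RoleBinding itself. The `Subject` and `RoleBinding` structs below are reduced, hypothetical stand-ins for the rbac/v1 API types, and `serviceAccountRefs` is not the autolabeler's real function.

```go
package main

import "fmt"

// Subject and RoleBinding are simplified stand-ins for the rbac/v1 types.
type Subject struct {
	Kind      string
	Name      string
	Namespace string
}

type RoleBinding struct {
	Namespace string
	Subjects  []Subject
}

// serviceAccountRefs returns "namespace/name" for every ServiceAccount
// subject, defaulting an empty subject namespace to the RoleBinding's own.
func serviceAccountRefs(rb RoleBinding) []string {
	var refs []string
	for _, s := range rb.Subjects {
		if s.Kind != "ServiceAccount" {
			continue
		}
		ns := s.Namespace
		if ns == "" {
			ns = rb.Namespace
		}
		refs = append(refs, ns+"/"+s.Name)
	}
	return refs
}

func main() {
	rb := RoleBinding{
		Namespace: "test",
		Subjects:  []Subject{{Kind: "ServiceAccount", Name: "mysa"}},
	}
	fmt.Println(serviceAccountRefs(rb))
}
```

With this defaulting, a RoleBinding in namespace `test` whose subject is `mysa` with no namespace resolves to `test/mysa`.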

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.7.0-0.ci-2021-05-21-142747
Server Version: 4.12.0-0.nightly-2022-08-15-150248
Kubernetes Version: v1.24.0+da80cd0

How reproducible: 100%

Steps to Reproduce:

---
apiVersion: v1
kind: Namespace
metadata:
  name: test

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mysa
  namespace: test

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: myrole
  namespace: test
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: myrb
  namespace: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: myrole
subjects:
- kind: ServiceAccount
  name: mysa
  #namespace: test  # This is required for the autolabeler

---
kind: Job
apiVersion: batch/v1
metadata:
  name: myjob
  namespace: test
spec:
  template:
    spec:
      containers:
        - name: ubi
          image: registry.access.redhat.com/ubi8
          command: ["/bin/bash", "-c"]
          args: ["whoami; sleep infinity"]
      restartPolicy: Never
      securityContext:
        runAsUser: 0
      serviceAccount: mysa
      terminationGracePeriodSeconds: 2

Actual results:

Applying the manifest, above, the Job's pod will not start:

$ kubectl -n test describe job/myjob
...
Events:
  Type     Reason        Age   From            Message
  ----     ------        ----  ----            -------
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-zxcvv" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-fkb9x" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  10s   job-controller  Error creating: pods "myjob-5klpc" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Uncommenting the "namespace" field in the RoleBinding will allow it to start as the autolabeler will adjust the Namespace labels.

However, the namespace field isn't actually required by the RBAC system. Instead of relying on the autolabeler, the pod can be allowed to run, without uncommenting the field, by labeling the namespace manually:

$ kubectl label ns/test security.openshift.io/scc.podSecurityLabelSync=false
namespace/test labeled
$ kubectl label ns/test pod-security.kubernetes.io/enforce=privileged --overwrite
namespace/test labeled

 

We now see that the pod is running as root and has access to the privileged scc:

$ kubectl -n test get po -oyaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.2.18/23"],"mac_address":"0a:58:0a:81:02:12","gateway_ips":["10.129.2.1"],"ip_address":"10.129.2.18/23","gateway_ip":"10.129.2.1"}}'
      k8s.v1.cni.cncf.io/network-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      k8s.v1.cni.cncf.io/networks-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      openshift.io/scc: privileged
    creationTimestamp: "2022-08-16T13:08:24Z"
    generateName: myjob-
    labels:
      controller-uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
      job-name: myjob
    name: myjob-rwjmv
    namespace: test
    ownerReferences:
    - apiVersion: batch/v1
      blockOwnerDeletion: true
      controller: true
      kind: Job
      name: myjob
      uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
    resourceVersion: "36418"
    uid: 39f18dea-31d4-4783-85b5-8ae6a8bec1f4
  spec:
    containers:
    - args:
      - whoami; sleep infinity
      command:
      - /bin/bash
      - -c
      image: registry.access.redhat.com/ubi8
      imagePullPolicy: Always
      name: ubi
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6f2h6
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: mysa-dockercfg-mvmtn
    nodeName: ip-10-0-140-172.ec2.internal
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext:
      runAsUser: 0
    serviceAccount: mysa
    serviceAccountName: mysa
    terminationGracePeriodSeconds: 2
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-6f2h6
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
        - configMap:
            items:
            - key: service-ca.crt
              path: service-ca.crt
            name: openshift-service-ca.crt
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://8fd1c3a5ee565a1089e4e6032bd04bceabb5ab3946c34a2bb55d3ee696baa007
      image: registry.access.redhat.com/ubi8:latest
      imageID: registry.access.redhat.com/ubi8@sha256:08e221b041a95e6840b208c618ae56c27e3429c3dad637ece01c9b471cc8fac6
      lastState: {}
      name: ubi
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2022-08-16T13:08:28Z"
    hostIP: 10.0.140.172
    phase: Running
    podIP: 10.129.2.18
    podIPs:
    - ip: 10.129.2.18
    qosClass: BestEffort
    startTime: "2022-08-16T13:08:24Z"
kind: List
metadata:
  resourceVersion: ""

 

$ kubectl -n test logs job/myjob
root

 

Expected results:

The autolabeler should properly follow the RoleBinding back to the SCC

 

Additional info:

If installation fails at an early stage (e.g. pulling release images, configuring hosts, waiting for agents to come up) there is no indication that anything has gone wrong, and the installer binary may not even be able to connect.
We should at least display what is happening on the console so that users have some avenue to figure out for themselves what is going on.

Description of problem

In OpenShift 4.7.0 and 4.6.20, cluster-ingress-operator started using the OpenShift-specific unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds annotation for router pods as a short-term measure to configure the liveness probe's grace period in order to fix OCPBUGSM-20760 (BZ#1899941). This annotation is implemented by a carry patch in openshift/kubernetes.

Since then, upstream Kubernetes has added a terminationGracePeriodSeconds API field to configure the liveness probe using a formal API (upstream doc reference). Using this API field will allow for the carry patch to be removed from openshift/kubernetes.

Example:

spec:
  terminationGracePeriodSeconds: 3600  # pod-level
  containers:
  - name: test
    image: ...

    ports:
    - name: liveness-port
      containerPort: 8080
      hostPort: 8080

    livenessProbe:
      httpGet:
        path: /healthz
        port: liveness-port
      failureThreshold: 1
      periodSeconds: 60
      # Override pod-level terminationGracePeriodSeconds #
      terminationGracePeriodSeconds: 10 
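
The override relationship in the example above can be sketched in Python. This is only an illustration: the helper name and the dict layout mirroring the pod spec are assumptions, not code from cluster-ingress-operator. The probe-level value wins when it is present; otherwise the pod-level value applies.

```python
from typing import Optional


def effective_liveness_grace_period(pod_spec: dict, container_name: str) -> Optional[int]:
    """Return the grace period (seconds) used when the liveness probe fails.

    Returns None when neither level sets the field (the cluster default
    would then apply; that default is not modelled here).
    """
    pod_level = pod_spec.get("terminationGracePeriodSeconds")
    for container in pod_spec.get("containers", []):
        if container.get("name") != container_name:
            continue
        probe = container.get("livenessProbe") or {}
        # The probe-level field, when set, overrides the pod-level value.
        probe_level = probe.get("terminationGracePeriodSeconds")
        return probe_level if probe_level is not None else pod_level
    raise KeyError(f"container {container_name!r} not found")


# Same values as the YAML example above.
spec = {
    "terminationGracePeriodSeconds": 3600,
    "containers": [
        {
            "name": "test",
            "livenessProbe": {
                "failureThreshold": 1,
                "periodSeconds": 60,
                "terminationGracePeriodSeconds": 10,
            },
        }
    ],
}
print(effective_liveness_grace_period(spec, "test"))
```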

Version-Release number of selected component (if applicable)

OpenShift 4.13.

How reproducible

Always.

Steps to Reproduce

1. Check the annotation and API field on a running cluster: oc -n openshift-ingress get pods -Lingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o 'custom-columns=NAME:.metadata.name,ANNOTATION:.metadata.annotations.unsupported\.do-not-use\.openshift\.io\/override-liveness-grace-period-seconds,SPEC:.spec.containers[0].livenessProbe.terminationGracePeriodSeconds'

Actual results

The annotation is set, and the spec field is not:

% oc -n openshift-ingress get pods -Lingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o 'custom-columns=NAME:.metadata.name,ANNOTATION:.metadata.annotations.unsupported\.do-not-use\.openshift\.io\/override-liveness-grace-period-seconds,SPEC:.spec.containers[0].livenessProbe.terminationGracePeriodSeconds'
NAME                              ANNOTATION   SPEC
router-default-677f956f8b-d5lqz   10           <none>
router-default-677f956f8b-hntbb   10           <none>

Expected results

The annotation is not set, and the spec field is:

% oc -n openshift-ingress get pods -Lingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o 'custom-columns=NAME:.metadata.name,ANNOTATION:.metadata.annotations.unsupported\.do-not-use\.openshift\.io\/override-liveness-grace-period-seconds,SPEC:.spec.containers[0].livenessProbe.terminationGracePeriodSeconds'
NAME                              ANNOTATION   SPEC
router-default-677f956f8b-d5lqz   <none>       10
router-default-677f956f8b-hntbb   <none>       10

Description of problem:

Azure Disk volume is taking time to attach/detach
Version-Release number of selected component (if applicable):

OpenShift ARO 4.10.30
How reproducible:

Scaling a StatefulSet down and back up is slow because the pod's volume takes a long time to detach from and attach to nodes.

I have reviewed the must-gather and test output and will share my findings in the comments.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

 https://github.com/openshift/assisted-service/pull/4586/files

The wording should be reviewed by a content writer. Current message:

 
statusRebootTimeout = "Host timed out when pulling ignition. Check the host console and verify that it boots from the OpenShift installation disk $INSTALLATION_DISK and has network access to the cluster API. The installation will resume once the host successfully boots and can access the cluster API"
 

Description of problem:

Support for tech preview API extensions was introduced in https://github.com/openshift/installer/pull/6336 and https://github.com/openshift/api/pull/1274 . In the case of https://github.com/openshift/api/pull/1278 , config/v1/0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml was introduced, which seems to result in both 0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml and 0000_10_config-operator_01_infrastructure-Default.crd.yaml being rendered by the bootstrap. As a result, the bootstrap attempts to create both CRDs. However, one of them (in this case, the tech preview CRD) fails to be created.

We may need to modify the render command to be aware of feature gates when rendering manifests during bootstrap. Also, I'm open to hearing other views on how this might work.
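
A feature-set-aware render step could look roughly like the following Python sketch. It assumes that feature-set-specific manifests can be told apart purely by the two file name suffixes seen in this report; the function name and the third file name are hypothetical, and the real render command may use a different mechanism.

```python
from typing import List

# Assumption: variants are distinguished only by these file name suffixes.
SUFFIX_BY_FEATURE_SET = {
    "Default": "-Default.crd.yaml",
    "TechPreviewNoUpgrade": "-TechPreviewNoUpgrade.crd.yaml",
}


def select_manifests(filenames: List[str], feature_set: str = "Default") -> List[str]:
    """Keep shared manifests plus the variant that matches feature_set."""
    if feature_set not in SUFFIX_BY_FEATURE_SET:
        raise ValueError(f"unknown feature set: {feature_set}")
    other_suffixes = [
        suffix for name, suffix in SUFFIX_BY_FEATURE_SET.items() if name != feature_set
    ]
    selected = []
    for filename in filenames:
        # Skip variants that belong to a different feature set.
        if any(filename.endswith(suffix) for suffix in other_suffixes):
            continue
        selected.append(filename)
    return selected


rendered = [
    "0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml",
    "0000_10_config-operator_01_infrastructure-Default.crd.yaml",
    "some-shared-manifest.yaml",  # hypothetical manifest without a variant suffix
]
print(select_manifests(rendered, "TechPreviewNoUpgrade"))
```

With this filter only one of the two infrastructure CRD files reaches the bootstrap, so the second one is never skipped as already existing.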

Version-Release number of selected component (if applicable):

https://github.com/openshift/cluster-config-operator/pull/269 built and running on 4.12-ec5 

How reproducible:

consistently

Steps to Reproduce:

1. bump the version of OpenShift API to one including a tech preview version of the infrastructure CRD
2. install openshift with the infrastructure manifest modified to incorporate tech preview fields
3. those fields will not be populated upon installation

Also, checking the logs from bootkube will show both being installed, but one of them fails.

Actual results:

 

Expected results:

 

Additional info:

Excerpts from bootkube log
Nov 02 20:40:01 localhost.localdomain bootkube.sh[4216]: Writing asset: /assets/config-bootstrap/manifests/0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml
Nov 02 20:40:01 localhost.localdomain bootkube.sh[4216]: Writing asset: /assets/config-bootstrap/manifests/0000_10_config-operator_01_infrastructure-Default.crd.yaml


Nov 02 20:41:23 localhost.localdomain bootkube.sh[5710]: Created "0000_10_config-operator_01_infrastructure-Default.crd.yaml" customresourcedefinitions.v1.apiextensions.k8s.io/infrastructures.config.openshift.io -n
Nov 02 20:41:23 localhost.localdomain bootkube.sh[5710]: Skipped "0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml" customresourcedefinitions.v1.apiextensions.k8s.io/infrastructures.config.openshift.io -n  as it already exists

Description of problem:

The installer has logic that avoids adding the router CAs to the kubeconfig if the console is not available.  It's not clear why it does this, but it means that the router CAs don't get added when the console is deliberately disabled (it is now an optional capability in 4.12).

Version-Release number of selected component (if applicable):

Seen in 4.12+4.13

How reproducible:

Always, when starting a cluster w/o the Console capability

Steps to Reproduce:

1. Edit the install-config to set:
capabilities:
  baselineCapabilitySet: None
2. install the cluster
3. check the CAs in the kubeconfig, the wildcard route CA will be missing (compare it w/ a normal cluster)

Actual results:

router CAs missing

Expected results:

router CAs should be present

Additional info:

This needs to be backported to 4.12.

Description of problem:

Our pull request to pin down dependencies on the release-4.12 branch of CMO has missed the freeze. 
We are going to compensate it using this bug record, in 3 steps:
1. Pin down jsonnet and update go modules in master branch
2. Backport the pull request in step 1 into the release-4.12 branch
3. Restore the versions in jsonnet dependencies in master branch.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:


Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-12739. The following is the description of the original issue:

Description of problem:

The IPv6 VIP does not seem to be present in the keepalived.conf.

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd65:10:128::/56
    hostPrefix: 64
  machineNetwork:
  - cidr: 192.168.110.0/23
  - cidr: fd65:a1a8:60ad::/112
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
  - fd65:172:16::/112
platform:
  vsphere:
    apiVIPs:
    - 192.168.110.116
    - fd65:a1a8:60ad:271c::1116
    ingressVIPs:
    - 192.168.110.117
    - fd65:a1a8:60ad:271c::1117
    vcenters:
    - datacenters:
      - IBMCloud
      server: ibmvcenter.vmc-ci.devcluster.openshift.com

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-21-084440

How reproducible:

Frequently.
2 failures out of 3 attempts.

Steps to Reproduce:

1. Install vSphere dual-stack with dual VIPs, see above config
2. Check keepalived.conf
for f in $(oc get pods -n openshift-vsphere-infra -l app=vsphere-infra-vrrp --no-headers -o custom-columns=N:.metadata.name  ) ; do oc -n openshift-vsphere-infra exec -c keepalived $f -- cat /etc/keepalived/keepalived.conf | tee $f-keepalived.conf ; done

Actual results:

IPv6 VIP is not in keepalived.conf

Expected results:

vrrp_instance rbrattai_INGRESS_1 {
    state BACKUP
    interface br-ex
    virtual_router_id 129
    priority 20
    advert_int 1

    unicast_src_ip fd65:a1a8:60ad:271c::cc
    unicast_peer {
        fd65:a1a8:60ad:271c:9af:16a9:cb4f:d75c
        fd65:a1a8:60ad:271c:86ec:8104:1bc2:ab12
        fd65:a1a8:60ad:271c:5f93:c9cf:95f:9a6d
        fd65:a1a8:60ad:271c:bb4:de9e:6d58:89e7
        fd65:a1a8:60ad:271c:3072:2921:890:9263
    }
...
    virtual_ipaddress {
        fd65:a1a8:60ad:271c::1117/128
    }
...
}

Additional info:

See OPNET-207

Description of problem:

The service project and the host project both have a private DNS zone named "ipi-xpn-private-zone". Although platform.gcp.privateDNSZone.project is set to the host project, the installer checks the zone in the service project and complains that the DNS name does not match.

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install 4.12.0-0.nightly-2022-10-25-210451
built from commit 14d496fdaec571fa97604a487f5df6a0433c0c68
release image registry.ci.openshift.org/ocp/release@sha256:d6cc07402fee12197ca1a8592b5b781f9f9a84b55883f126d60a3896a36a9b74
release architecture amd64

How reproducible:

Always, if both the service project and the host project have a private DNS zone with the same name.

Steps to Reproduce:

1. try IPI installation to a shared VPC, using "privateDNSZone" of the host project

Actual results:

$ openshift-install create cluster --dir test7
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.privateManagedZone: Invalid value: "ipi-xpn-private-zone": dns zone jiwei-1026a.qe1.gcp.devcluster.openshift.com. did not match expected jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com 
$ 

Expected results:

The installer should check the private zone in the specified project (i.e. the host project).
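
The expected lookup can be sketched in Python as follows. The function names and the `zones` lookup table are hypothetical stand-ins for the installer's validation and the Cloud DNS API; the table also assumes that the unqualified gcloud query in this report ran against the service project openshift-qe.

```python
from typing import Dict, List, Tuple


def private_zone_project(gcp: dict) -> str:
    """Project to query for the private zone: the explicit one, else projectID."""
    zone = gcp.get("privateDNSZone") or {}
    return zone.get("project") or gcp["projectID"]


def validate_private_zone(
    gcp: dict,
    cluster_name: str,
    base_domain: str,
    zones: Dict[Tuple[str, str], str],
) -> List[str]:
    """Compare the zone's DNS name, read from the right project, to the expected one."""
    zone_id = gcp["privateDNSZone"]["id"]
    project = private_zone_project(gcp)
    expected = f"{cluster_name}.{base_domain}."
    actual = zones[(project, zone_id)]
    if actual != expected:
        return [f"dns zone {actual} did not match expected {expected}"]
    return []


gcp = {
    "projectID": "openshift-qe",
    "privateDNSZone": {"id": "ipi-xpn-private-zone", "project": "openshift-qe-shared-vpc"},
}
# (project, zone name) -> DNS name, taken from the gcloud output in this report.
zones = {
    ("openshift-qe-shared-vpc", "ipi-xpn-private-zone"):
        "jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com.",
    ("openshift-qe", "ipi-xpn-private-zone"):
        "jiwei-1026a.qe1.gcp.devcluster.openshift.com.",
}
print(validate_private_zone(gcp, "jiwei-1027a", "qe-shared-vpc.qe.gcp.devcluster.openshift.com", zones))
```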

Additional info:

$ yq-3.3.0 r test7/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  computeSubnet: installer-shared-vpc-subnet-2
  controlPlaneSubnet: installer-shared-vpc-subnet-1
  createFirewallRules: Disabled
  publicDNSZone:
    id: qe-shared-vpc
    project: openshift-qe-shared-vpc
  privateDNSZone:
    id: ipi-xpn-private-zone
    project: openshift-qe-shared-vpc
  network: installer-shared-vpc
  networkProjectID: openshift-qe-shared-vpc
$ yq-3.3.0 r test7/install-config.yaml baseDomain
qe-shared-vpc.qe.gcp.devcluster.openshift.com
$ yq-3.3.0 r test7/install-config.yaml metadata
creationTimestamp: null
name: jiwei-1027a
$ 
$ openshift-install create cluster --dir test7
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.privateManagedZone: Invalid value: "ipi-xpn-private-zone": dns zone jiwei-1026a.qe1.gcp.devcluster.openshift.com. did not match expected jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com 
$ 
$ gcloud --project openshift-qe-shared-vpc dns managed-zones list --filter='name=qe-shared-vpc'
NAME           DNS_NAME                                        DESCRIPTION  VISIBILITY
qe-shared-vpc  qe-shared-vpc.qe.gcp.devcluster.openshift.com.               public
$ gcloud --project openshift-qe-shared-vpc dns managed-zones list --filter='name=ipi-xpn-private-zone'
NAME                  DNS_NAME                                                    DESCRIPTION                         VISIBILITY
ipi-xpn-private-zone  jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com.  Preserved private zone for IPI XPN  private
$ gcloud dns managed-zones list --filter='name=ipi-xpn-private-zone'
NAME                  DNS_NAME                                       DESCRIPTION                         VISIBILITY
ipi-xpn-private-zone  jiwei-1026a.qe1.gcp.devcluster.openshift.com.  Preserved private zone for IPI XPN  private
$ 
$ gcloud --project openshift-qe-shared-vpc dns managed-zones describe qe-shared-vpc
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2020-04-26T02:50:25.172Z'
description: ''
dnsName: qe-shared-vpc.qe.gcp.devcluster.openshift.com.
id: '7036327024919173373'
kind: dns#managedZone
name: qe-shared-vpc
nameServers:
- ns-cloud-b1.googledomains.com.
- ns-cloud-b2.googledomains.com.
- ns-cloud-b3.googledomains.com.
- ns-cloud-b4.googledomains.com.
visibility: public
$ 
$ gcloud --project openshift-qe-shared-vpc dns managed-zones describe ipi-xpn-private-zone         
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2022-10-27T08:05:18.332Z'
description: Preserved private zone for IPI XPN
dnsName: jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com.
id: '5506116785330943369'
kind: dns#managedZone
name: ipi-xpn-private-zone
nameServers:
- ns-gcp-private.googledomains.com.
privateVisibilityConfig:
  kind: dns#managedZonePrivateVisibilityConfig
  networks:
  - kind: dns#managedZonePrivateVisibilityConfigNetwork
    networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc
visibility: private
$ 
$ gcloud dns managed-zones describe ipi-xpn-private-zone
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2022-10-26T06:42:52.268Z'
description: Preserved private zone for IPI XPN
dnsName: jiwei-1026a.qe1.gcp.devcluster.openshift.com.
id: '7663537481778983285'
kind: dns#managedZone
name: ipi-xpn-private-zone
nameServers:
- ns-gcp-private.googledomains.com.
privateVisibilityConfig:
  kind: dns#managedZonePrivateVisibilityConfig
  networks:
  - kind: dns#managedZonePrivateVisibilityConfigNetwork
    networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc
visibility: private
$ 

 

 

Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".

We shouldn't block operator installation if permission issues prevent dynamic plugin installation.

This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a pared-down permission set called "dedicated-admin".

See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD

Description of problem:

When doing openshift-install agent create image, one should not need to provide platform specific data like boot MAC addresses.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Create install-config with only VIPs in Baremetal platform section

apiVersion: v1
metadata:
  name: foo
baseDomain: test.metalkube.org
networking:
  clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
  machineNetwork:
    - cidr: 192.168.122.0/23
  networkType: OpenShiftSDN
  serviceNetwork:
    - 172.30.0.0/16
compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    platform: {}
    replicas: 0
controlPlane:
  name: master
  replicas: 3
  hyperthreading: Enabled
  architecture: amd64
platform:
  baremetal:
    apiVIPs:
      - 192.168.122.10
    ingressVIPs:
      - 192.168.122.11
---
apiVersion: v1beta1
metadata:
  name: foo
rendezvousIP: 192.168.122.14

2. openshift-install agent create image

Actual results:

ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors 
ERROR failed to fetch Agent Installer ISO: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: [platform.baremetal.hosts: Invalid value: []*baremetal.Host(nil): bare metal hosts are missing, platform.baremetal.Hosts: Required value: not enough hosts found (0) to support all the configured ControlPlane replicas (3)]

Expected results:

Image gets generated

Additional info:

We should go into install-config validation code, detect if we are doing agent-based installation and skip the hosts checks
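
A minimal Python sketch of that idea follows. The function name and the `agent_based` flag are hypothetical; how the installer actually detects an agent-based installation is not described in this report. The error strings echo the ones shown above.

```python
from typing import List


def validate_baremetal_hosts(
    hosts: List[dict],
    control_plane_replicas: int,
    agent_based: bool,
) -> List[str]:
    """Return validation errors for platform.baremetal.hosts."""
    errors: List[str] = []
    if agent_based:
        # Agent-based installs do not need hosts in the platform section,
        # so the host checks are skipped.
        return errors
    if not hosts:
        errors.append("bare metal hosts are missing")
    if len(hosts) < control_plane_replicas:
        errors.append(
            f"not enough hosts found ({len(hosts)}) to support all the "
            f"configured ControlPlane replicas ({control_plane_replicas})"
        )
    return errors


print(validate_baremetal_hosts([], 3, agent_based=True))
print(validate_baremetal_hosts([], 3, agent_based=False))
```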

Description of problem:

When the file system is full, update-dns-resolver fails to build a proper /etc/hosts. The resulting /etc/hosts contains only the openshift-generated-node-resolver lines and is missing the localhost lines.

This causes issues for pods that set hostNetwork: true, such as openstack-cinder-csi-driver-controller.

Version-Release number of selected component (if applicable):

OpenShift 4.10.39

How reproducible:

See: https://github.com/openshift/cluster-dns-operator/blob/a5ea3fcb7be49a12115bd6648403df3d65661542/assets/node-resolver/update-node-resolver.sh

Steps to Reproduce:

1. make sure the file system is full when running the cp at line 13
2. 
3.

Actual results:

/etc/hosts is missing the localhost lines

Expected results:

/etc/hosts should contain the localhost lines
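
One possible guard is sketched below in Python (the real script is a shell script, and this is not its fix). It assumes that generated entries end with a marker comment and that a failed copy on a full file system yields an empty working copy; the marker format, function names, and sample entry are all hypothetical.

```python
from typing import List

# Assumption: generated entries end with this marker comment.
MARKER = "# openshift-generated-node-resolver"


def rebuild_hosts(working_copy: str, generated_entries: List[str]) -> str:
    """Drop stale generated lines from the working copy and append fresh ones."""
    kept = [line for line in working_copy.splitlines() if not line.endswith(MARKER)]
    fresh = [f"{entry} {MARKER}" for entry in generated_entries]
    return "\n".join(kept + fresh) + "\n"


def safe_to_install(current_hosts: str, candidate: str) -> bool:
    """True only if every non-generated line of the current file survives."""
    candidate_lines = set(candidate.splitlines())
    return all(
        line in candidate_lines
        for line in current_hosts.splitlines()
        if line.strip() and not line.endswith(MARKER)
    )


current = "127.0.0.1 localhost\n::1 localhost\n"
entries = ["172.30.0.10 registry.example.internal"]  # hypothetical entry

good = rebuild_hosts(current, entries)  # the working copy was complete
bad = rebuild_hosts("", entries)        # the working copy came back empty

print(safe_to_install(current, good))
print(safe_to_install(current, bad))
```

Refusing to install a candidate that lost non-generated lines keeps the previous /etc/hosts, including the localhost entries, in place until space is available again.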

Additional info:


Owner: Architect:

Story (Required)

As an OpenShift user, I would like to list Helm releases as usual.

Background (Required)

We currently use the `/api/helm/releases` endpoint to return Helm releases to the UI. The same API is also used in topology to create a map. In topology we show the manifests and release notes along with the other release information. The list releases page shows none of this, so returning the complete data to the UI there does not make sense.

Glossary

<List of new terms and definition used in this story>

Out of scope

<Defines what is not included in this story>

In Scope

<Defines what is included in this story>

Approach(Required)

We can add a boolean flag called isTopology to the network call and return the complete information if the call is made from topology; otherwise, return limited information.
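
As a rough Python sketch of this approach: the flat release dict and the field names `manifest` and `notes` are assumptions made for illustration, not the actual response schema of `/api/helm/releases`.

```python
from typing import List

# Hypothetical names for the large fields that only topology needs.
HEAVY_FIELDS = ("manifest", "notes")


def release_for_response(release: dict, is_topology: bool) -> dict:
    """Return the full release for topology, a trimmed copy otherwise."""
    if is_topology:
        return release
    return {key: value for key, value in release.items() if key not in HEAVY_FIELDS}


def list_releases(releases: List[dict], is_topology: bool = False) -> List[dict]:
    return [release_for_response(release, is_topology) for release in releases]


releases = [
    {"name": "demo", "version": 1, "manifest": "---\nkind: Service\n", "notes": "Installed."},
]
print(list_releases(releases))
print(list_releases(releases, is_topology=True))
```

Defaulting the flag to false keeps the list page payload small, while topology callers opt in to the full data.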

Demo requirements(Required)

Not Applicable

Dependencies

None

Edge Case

_<Describe edge cases to consider when implementing the story and defining
tests>_

Acceptance Criteria

Api should return response as before and the user experience remains the same.

Document is updated to reflect the change in request body.

Oc helm cli changes.

Development:

QE:
Documentation: Yes/No (needs-docs|upstream-docs / no-doc)

Upstream: <Inputs/Requirement details: Concept/Procedure>/ Not
Applicable

Downstream: <Inputs/Requirement details: Concept/Procedure>/ Not
Applicable

Release Notes Type: <New Feature/Enhancement/Known Issue/Bug
fix/Breaking change/Deprecated Functionality/Technology Preview>

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated


Legend

Unknown

Verified

Unsatisfied

Description of problem:

When a certificate with a passphrase is configured for the installer Service Principal and only clientCertificate is added to osServicePrincipal.json, the installer prints a message that does not make clear what is missing, then falls back to asking for the subscription ID.

$ ./openshift-install create install-config --dir ipi-test
INFO Could not get an azure authorizer from file: auth file missing client and certificate credentials 
INFO Asking user to provide authentication info   
? azure subscription id [? for help] 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-17-090603

How reproducible:

Always

Steps to Reproduce:

1. generate certs with passphrase and configure it in SP
2. update osServicePrincipal.json without adding clientCertificatePassword, since this parameter is optional. (It is actually required when the certificate has a passphrase.)
3. create install-config file

Actual results:

The message does not make clear what is missing when authentication fails.

Expected results:

The message should give more detail: the client certificate was found but authentication failed, so check that the path is correct and that clientCertificatePassword is set if the certificate has a passphrase. Something along those lines would let the user know how to correct the file.

Additional info:

 

 

 

 

This is a clone of issue OCPBUGS-10414. The following is the description of the original issue:

Description of problem:

The CoreDNS template implementation uses an incorrect regex for matching the dot (.) character.

Version-Release number of selected component (if applicable):

NA

How reproducible:

100% when you use router sharding with domains including apps

Steps to Reproduce:

1. Create an additional IngressRouter with domain names that include "apps", for example: example.test-apps.<clustername>.<clusterdomain>
2. Create and configure the external LB corresponding to the additional IngressController
3. Configure the corporate DNS server and create records for this additional IngressController that resolve to the LB IP set up in step 2.
4. Try resolving the additional domain routes from outside the cluster and from within the cluster. DNS resolution works fine from outside the cluster. Within the cluster, however, all additional domains containing "apps" in the domain name resolve to the default ingress VIP instead of the corresponding LB IPs configured on the corporate DNS server.

As a simpler alternative, you can reproduce this by running the dig command on a cluster node against the additional domain,

for example:
sh-4.4# dig test.apps-test.<clustername>.<clusterdomain>

Actual results:

DNS resolves all domains containing "apps" to the default ingress VIP. For example, example.test-apps.<clustername>.<clusterdomain> resolves to the default ingress VIP instead of its corresponding LB IP.

Expected results:

DNS should resolve it to the corresponding LB IP configured on the DNS server.

Additional info:

DNS resolution uses the Corefile templates on the node, which treat the dot (.) as a wildcard character instead of a literal dot. This is a regex configuration bug in the Corefile used on vSphere IPI clusters.
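
The effect of an unescaped dot can be reproduced with Python's `re` module. The two patterns and the domain `mycluster.example.com` are illustrative only and are not copied from the actual Corefile template.

```python
import re

# Illustrative patterns; "mycluster.example.com" is a placeholder domain.
unescaped = re.compile(r"^.*.apps.mycluster.example.com$")
escaped = re.compile(r"^.*\.apps\.mycluster\.example\.com$")

default_route = "console.apps.mycluster.example.com"
sharded_route = "example.test-apps.mycluster.example.com"

# Both patterns match names under the default ingress domain.
print(bool(unescaped.match(default_route)), bool(escaped.match(default_route)))

# In the unescaped pattern "." matches any character, including the "-" in
# "test-apps", so the sharded route is wrongly treated as a default one.
print(bool(unescaped.match(sharded_route)), bool(escaped.match(sharded_route)))
```

Escaping each dot restricts the match to names that really end in `.apps.<clustername>.<clusterdomain>`, so sharded domains fall through to the upstream DNS server.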

Description of problem:

This issue was found while verifying OCPBUGS-1321 on an Azure cluster: enP.* devices are reported.

# oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
Azure

# oc get network cluster -o jsonpath="{.spec.networkType}"
OpenShiftSDN

# token=`oc create token prometheus-k8s -n openshift-monitoring`  
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=group by(device) (node_network_info)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "device": "lo"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "enP49203s1"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "eth0"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "enP30235s1"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "enP12948s1"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "enP51324s1"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "enP21301s1"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "enP26677s1"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "vxlan_sys_4789"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      },
      {
        "metric": {
          "device": "ovs-system"
        },
        "value": [
          1666319847.258,
          "1"
        ]
      }
    ]
  }
}

According to network QE, enP.* NICs (for example, enP49203s1) are virtual NICs, so node-exporter should ignore them.

# oc debug node/**-k8wdf-master-0
Temporary namespace openshift-debug-lg44p is created for debugging node...
Starting pod/**-k8wdf-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ifconfig -a
enP49203s1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
        inet 10.0.0.6  netmask 255.255.255.0  broadcast 10.0.0.255
        inet6 fe80::20d:3aff:fe77:d5f0  prefixlen 64  scopeid 0x20<link>
        ether 00:0d:3a:77:d5:f0  txqueuelen 1000  (Ethernet)
        RX packets 10255342  bytes 7578264248 (7.0 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8300084  bytes 4603637695 (4.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
...
sh-4.4# nmcli
enP49203s1: connected to Wired Connection
        "Mellanox MT27800"
        ethernet (mlx5_core), 00:0D:3A:77:D5:F0, hw, mtu 1500
        inet4 10.0.0.6/24
        route4 10.0.0.0/24 metric 101
        route4 default via 10.0.0.1 metric 101
        route4 168.63.129.16/32 via 10.0.0.1 metric 101
        route4 169.254.169.254/32 via 10.0.0.1 metric 101
        inet6 fe80::20d:3aff:fe77:d5f0/64
        route6 fe80::/64 metric 1024 

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-10-20-104328   True        False         79m     Cluster version is 4.12.0-0.nightly-2022-10-20-104328

How reproducible:

in azure cluster

Steps to Reproduce:

1. Launch a 4.12 azure cluster.
2. Run the following PromQL query: "group by(device) (node_network_info)"

Actual results:

enP.* NICs exist in query

Expected results:

should ignore enP.* NICs from node-exporter on Azure cluster
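
The intended filtering can be illustrated in Python with the device names from the query result above. The regex and helper name are assumptions for illustration; the actual node-exporter exclusion setting is not shown in this report.

```python
import re
from typing import List

# Assumed exclusion pattern for the Azure virtual NICs described above.
EXCLUDED_DEVICES = re.compile(r"^enP.*")


def visible_devices(devices: List[str]) -> List[str]:
    """Drop devices whose names match the exclusion pattern."""
    return [device for device in devices if not EXCLUDED_DEVICES.match(device)]


# Device names taken from the node_network_info result in this report.
devices = [
    "lo", "enP49203s1", "eth0", "enP30235s1", "enP12948s1",
    "enP51324s1", "enP21301s1", "enP26677s1", "vxlan_sys_4789", "ovs-system",
]
print(visible_devices(devices))
# → ['lo', 'eth0', 'vxlan_sys_4789', 'ovs-system']
```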

Additional info:

 

Description of problem:

Grant the cluster-monitoring-view role to user pm1:

# oc adm policy add-cluster-role-to-user cluster-monitoring-view pm1

Log in to the administrator UI as user pm1 and go to the "Observe -> Targets" page. The Monitor fields blink; debugging the API shows a 403 error when listing servicemonitors for user pm1:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "servicemonitors.monitoring.coreos.com is forbidden: User \"pm1\" cannot list resource \"servicemonitors\" in API group \"monitoring.coreos.com\" at the cluster scope",
  "reason": "Forbidden",
  "details": {
    "group": "monitoring.coreos.com",
    "kind": "servicemonitors"
  },
  "code": 403
} 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-08-002816

How reproducible:

always

Steps to Reproduce:

1. cluster-monitoring-view user, go to  "Observe -> Targets" page
2.
3.

Actual results:

cluster-monitoring-view user can not list servicemonitors

Expected results:

no error

Additional info:

Not sure whether a cluster-monitoring-view user should be allowed to list servicemonitors; we can close this if the behavior is expected.

Description of problem:

Since way back in 2019, oc adm upgrade ... has been looking at ClusterVersion conditions, expecting to see Degraded. But Degraded is strictly a ClusterOperator thing. ClusterVersion fills a similar role with Failing (although it's not clear to me why folks decided against sticking with the same condition slug for the similar roles). We should pivot oc adm upgrade ... to look for the Failing that might actually exist in ClusterVersion.

Version-Release number of selected component (if applicable):

All released oc in v4.

How reproducible:

100%

Steps to Reproduce:

1. Scale down the cluster-version operator: oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
2. Patch in a failing condition:

$ CONDITIONS="$(oc get -o json clusterversion version | jq -c '[.status.conditions[] | if .type == "Failing" then .status = "True" | .message = "Seriously bad things going on." else . end]')"
$ oc patch --subresource status clusterversion version --type json -p "[{\"op\": \"add\", \"path\": \"/status/conditions\", \"value\": ${CONDITIONS}}]"

3. Check status: oc adm upgrade
4. Ask for an update: oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:b8307ac0f3ec4ac86c3f3b52846425205022da52c16f56ec31cbe428501001d6

Actual results:

Neither step 3 nor 4 mentions the Failing=True condition.

Expected results:

Both step 3 and 4 mention the Failing=True condition, and step 4 fails requesting --allow-upgrade-with-warnings if you want to update anyway.
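
The expected behaviour can be sketched in Python. This is not the oc implementation: the function names are hypothetical, and the ClusterVersion object is reduced to the conditions list.

```python
from typing import List, Optional


def find_condition(cluster_version: dict, condition_type: str) -> Optional[dict]:
    """Return the first condition of the given type, if any."""
    for condition in cluster_version.get("status", {}).get("conditions", []):
        if condition.get("type") == condition_type:
            return condition
    return None


def check_update_request(
    cluster_version: dict, allow_upgrade_with_warnings: bool = False
) -> List[str]:
    """Return warnings to print, or raise if the update must be refused."""
    # ClusterVersion reports Failing, not Degraded.
    failing = find_condition(cluster_version, "Failing")
    if failing is None or failing.get("status") != "True":
        return []
    warning = f"Failing=True: {failing.get('message', '')}"
    if not allow_upgrade_with_warnings:
        raise RuntimeError(
            f"{warning} (use --allow-upgrade-with-warnings to update anyway)"
        )
    return [warning]


cluster_version = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {"type": "Failing", "status": "True", "message": "Seriously bad things going on."},
        ]
    }
}
print(check_update_request(cluster_version, allow_upgrade_with_warnings=True))
```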

Opening as bug in etcd operator

Etcd is having issues bootstrapping because the bootstrap node by default picks an ipv4 address, which means traffic from the masters comes in from their ipv4 address, which doesn't match what etcd is expecting. See the following issue:

https://issues.redhat.com/browse/OPNET-215

This is a clone of issue OCPBUGS-10655. The following is the description of the original issue:

Description of problem:
The dev console shows a list of samples. The user can create a sample based on a git repository. But some of these samples don't include a git repository reference and cannot be created.

Version-Release number of selected component (if applicable):
Tested different frontend versions against a 4.11 cluster and all (oldest tested frontend was 4.8) show the sample without git repository.

But the result also depends on the installed samples operator and installed ImageStreams.

How reproducible:
Always

Steps to Reproduce:

  1. Switch to the Developer perspective
  2. Navigate to Add > All Samples
  3. Search for Jboss
  4. Click on "JBoss EAP XP 4.0 with OpenJDK 11" (for example)

Actual results:
The git repository is not filled and the create button is disabled.

Expected results:
Samples without git repositories should not be displayed in the list.

Additional info:
The Git repository is saved as "sampleRepo" in the ImageStream tag section.
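
A filter along these lines could hide such samples. The Python sketch below assumes that "sampleRepo" is stored under each tag's annotations; the ImageStream shown is a made-up example, and the function names are hypothetical.

```python
from typing import List


def has_sample_repo(tag: dict) -> bool:
    """True when the tag carries a non-empty sampleRepo annotation."""
    annotations = tag.get("annotations") or {}
    return bool(annotations.get("sampleRepo"))


def listable_sample_tags(image_stream: dict) -> List[str]:
    """Names of tags that can be offered as samples."""
    tags = image_stream.get("spec", {}).get("tags", [])
    return [tag["name"] for tag in tags if has_sample_repo(tag)]


# Hypothetical ImageStream with one usable and one unusable tag.
image_stream = {
    "spec": {
        "tags": [
            {"name": "with-repo", "annotations": {"sampleRepo": "https://example.com/sample.git"}},
            {"name": "without-repo", "annotations": {}},
        ]
    }
}
print(listable_sample_tags(image_stream))
```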

This is a clone of issue OCPBUGS-11719. The following is the description of the original issue:

Description of problem:

According to the slack thread attached: Cluster uninstallation is stuck when load balancers are removed before ingress controllers. This can happen when the ingress controller removal fails and the control plane operator moves on to deleting load balancers without waiting.

Code ref https://github.com/openshift/hypershift/blob/248cea4daef9d8481c367f9ce5a5e0436e0e028a/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1505-L1520

Version-Release number of selected component (if applicable):

4.12.z 4.13.z

How reproducible:

Whenever the load balancer is deleted before the ingress controller

Expected results:

Load balancer deletion waits for the ingress controller deletion

Additional info:

Slack: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1681310121904539?thread_ts=1681216434.676009&cid=C04EUL1DRHC

The vSphere CSI cloud.conf lists the single datacenter from platform workspace config but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918 ) there may be more than the one datacenter.

This issue results in PVs failing to attach because the virtual machines can't be found in any other datacenter. For example:

0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found  

The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.

$ oc get cm vsphere-csi-config -o yaml  -n openshift-cluster-csi-drivers | grep datacenters
    datacenters = "IBMCloud" 

 

While starting a PipelineRun from the UI and providing values in the "Start Pipeline" dialog, an IBM Power customer (Deepak Shetty from IBM) created credentials under "Advanced options" with the "Image Registry Credentials" authentication type. When the customer checked the credentials in the Secrets tab (under Workloads), the secret was found in a broken state. A screenshot of the broken secret is attached.

The issue has been observed on OCP4.8, OCP4.9 and OCP4.10.

Description of problem: 

Version-Release number of selected component (if applicable): 4.10.16

How reproducible: Always

Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field

$ oc get apiserver cluster -o yaml
spec:
  audit:
    customRules:
    - group: system:authenticated:oauth
      profile: AllRequestBodies
    - group: system:authenticated
      profile: AllRequestBodies
    profile: Default

2. Allow the kube-apiserver pods to roll out a new revision.
3. Once the kube-apiserver pods are on the new revision, run $ oc get dc

Actual results:

Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)

Expected results: The command "oc get dc" should display the deploymentconfig without any error.

Description of the problem:

I have tried to deploy a spoke cluster via Assisted Service (Infra Operator) in a disconnected environment, referencing a public ocp image to install, but the assisted service pod does not recognize the icsp and fails to pull the image with:

 

time="2022-11-22T22:04:05Z" level=error msg="command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/4-dev-preview@sha256:20e270c3349fe2fcb38fd0da155329babc02d6b53e7e06ff235346c3c1cf11b5 --registry-config=/tmp/registry-config1100145791' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/4-dev-preview@sha256:20e270c3349fe2fcb38fd0da155329babc02d6b53e7e06ff235346c3c1cf11b5: Get \"https://registry.ci.openshift.org/v2/\": dial tcp 52.71.180.176:443: connect: network is unreachable\n" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).addReleaseImage" file="/remote-source/assisted-service/app/internal/controller/controllers/5.go:1347" agent_cluster_install=test1-0 agent_cluster_install_namespace=test1-0 cluster_deployment=test1-0 cluster_deployment_namespace=test1-0 go-id=747 request_id=3b7f21c5-3727-4a84-86d8-26fe27a86d67
 

I have tried both setting the ICSP in the AgentClusterInstall as an install-config override and just relying on the mirror configmap, but I get the same error every time.

This issue came up while attempting to test epic: https://issues.redhat.com/browse/MGMT-10209

 

Version: 

I am using MCE version registry-proxy.engineering.redhat.com/rh-osbs/multicluster-engine-mce-operator-bundle:v2.2.0-198

Which correlates to assisted service git commit 334ca79c46222bcb67e766ca03d77e59e4f098af (Contains the changes for https://issues.redhat.com/browse/MGMT-10209)

 

How reproducible:

100%

Steps to reproduce:

1. Deploy disconnected hub cluster, MCE + Assisted Service

2. Add configmap containing mirror registries.conf

3. Deploy a spoke cluster 

4. Optionally add an install-config override icsp

 

Actual results:

time="2022-11-22T22:04:05Z" level=error msg="command 'oc adm release info -o template --template '.metadata.version' --insecure=false registry.ci.openshift.org/ocp/4-dev-preview@sha256:20e270c3349fe2fcb38fd0da155329babc02d6b53e7e06ff235346c3c1cf11b5 --registry-config=/tmp/registry-config1100145791' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/4-dev-preview@sha256:20e270c3349fe2fcb38fd0da155329babc02d6b53e7e06ff235346c3c1cf11b5: Get \"https://registry.ci.openshift.org/v2/\": dial tcp 52.71.180.176:443: connect: network is unreachable\n" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).addReleaseImage" file="/remote-source/assisted-service/app/internal/controller/controllers/5.go:1347" agent_cluster_install=test1-0 agent_cluster_install_namespace=test1-0 cluster_deployment=test1-0 cluster_deployment_namespace=test1-0 go-id=747 request_id=3b7f21c5-3727-4a84-86d8-26fe27a86d67

Expected results:

I expect the internal oc client to use either the mirror configmap or the install-config ICSP override to properly mirror images.

Description of problem:

AD Graph API is being deprecated and replaced by MSGraph API.

Additional info:

https://issues.redhat.com/browse/CORS-1897

Description of problem:

In 4.13, when editing an application the added pipeline name is not displayed to the users

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Create an application, make sure to check the "Add Pipeline" box
2. Edit the application

Actual results:

The pipeline name is not displayed

Expected results:

The pipeline name should be displayed with its visualization

Additional info:

The pipeline name and visualization are displayed on 4.12

Description of the problem:

In staging (BE v2.11.3), our CI hit the following error, "failed to update cluster":
"ERROR: deadlock detected (SQLSTATE 40P01); ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02); ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02); ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)"
From the service logs:

time="2022-11-13T12:13:04Z" level=error msg="failed to get cluster: 592ee588-84c4-4095-9c83-06a61e8cb5c5" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).v2UpdateClusterInternal" file="/assisted-service/internal/bminventory/inventory.go:1753" cluster_id=592ee588-84c4-4095-9c83-06a61e8cb5c5 error="Failed to get cluster 592ee588-84c4-4095-9c83-06a61e8cb5c5: ERROR: deadlock detected (SQLSTATE 40P01); ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02); ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02); ERROR: current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)" go-id=131892 pkg=Inventory request_id=546c0b45-dd54-400d-8fa0-4bce584f3b91
time="2022-11-13T12:13:04Z" level=error msg="update cluster failed" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).v2UpdateClusterInternal.func1" file="/assisted-service/internal/bminventory/inventory.go:1735" cluster_id=592ee588-84c4-4095-9c83-06a61e8cb5c5 go-id=131892 pkg=Inventory request_id=546c0b45-dd54-400d-8fa0-4bce584f3b91

How reproducible:

This is the first time I have encountered this

Steps to reproduce:

1. Not sure yet

Assisted-service can use only one mirror of the release image. In the install-config, the user may specify multiple matching mirrors. Currently the last matching mirror is the one used by assisted-service. This is confusing; we should use the first matching one instead.
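The proposed "first match wins" rule can be sketched as follows. This is a minimal Python illustration, not assisted-service's actual code. It assumes mirror entries shaped like install-config imageContentSources items (a source plus a list of mirrors) and uses simple prefix matching as a simplification; the function name and the registry names in the usage example are hypothetical.

```python
def pick_mirror(release_image, mirror_sets):
    """Return the release image rewritten to the first matching mirror, or None."""
    for entry in mirror_sets:  # entries are checked in the order the user wrote them
        source = entry["source"]
        matches = release_image == source or release_image.startswith(
            (source + "@", source + ":", source + "/")
        )
        if matches and entry["mirrors"]:
            # First matching entry and first mirror in that entry win.
            return entry["mirrors"][0] + release_image[len(source):]
    return None


# Hypothetical registries for illustration only.
mirror_sets = [
    {"source": "quay.io/openshift-release-dev/ocp-release",
     "mirrors": ["mirror-a.example.com/ocp/release", "mirror-b.example.com/ocp/release"]},
    {"source": "quay.io/openshift-release-dev",
     "mirrors": ["mirror-c.example.com/ocp"]},
]
image = "quay.io/openshift-release-dev/ocp-release@sha256:abc"
print(pick_mirror(image, mirror_sets))
```

Returning on the first hit makes the result depend only on the order the user chose, which is the less surprising behavior described above.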

Description of problem: https://github.com/openshift/console/blob/fe41d25a0fd43640e8c0a4276d2ad246e989067c/frontend/public/components/container.tsx#L247 results in `label with exit code exitCode` rendered in the browser.

Steps to Reproduce:
1. Create a crash-looping pod by creating the default `example` pod using the `Create pod` button at the top of the Pods list page
2. Click the `httpd` link in the `Containers` section of the `example` pod details page
3. Note that the status badge near the page title can include `label with exit code exitCode`

Description of the problem:

In staging (BE 2.13.4), a 4.12 dual-stack SNO cluster fails to complete installation with the following message:

Host master-0-0: updated status from installing-in-progress to installing-pending-user-action (Host timed out when pulling ignition. Check the host console and verify that it boots from the OpenShift installation disk (sda, /dev/disk/by-id/wwn-0x05abcdd47f51e1c9) and has network access to the cluster API. The installation will resume once the host successfully boots and can access the cluster API) 

How reproducible:

100%

Steps to reproduce:

1. Create SNO 4.12 dualstack cluster

Description of the problem:

If the non-overlapping-subnets validation is pending, the message is "Missing inventory, or missing cluster."  This message is confusing.  The message should either be "inventory not yet received" or "host is not bound to a cluster" depending on the case.

How reproducible:

100%

Steps to reproduce:

1. Boot a host that is not bound to a cluster

2. Look at the validation

Actual results:

Validation message is vague - "Missing inventory, or missing cluster"

Expected results:

Validation message is clear - either the inventory has not been received, or the host has not yet been bound to a cluster.
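The requested behavior amounts to choosing the message from the actual cause. The following is a minimal Python sketch under stated assumptions: the function name and its two parameters are hypothetical, the order of the checks is a guess, and only the two message strings come from the description above.

```python
def pending_message(has_inventory, cluster_id):
    """Pick the pending-validation message that matches the actual cause."""
    if not has_inventory:
        return "inventory not yet received"
    if cluster_id is None:
        return "host is not bound to a cluster"
    return None  # nothing is missing, so the validation can be evaluated


print(pending_message(True, None))
```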

Description of problem:

When running a cluster on application credentials, this event appears repeatedly:

ns/openshift-machine-api machineset/nhydri0d-f8dcc-kzcwf-worker-0 hmsg/173228e527 - pathological/true reason/ReconcileError could not find information for "ci.m1.xlarge"

Version-Release number of selected component (if applicable):

 

How reproducible:

Happens in the CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/33330/rehearse-33330-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.13-e2e-openstack-ovn-serial/1633149670878351360).

Steps to Reproduce:

1. On a live cluster, rotate the OpenStack cloud credentials
2. Invalidate the previous credentials
3. Watch the machine-api events (`oc -n openshift-machine-api get event`). A `Warning` event with the message `could not find information for "name-of-the-flavour"` will appear.

If the cluster was installed using a password that you can't invalidate:
1. Rotate the cloud credentials to application credentials
2. Restart MAPO (`oc -n openshift-machine-api get pods -o NAME | xargs -r oc -n openshift-machine-api delete`)
3. Rotate cloud credentials again
4. Revoke the first application credentials you set
5. Finally watch the events (`oc -n openshift-machine-api get event`)

The event signals that MAPO wasn't able to update flavour information on the MachineSet status.

Actual results:

 

Expected results:

No issue detecting the flavour details

Additional info:

Offending code likely around this line: https://github.com/openshift/machine-api-provider-openstack/blob/bcb08a7835c08d20606d75757228fd03fbb20dab/pkg/machineset/controller.go#L116

TRT-594 investigates failed CI upgrade runs caused by the KubePodNotReady alert firing. In the case investigated, a pod was skipped for scheduling across two successive master node updates/restarts. The case was determined to be valid, so the ask is to make monitoring aware that master nodes are restarting and that scheduling may be delayed. Presuming we don't want to change the existing tolerance for cases that don't involve master node restarts, could we suppress the alert during those restarts and fall back to a second alert with increased tolerances, only during those restarts, if we have metrics indicating that a restart is in progress? Or something similar, if there are better ways to handle this.

The scenario is:

  • A master node (1) is out of service during upgrade
  • A pod (A) is created but can not be scheduled due to anti-affinity rules as the other nodes already host a pod of that definition
  • A second pod (B) from the same definition is created after the first
  • Pod (A) attempts scheduling but fails as the master (1) node is still updating
  • Master (1) node completes updating
  • Pod (B) attempts scheduling and succeeds
  • Next Master (2) node begins updating
  • Pod (A) can not be scheduled on the next attempt(s) as the active master nodes already have pods placed and the next master (2) node is unavailable
  • Master (2) node completes updating
  • Pod (A) is scheduled

Description of problem:

When nodeip-configuration fails, it spams syslog with log messages, which hinders debugging.

Environment: libvirt IPI OVN bonding active-backup fail_over_mac=0.

When ovs-configuration fails or we somehow end up with no IPs, nodeip-configuration constantly spams syslog.


Version-Release number of selected component (if applicable):


4.12.0-0.nightly-2022-10-25-121937

How reproducible:

Always

Steps to Reproduce:

1. Somehow break the network, link down all links or disable DHCP servers.
2. System has no IP addresses.
3. journalctl -f

Actual results:

Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered address fe80::1825:a3ff:fe21:2caf/64"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="retrieved Address map map[0xc00039b7a0:[127.0.0.1/8 lo ::1/128] 0xc00039b8c0:[192.0.2.42/24 enp4s0]]"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Checking whether address 192.0.2.42/24 enp4s0 contains VIP 192.168.123.5"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Checking whether address 127.0.0.1/8 lo contains VIP 192.168.123.5"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered route {Ifindex: 2 Dst: 192.0.2.0/24 Src: 192.0.2.42 Gw: <nil> Flags: [] Table: 254}"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered route {Ifindex: 1 Dst: ::1/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered route {Ifindex: 7 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254}"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Retrieved route map map[]"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered address fe80::1825:a3ff:fe21:2caf/64"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="retrieved Address map map[0xc000266480:[127.0.0.1/8 lo ::1/128] 0xc0002665a0:[192.0.2.42/24 enp4s0]]"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered route {Ifindex: 2 Dst: 192.0.2.0/24 Src: 192.0.2.42 Gw: <nil> Flags: [] Table: 254}"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered route {Ifindex: 1 Dst: ::1/128 Src: <nil> Gw: <nil> Flags: [] Table: 254}"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Ignoring filtered route {Ifindex: 7 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254}"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=debug msg="Retrieved route map map[]"
Nov 02 20:49:49 master-0-2 bash[2059]: time="2022-11-02T20:49:49Z" level=error msg="Failed to find a suitable node IP"


journalctl | grep -c -F 'level=error msg="Failed to find a suitable node IP"'
32278

Expected results:


Exponential back-off?  Some other kind of failure mode that doesn't fill journald.
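For illustration, a capped exponential back-off schedule could look like the sketch below. This is a generic Python example, not the nodeip-configuration implementation; the base delay, growth factor, and cap are assumed values chosen only to show the shape of the schedule.

```python
import itertools


def backoff_delays(base=1.0, factor=2.0, cap=300.0):
    """Yield retry delays in seconds: base, base*factor, ... never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First ten delays: [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0]
print(list(itertools.islice(backoff_delays(), 10)))
```

With these assumed values, a service that keeps failing settles at one attempt every five minutes, and therefore one error line per attempt, instead of tens of thousands of journal entries.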

Description of problem:

Pulling in the changes that add the openshift/nodes/realtime test suite for 4.13 to allow for realtime testing. 

Version-Release number of selected component (if applicable):

 

How reproducible:

Very

Steps to Reproduce:

1. Run the openshift-test binary with the openshift/nodes/realtime suite specified
2.
3.

Actual results:

error: suite "openshift/nodes/realtime" does not exist

Expected results:

The test suite runs

Additional info:

 

Description of the problem:
While installing SNO on 4.12 with LVMS, the LVMO operator version shown when running

oc get csv -n openshift-storage

is 4.11.4:

odf-lvm-operator.v4.11.4 4.11.4

It should be 4.12.

How reproducible:

 

Steps to reproduce:

1. Create an SNO cluster with OCP version 4.12

2. Select the scnv and lvmo operators

3.

Actual results:
lvms operator is 4.11.4
 

Expected results:
lvms operator should be 4.12.0

Description of problem: As discovered in https://issues.redhat.com/browse/OCPBUGS-2795, gophercloud fails to list swift containers when the endpoint speaks HTTP2. This means that CIRO will provision a 100GB cinder volume even though swift is available to the tenant.

For example, we're seeing this behavior in our CI on vexxhost.

The gophercloud commit that fixed this issue is https://github.com/gophercloud/gophercloud/commit/b7d5b2cdd7ffc13e79d924f61571b0e5f74ec91c, specifically the `|| ct == ""` part on line 75 of openstack/objectstorage/v1/containers/results.go. This commit made it into gophercloud v0.18.0.

CIRO still depends on gophercloud v0.17.0. We should bump gophercloud to fix the bug.

Version-Release number of selected component (if applicable):

All versions. Fix should go to 4.8 - 4.12.

How reproducible:

Always, when the object storage omits the 'content-type' header. This might happen with responses bearing a HTTP status code 204, when Swift is exposed behind a reverse proxy that truncates 'content-type' headers for that specific response code. See https://trac.nginx.org/nginx/ticket/2109#no1 
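The effect of the fix can be illustrated with a small sketch of the content-type check. This is Python, not gophercloud's Go code. The function name is hypothetical, and the only detail taken from the description above is that an empty content type is handled the same way as text/plain.

```python
import json


def parse_container_names(content_type, body):
    """Extract container names from a listing response based on its Content-Type."""
    ct = (content_type or "").strip().lower()
    text = body or ""
    if ct.startswith("application/json"):
        return [item["name"] for item in json.loads(text or "[]")]
    # A missing header (for example on a 204 response behind a reverse proxy)
    # is treated like text/plain instead of being rejected.
    if ct.startswith("text/plain") or ct == "":
        return [line for line in text.splitlines() if line]
    raise ValueError(f"cannot extract names from content-type {ct!r}")


print(parse_container_names("", ""))
```

Without the empty-string case, a 204 response with no content-type header would fall through to the error branch, which matches the failure described above.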

This is a clone of issue OCPBUGS-12904. The following is the description of the original issue:

Description of problem:

In order to test proxy installations, the CI base image for OpenShift on OpenStack needs netcat.

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/419

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:
The pipelines -> repositories list view in Dev Console does not show the running pipeline as the last pipelinerun in the table.

Original BugZilla Link: https://bugzilla.redhat.com/show_bug.cgi?id=2016006
OCPBUGSM: https://issues.redhat.com/browse/OCPBUGSM-36408

Description of problem:

The Machine Config Operator (MCO) makes use of the /etc/os-release and /usr/lib/os-release files to determine the underlying node OS so that it is possible to do branching based upon a different OS version. The files are read using github.com/ashcrow/osrelease and then the ID, VARIANT_ID, and VERSION_ID fields are thinly wrapped with some helper functions.

The helper functions appear to infer the RHEL version from the VERSION_ID field, based upon their names. For example, there is a function called IsEL9(), which checks if the VERSION_ID field is equal to 9. Furthermore, the unit tests for the helper functions assume that the VERSION_ID field is populated with the value of the RHEL_VERSION field, not the OpenShift version. However, in practice, the VERSION_ID field appears to have the OpenShift version in it, which breaks that assumption.

For example, the /etc/os-release and /usr/lib/os-release files contain the following information for an OpenShift 4.12 CI build:

NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202301311551-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202301311551-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202301311551-0"

Notice that the VERSION_ID contains the OCP version; not the RHEL version.

 

How reproducible:

Always

Steps to Reproduce:

  1. Launch a new cluster running on RHCOS 9 (Run clusterbot launch against PR: https://github.com/openshift/machine-config-operator/pull/3485)
  2. Get the /etc/os-release file content from a random node:
     $ oc debug "node/$(oc get nodes -o=jsonpath='{.items[0].metadata.name}')" -- cat /host/etc/os-release
  3. Use the Go code at https://gist.github.com/cheesesashimi/89184074cd2fe066232c512db4969015, to read the contents, modifying it to include the contents of the RHCOS9 /etc/os-release file retrieved in Step 2.

 

Actual results:

NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="413.90.202212151724-0"
VERSION_ID="4.13"
VARIANT="CoreOS"
VARIANT_ID=coreos
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 413.90.202212151724-0 (Plow)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.13/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.13"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.13"
OPENSHIFT_VERSION="4.13"
RHEL_VERSION="9.0"
OSTREE_VERSION="413.90.202212151724-0"

(daemon.OperatingSystem) {
 ID: (string) (len=5) "rhcos",
 VariantID: (string) (len=6) "coreos",
 VersionID: (string) (len=4) "4.13"
}

IsEL(): true
IsEL9(): false
IsFCOS(): false
IsSCOS(): false
IsCoreOSVariant(): true
IsLikeTraditionalRHEL7(): false
ToPrometheusLabel(): RHCOS

Expected results:

Given the above input, I would have expected the code provided in the Gist above to produce output similar to this:

(daemon.OperatingSystem) {
  ID: (string) (len=5) "rhcos",
  VariantID: (string) (len=6) "coreos",
  VersionID: (string) (len=4) "9.0"
}

IsEL(): true 
IsEL9(): true 
IsFCOS(): false 
IsSCOS(): false 
IsCoreOSVariant(): true 
IsLikeTraditionalRHEL7(): false 
ToPrometheusLabel(): RHCOS

 

Additional info:

  • We most likely need to adjust the OperatingSystem code to look for the RHEL_VERSION, where available. However, I would like someone from the CoreOS team to review the assumptions this makes.
  • We should write an MCO e2e test that verifies this against a live node so that we're informed if anything were to change in the form of a failed test.
  • We'll also need to account for FCOS and SCOS cases as well.
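The first point can be sketched as follows. This is a minimal Python illustration of preferring RHEL_VERSION where available, not the MCO's Go code. The function names are hypothetical, the fallback to VERSION_ID when RHEL_VERSION is absent is an assumption, and the FCOS and SCOS cases are not handled.

```python
import shlex


def parse_os_release(text):
    """Parse os-release content into a dict, stripping shell-style quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)
        fields[key] = parts[0] if parts else ""
    return fields


def el_major_version(fields):
    """Prefer RHEL_VERSION; fall back to VERSION_ID (assumption) when it is absent."""
    version = fields.get("RHEL_VERSION") or fields.get("VERSION_ID", "")
    return version.split(".")[0]


def is_el9(fields):
    return el_major_version(fields) == "9"


# Excerpt of the RHCOS 9 os-release content shown under "Actual results".
rhcos9 = 'ID="rhcos"\nVERSION_ID="4.13"\nVARIANT_ID=coreos\nRHEL_VERSION="9.0"\n'
print(is_el9(parse_os_release(rhcos9)))
```

Because the check reads RHEL_VERSION first, the OpenShift version stored in VERSION_ID no longer influences the result on RHCOS.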

This is a clone of issue OCPBUGS-11142. The following is the description of the original issue:

Description of problem:

With the recent update in the logic for considering a CPMS replica Ready only when both the backing Machine is running and the backing Node is Ready: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/171, we now need to watch nodes at all times to detect nodes transitioning in readiness.

The majority of occurrences of this issue have been fixed with: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/177 (https://issues.redhat.com//browse/OCPBUGS-10032) but we also need to watch the control plane nodes at steady state (when they are already Ready), to notice if they go UnReady at any point, as relying on control plane machine events is not enough (they might be Running, while the Node has transitioned to NotReady).
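The readiness rule described above can be sketched as a small predicate. This is a Python illustration, not the operator's Go code. It assumes the Machine phase is available as a string and that the Node conditions are a list of dictionaries with type and status keys, mirroring the Kubernetes Node status; the function name is hypothetical.

```python
def replica_ready(machine_phase, node_conditions):
    """A replica counts as Ready only if the Machine is Running and the Node is Ready."""
    node_ready = any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in (node_conditions or [])
    )
    return machine_phase == "Running" and node_ready


# The Machine is still Running, but the Node has gone NotReady.
print(replica_ready("Running", [{"type": "Ready", "status": "False"}]))
```

The second input changes independently of the first, which is why Machine events alone are not enough and the Nodes have to be watched at steady state as well.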

Version-Release number of selected component (if applicable):

4.13, 4.14

How reproducible:

Always

Description of the problem:

Automation CI runs fail on multiple test cases; it looks like podman froze and restart actions fail.

 

See:

https://redhat-internal.slack.com/archives/C016EF00KS5/p1674383764454839

 

How reproducible:

Always on 4.12; does not reproduce on 4.11

Steps to reproduce:

  1. Run CI test test_ntp_clock_out_of_sync_validation
  2. During the test we lose connection to the host because the agent is down after a service restart: sudo systemctl restart agent.service

Actual results:

The VM and podman freeze for a minute or two

Expected results:

The agent restarts without any issue or error

 

 

I have managed to reproduce it step by step with CI tests. The automation flow:

  • take a discovered node (master) and delete it from the cluster
  • restart the agent on the node that was deleted from the cluster
  • the issue appears

Here are the VM logs before it happens (on the master node):

http://pastebin.test.redhat.com/1089027

This is a clone of issue OCPBUGS-11083. The following is the description of the original issue:

Description of problem:

The performance profile update test lane is flaky due to:

1. Race conditions in tuned status checks:

After enabling realtime and high power consumption under workload hints in the performance profile, the test fails because it cannot find the stalld pid:
msg: "failed to run command [pidof stalld]: output \"\"; error \"\"; command terminated with exit code 1",

2. mishandled test skips when hardware is not sufficient for a test

3. Unnecessary waits for mcp status changes in performance profile
[BeforeEach] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go:370
[It] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go:432
[FAILED] Timed out after 2400.001s.
Failed to find condition status by MCP "worker-test"
Expected
<v1.ConditionStatus>: False
to equal
<v1.ConditionStatus>: True
In [BeforeEach] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go:429 @ 04/03/23 04:55:54.23
 

Version-Release number of selected component (if applicable):

Openshift 4.14, 4.13

How reproducible:

Often (Flaky test)

See this comment for some updated information

Description of problem:
During IPI installation on IBM Cloud (x86_64), some of the worker machines have been seen to have no network connectivity during their initial bootup. Investigations were performed with IBM Cloud VPC to attempt to identify the issue, but by all appearances, virtualization is working.

Unfortunately, because of this issue there is no network traffic and no access to these worker machines (Ignition is stuck without network traffic), so no SSH or console login is available to collect logs or perform any testing on these machines.

The only content available is the console output, showing ignition is stuck due to the network issue.

Version-Release number of selected component (if applicable):
4.12.0

How reproducible:
About 60%

Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud
2. Wait for the worker machines to be provisioned, causing IPI to fail waiting on machine-api operator
3. Check console of worker machines failing to report in to cluster (in this case 2 of 3 failed)

Actual results:
IPI creation failed waiting on machine-api operator to complete all worker node deployment

Expected results:
Successful IPI creation on IBM Cloud

Additional info:
As stated, investigation was performed by IBM Cloud VPC, but no further investigation could be performed since no access to these worker machines is available. Any further details that could be provided to help identify the issue would be helpful.

This appears to have become more prominent recently as well, causing concern for IBM Cloud's IPI GA support on the 4.12 release.

The only solution to restore network connectivity is rebooting the machine, which loses ignition bring up (I assume it must be triggered manually now), and in the case of IPI, isn't a great mitigation.

Description of problem:

Events.Events: event view displays created pod
https://search.ci.openshift.org/?search=event+view+displays+created+pod&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Run the event scenario tests and note the results below:

Actual results:

{Expected '' to equal 'test-vjxfx-event-test-pod'. toEqual Error: Failed expectation
    at /go/src/github.com/openshift/console/frontend/integration-tests/tests/event.scenario.ts:65:72
    at Generator.next (<anonymous>:null:null)
    at fulfilled (/go/src/github.com/openshift/console/frontend/integration-tests/tests/event.scenario.ts:5:58)
    at runMicrotasks (<anonymous>:null:null)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
   }

Description of problem:

Deploying an IPI Azure cluster fails when the region is set to westus3 and the VM type to NV8as_v4. The master node shows as running in the Azure portal, but SSH login is not possible. The serial log shows the following error:

[ 3009.547219] amdgpu d1ef:00:00.0: amdgpu: failed to write reg:de0
[ 3011.982399] mlx5_core 6637:00:02.0 enP26167s1: TX timeout detected
[ 3011.987010] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 0, SQ: 0x170, CQ: 0x84d, SQ Cons: 0x823 SQ Prod: 0x840, usecs since last trans: 2418884000
[ 3011.996946] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 1, SQ: 0x175, CQ: 0x852, SQ Cons: 0x248c SQ Prod: 0x24a7, usecs since last trans: 2148366000
[ 3012.006980] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 2, SQ: 0x17a, CQ: 0x857, SQ Cons: 0x44a1 SQ Prod: 0x44c0, usecs since last trans: 2055000000
[ 3012.016936] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 3, SQ: 0x17f, CQ: 0x85c, SQ Cons: 0x405f SQ Prod: 0x4081, usecs since last trans: 1913890000
[ 3012.026954] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 4, SQ: 0x184, CQ: 0x861, SQ Cons: 0x39f2 SQ Prod: 0x3a11, usecs since last trans: 2020978000
[ 3012.037208] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 5, SQ: 0x189, CQ: 0x866, SQ Cons: 0x1784 SQ Prod: 0x17a6, usecs since last trans: 2185513000
[ 3012.047178] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 6, SQ: 0x18e, CQ: 0x86b, SQ Cons: 0x4c96 SQ Prod: 0x4cb3, usecs since last trans: 2124353000
[ 3012.056893] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 7, SQ: 0x193, CQ: 0x870, SQ Cons: 0x3bec SQ Prod: 0x3c0f, usecs since last trans: 1855857000
[ 3021.535888] amdgpu d1ef:00:00.0: amdgpu: failed to write reg:e15
[ 3021.545955] BUG: unable to handle kernel paging request at ffffb57b90159000
[ 3021.550864] PGD 100145067 P4D 100145067 PUD 100146067 PMD 0 

According to the Azure documentation at https://learn.microsoft.com/en-us/azure/virtual-machines/nvv4-series , the NVv4 series appears to support only Windows VMs.

 

Version-Release number of selected component (if applicable):

4.12 nightly build

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml, set region as westus3 and VM type as NV8as_v4.
2. Install the cluster.

Actual results:

installation failed

Expected results:

If the NVv4 series is not supported for Linux VMs, the installer should validate the instance type and report that such a size is not supported.
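The kind of validation asked for above could look roughly like the following sketch. It is not the installer's code: the deny-list `UNSUPPORTED_LINUX_FAMILIES`, the helper names, and the way a family label is derived from the size name are all assumptions made for illustration. A real check would more likely query Azure for the capabilities of the requested size than hard-code families.

```python
import re

# Hypothetical deny-list based only on the report above (NVv4 appears to be
# Windows-only). Not taken from the installer.
UNSUPPORTED_LINUX_FAMILIES = ("NVv4",)

# Matches simple size names such as "NV8as_v4" or "D8s_v3".
_SIZE_RE = re.compile(r"^([A-Z]+)\d+[a-z]*(?:_(v\d+))?$")


def vm_family(vm_size):
    """Derive a coarse family label, e.g. 'NV8as_v4' -> 'NVv4'."""
    prefix = "Standard_"
    name = vm_size[len(prefix):] if vm_size.startswith(prefix) else vm_size
    match = _SIZE_RE.match(name)
    if not match:
        # Unknown naming scheme: fall back to the raw name.
        return name
    return match.group(1) + (match.group(2) or "")


def validate_vm_size(vm_size):
    """Return a list of validation errors (empty when the size is accepted)."""
    family = vm_family(vm_size)
    if family in UNSUPPORTED_LINUX_FAMILIES:
        return [
            "instance type %s (family %s) is not supported for Linux VMs"
            % (vm_size, family)
        ]
    return []
```

With such a check in place, `validate_vm_size("NV8as_v4")` would return one error before any infrastructure is created, while `validate_vm_size("Standard_D8s_v3")` would return an empty list.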

Additional info:

 

 

 

 

 

Currently we get the following message when the IsConnected check fails during
installation:

 Host failed to install due to timeout while connecting to host

The following error is generated in the following transitions:
1. installing --> if IsConnected failed
2. preparing to install + preparing successfully --> if IsConnected failed
3. installing + installing in progress --> if HostNotResponsiveWhileInstallation condition is failing and we are not in rebooting stage or beyond

While improving the message, please also verify that in stage 1 we are not reporting the disconnection after reboot and that there are no race conditions with reboot which may lead to spurious errors.

Host state dashboard can assist in finding actual cases where this error happens
 

 

 

Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/510

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On Azure, after deleting a master, the old machine is stuck in Deleting and some pods in the cluster are in ImagePullBackOff. The Azure console shows that the new master was not added to the load balancer backend, which seems to leave the machine without an internet connection.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-12-024338

How reproducible:

Always

Steps to Reproduce:

1. Set up a cluster on Azure, networkType ovn
2. Delete a master
3. Check master and pod

Actual results:

Old machine stuck in Deleting,  some pods are in ImagePullBackOff.
 $ oc get machine    
NAME                                    PHASE      TYPE              REGION   ZONE   AGE
zhsunaz2132-5ctmh-master-0              Deleting   Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-1              Running    Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-2              Running    Standard_D8s_v3   westus          160m
zhsunaz2132-5ctmh-master-flqqr-0        Running    Standard_D8s_v3   westus          105m
zhsunaz2132-5ctmh-worker-westus-dhwfz   Running    Standard_D4s_v3   westus          152m
zhsunaz2132-5ctmh-worker-westus-dw895   Running    Standard_D4s_v3   westus          152m
zhsunaz2132-5ctmh-worker-westus-xlsgm   Running    Standard_D4s_v3   westus          152m

$ oc describe machine zhsunaz2132-5ctmh-master-flqqr-0  -n openshift-machine-api |grep -i "Load Balancer"
      Internal Load Balancer:  zhsunaz2132-5ctmh-internal
      Public Load Balancer:      zhsunaz2132-5ctmh

$ oc get node            
NAME                                    STATUS     ROLES                  AGE    VERSION
zhsunaz2132-5ctmh-master-0              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-1              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-2              Ready      control-plane,master   165m   v1.26.0+149fe52
zhsunaz2132-5ctmh-master-flqqr-0        NotReady   control-plane,master   109m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dhwfz   Ready      worker                 152m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-dw895   Ready      worker                 152m   v1.26.0+149fe52
zhsunaz2132-5ctmh-worker-westus-xlsgm   Ready      worker                 152m   v1.26.0+149fe52
$ oc describe node zhsunaz2132-5ctmh-master-flqqr-0
  Warning  ErrorReconcilingNode       3m5s (x181 over 108m)  controlplane         [k8s.ovn.org/node-chassis-id annotation not found for node zhsunaz2132-5ctmh-master-flqqr-0, macAddress annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0" , k8s.ovn.org/l3-gateway-config annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0"]

$ oc get po --all-namespaces | grep ImagePullBackOf   
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-l8ng4                                  0/3     Init:ImagePullBackOff   0              113m
openshift-cluster-csi-drivers                      azure-file-csi-driver-node-99k82                                  0/3     Init:ImagePullBackOff   0              113m
openshift-cluster-node-tuning-operator             tuned-bvvh7                                                       0/1     ImagePullBackOff        0              113m
openshift-dns                                      node-resolver-2p4zq                                               0/1     ImagePullBackOff        0              113m
openshift-image-registry                           node-ca-vxv87                                                     0/1     ImagePullBackOff        0              113m
openshift-machine-config-operator                  machine-config-daemon-crt5w                                       1/2     ImagePullBackOff        0              113m
openshift-monitoring                               node-exporter-mmjsm                                               0/2     Init:ImagePullBackOff   0              113m
openshift-multus                                   multus-4cg87                                                      0/1     ImagePullBackOff        0              113m
openshift-multus                                   multus-additional-cni-plugins-mc6vx                               0/1     Init:ImagePullBackOff   0              113m
openshift-ovn-kubernetes                           ovnkube-master-qjjsv                                              0/6     ImagePullBackOff        0              113m
openshift-ovn-kubernetes                           ovnkube-node-k8w6j                                                0/6     ImagePullBackOff        0              113m

Expected results:

The master is replaced successfully.

Additional info:

Tested payload 4.13.0-0.nightly-2023-02-03-145213, same result.
An earlier test with 4.13.0-0.nightly-2023-01-27-165107 worked well.

In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.

 

The bug is 100% reproducible.

Steps for reproducing the issue are:

1. Install a cluster with OpenShiftSDN network plugin.

2. Configure egressip for a project.

3. Install NMstate operator.

4. Create a NodeNetworkConfigurationPolicy.

5. Identify on which node the egressIP is present.

6. Restart the nmstate-handler pod running on the identified node.

7. Verify that the egressIP is no longer present.

Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.

This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.

The expectation is that NMstate operator should not interfere with SDN configuration.

Description of problem:

vSphere 4.12 CI jobs are failing with:
admission webhook "validation.csi.vsphere.vmware.com" denied the request: AllowVolumeExpansion can not be set to true on the in-tree vSphere StorageClass

https://search.ci.openshift.org/?search=can+not+be+set+to+true+on+the+in-tree+vSphere+StorageClass&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

 

Version-Release number of selected component (if applicable):

4.12 nightlies

How reproducible:

consistently in CI

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

This appears to have started failing in the past 36 hours.

This is a clone of issue OCPBUGS-8446. The following is the description of the original issue:

Description of problem:

The certificates synced by MCO in 4.13 onwards are more comprehensive and correct, and out of sync issues will surface much faster.

See https://issues.redhat.com/browse/MCO-499 for details

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install 4.13, pause MCPs
2.
3.

Actual results:

Within ~24 hours the cluster will fire critical clusterdown alerts

Expected results:

No alerts fire

Additional info:

 

This is a clone of issue OCPBUGS-11038. The following is the description of the original issue:

Description of problem:

Backport support starting in 4.12.z to a new GCP region europe-west12

Version-Release number of selected component (if applicable):

4.12.z and 4.13.z

How reproducible:

Always

Steps to Reproduce:

1. Use openshift-install to deploy OCP in europe-west12

Actual results:

europe-west12 is not available as a supported region in the user survey

Expected results:

europe-west12 to be available as a supported region in the user survey

Additional info:

 

This is a clone of issue OCPBUGS-10807. The following is the description of the original issue:

Description of problem:

Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations.

When CNO is managed by Hypershift, multus-admission-controller and other CNO-managed deployments should run with non-root security context. If Hypershift runs control plane on kubernetes (as opposed to Openshift) management cluster, it adds pod security context to its managed deployments, including CNO, with runAsUser element inside. In such a case CNO should do the same, set security context for its managed deployments, like multus-admission-controller, to meet Hypershift security rules.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster with Hypershift on a Kubernetes management cluster
2. Check the pod security context of multus-admission-controller

Actual results:

no pod security context is set on multus-admission-controller

Expected results:

pod security context is set with runAsUser: xxxx

Additional info:

Corresponding CNO change 

Description of problem:

E2E test cases for knative and pipeline packages have been disabled on CI due to respective operator installation issues. 
The tests have to be re-enabled once a new operator version is available or the issue is resolved.

References:
https://coreos.slack.com/archives/C6A3NV5J9/p1664545970777239

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

On MicroShift, the Route API is served by kube-apiserver as a CRD. Reusing the same defaulting implementation as vanilla OpenShift through a patch to kube-apiserver is expected to resolve OCPBUGS-4189 but have no detectable effect on OCP.

Additional info:

This patch will be inert on OCP, but is implemented in openshift/kubernetes because MicroShift ingests kube-apiserver through its build-time dependency on openshift/kubernetes.

Value Statement

Ensure the issue title clearly reflects the value of this user story to the
intended persona. (Explain the "WHY")

Implement a quick start guide to help onboard users with hosted cluster creation.

See this for more details: https://docs.google.com/document/d/1wPAtfW6vdd2fZhh2Lax8k6abLxPUgwpZvXIdWyiSyLY/edit#

Definition of Done for Engineering Story Owner (Checklist)

  • ...

Development Complete

  • The code is complete.
  • Functionality is working.
  • Any required downstream Docker file changes are made.

Tests Automated

  • [ ] Unit/function tests have been automated and incorporated into the
    build.
  • [ ] 100% automated unit/function test coverage for new or changed APIs.

Secure Design

  • [ ] Security has been assessed and incorporated into your threat model.

Multidisciplinary Teams Readiness

Support Readiness

  • [ ] The must-gather script has been updated.

Description of problem:

network-tools stopped giving ovn-k master leader info in 4.13, the tool needs to use the lease instead of the configmap for ovn-master.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1.

$ IMG='quay.io/ffernand/network-tools-test:latest'

$ oc adm must-gather --image=${IMG} -- network-tools ovn-get leaders 2>/dev/null | grep leader
[must-gather-b78wd] POD 2023-02-06T19:42:07.242285507Z ovn-k master leader 
[must-gather-b78wd] POD 2023-02-06T19:42:07.669951803Z nbdb leader ovnkube-master-7cdlc
[must-gather-b78wd] POD 2023-02-06T19:42:08.113287051Z sbdb leader ovnkube-master-7cdlc 

ovn-k master leader info is missing
2.
3.

Actual results:

 

Expected results:

It should give ovn-k master leader info similar to what it gives in pre-4.13 versions, for example in this 4.11 version:

[ff@sdn-12 ~]$ oc adm must-gather --image=${IMG} -- network-tools ovn-get leaders 2>/dev/null | grep leader
[must-gather-l727d] POD 2023-02-06T19:39:31.135487892Z ovn-k master leader ovnkube-master-rqsqn              <===== here!
[must-gather-l727d] POD 2023-02-06T19:39:31.758928575Z nbdb leader ovnkube-master-rqsqn
[must-gather-l727d] POD 2023-02-06T19:39:32.067661600Z sbdb leader ovnkube-master-d4txs

Additional info:

 

Persistent build failures have been detected for following components:

  • ose-csi-driver-shared-resource-container
  • ose-csi-driver-shared-resource-webhook-container

They all seem related to the same issue with go dependencies:

atomic_reactor.utils.cachito - ERROR - Request <x> is in "failed" state: Processing gomod dependencies failed 

This is a clone of issue OCPBUGS-11057. The following is the description of the original issue:

Description of problem:
When importing a Serverless Service from a git repository, the topology shows an Open URL decorator even when the "Add Route" checkbox (which is selected by default) was unselected.

The created kn Route makes the Service available within the cluster and the created URL looks like this: http://nodeinfo-private.serverless-test.svc.cluster.local

So the Service is NOT accidentally exposed. It's "just" that we link an internal route that will not be accessible to the user.

This might also happen in the Serverless functions import flow and the container image import flow.

Version-Release number of selected component (if applicable):
Tested older versions and could see this at least on 4.10+

How reproducible:
Always

Steps to Reproduce:

  1. Install the OpenShift Serverless operator and create the required kn Serving resource.
  2. Navigate to the Developer perspective > Add > Import from Git
  3. Enter a git repository (like https://gitlab.com/jerolimov/nodeinfo)
  4. Unselect "Add Route" and press Create

Actual results:
The topology shows the new kn Service with an Open URL decorator in the top right corner.

The button is clickable but the target page could not be opened (as expected).

Expected results:
The topology should not show an Open URL decorator for "private" kn Routes.

The topology sidebar shows similar information; maybe we should replace the link there as well with a Text+Copy button?

A fix should also be tested with Serverless functions and container images!

Additional info:
When the user unselects the "Add route" option an additional label is added to the kn Service. This label could also be added and removed later. When this label is specified the Open URL decorator should not be shown:

metadata:
  labels:
    networking.knative.dev/visibility: cluster-local
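The rule described above (hide the Open URL decorator when the visibility label marks the Service as cluster-local) can be sketched as a small predicate. This is an illustration only, not the console's implementation, which lives in the TypeScript files linked in this report; the function name `should_show_open_url` and the plain-dict representation of the Service are assumptions.

```python
# Label key and value are taken from the report; everything else is illustrative.
VISIBILITY_LABEL = "networking.knative.dev/visibility"
CLUSTER_LOCAL = "cluster-local"


def should_show_open_url(service):
    """Return False for kn Services that are only reachable inside the cluster."""
    metadata = service.get("metadata") or {}
    labels = metadata.get("labels") or {}
    return labels.get(VISIBILITY_LABEL) != CLUSTER_LOCAL
```

Because the label can be added or removed after creation, a check like this would need to run against the current state of the Service rather than the options chosen in the import form.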

See also:

https://github.com/openshift/console/blob/1f6e238b924f4a4337ef917a0eba8aadae161e9c/frontend/packages/knative-plugin/src/utils/create-knative-utils.ts#L108

https://github.com/openshift/console/blob/1f6e238b924f4a4337ef917a0eba8aadae161e9c/frontend/packages/knative-plugin/src/topology/components/decorators/getServiceRouteDecorator.tsx#L15-L21

This is a clone of issue OCPBUGS-12964. The following is the description of the original issue:

Description of problem:

When installing OCP on AWS, a user can set metadataService authentication to Required in order to use IMDSv2; in that case the user requires all VMs to use it.
Currently the bootstrap machine always runs with Optional, which can be blocked in the user's AWS account and cause the installation to fail.

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

Install aws cluster and set metadataService to Required

Steps to Reproduce:

1.
2.
3.

Actual results:

Bootstrap has IMDSv2 set to optional

Expected results:

All VMs have IMDSv2 set to Required

Additional info:

 

Seen in an instance created recently by a 4.12.0-ec.2 GCP provider:

  "scheduling": {
    "automaticRestart": false,
    "onHostMaintenance": "MIGRATE",
    "preemptible": false,
    "provisioningModel": "STANDARD"
  },

From GCP's docs, they may stop instances on hardware failures and other causes, and we'd need automaticRestart: true to auto-recover from that. Also from GCP docs, the default for automaticRestart is true. And on the Go provider side, we doc:

If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".

But the implementing code does not actually float the setting. Seems like a regression here, which is part of 4.10:

$ git clone https://github.com/openshift/machine-api-provider-gcp.git
$ cd machine-api-provider-gcp
$ git log --oneline origin/release-4.10 | grep 'migrate to openshift/api'
44f0f958 migrate to openshift/api

But that's not where the 4.9 and earlier code is located:

$ git branch -a | grep origin/release
  remotes/origin/release-4.10
  remotes/origin/release-4.11
  remotes/origin/release-4.12
  remotes/origin/release-4.13

Hunting for 4.9 code:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.48-x86_64 | grep gcp
  gcp-machine-controllers                        https://github.com/openshift/cluster-api-provider-gcp                       c955c03b2d05e3b8eb0d39d5b4927128e6d1c6c6
  gcp-pd-csi-driver                              https://github.com/openshift/gcp-pd-csi-driver                              48d49f7f9ef96a7a42a789e3304ead53f266f475
  gcp-pd-csi-driver-operator                     https://github.com/openshift/gcp-pd-csi-driver-operator                     d8a891de5ae9cf552d7d012ebe61c2abd395386e

So looking there:

$ git clone https://github.com/openshift/cluster-api-provider-gcp.git
$ cd cluster-api-provider-gcp
$ git log --oneline | grep 'migrate to openshift/api'
...no hits...
$ git grep -i automaticRestart origin/release-4.9  | grep -v '"description"\|compute-gen.go'
origin/release-4.9:vendor/google.golang.org/api/compute/v1/compute-api.json:        "automaticRestart": {

Not actually clear to me how that code is structured. So 4.10 and later GCP machine-API providers are impacted, and I'm unclear on 4.9 and earlier.
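The "float the setting" behaviour described in this report can be sketched as follows. This is not the provider's Go code. The parameter name `restart_policy` and the value "Never" are hypothetical; only the "Always" default and the `automaticRestart` scheduling field come from the report.

```python
def automatic_restart(restart_policy=None):
    """Translate an optional restart policy into the GCP scheduling flag.

    `restart_policy` is a hypothetical provider-level field. When it is
    omitted, the documented default ("Always") is applied instead of
    sending automaticRestart=false.
    """
    if restart_policy in (None, ""):
        return True  # omitted: fall back to the documented default, "Always"
    if restart_policy == "Always":
        return True
    if restart_policy == "Never":  # assumed opt-out value
        return False
    raise ValueError("unknown restart policy: %r" % (restart_policy,))


def scheduling(restart_policy=None):
    """Build the part of the 'scheduling' stanza that this report is about."""
    return {"automaticRestart": automatic_restart(restart_policy)}
```

The point of the sketch is the first branch: an unset field should resolve to the platform default rather than to the zero value of a boolean.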

Upstream Issue: https://github.com/kubernetes/kubernetes/issues/77603

Long log lines get corrupted when using '--timestamps' by the Kubelet.

The root cause is that the buffer reads up to a new line. If the line is greater than 4096 bytes and '--timestamps' is turned on, the kubelet will write the timestamp and the partial log line. We will need to refactor the ReadLogs function to allow for a partial line read.

https://github.com/kubernetes/kubernetes/blob/f892ab1bd7fd97f1fcc2e296e85fdb8e3e8fb82d/pkg/kubelet/kuberuntime/logs/logs.go#L325

apiVersion: v1
kind: Pod
metadata:
  name: logs
spec:
  restartPolicy: Never
  containers:
  - name: logs
    image: fedora
    args:
    - bash
    - -c
    - 'for i in `seq 1 10000000`; do echo -n $i; done'
kubectl logs logs --timestamps
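The root cause described above, and the idea of allowing partial line reads, can be illustrated with a simplified sketch. It is not the kubelet's ReadLogs function and it ignores the real on-disk log format; the function name and the fixed `timestamp` argument are assumptions. The sketch reads at most `buf_size` bytes at a time and writes a timestamp only when the previous chunk ended a line, so a line longer than the buffer gets one timestamp instead of one per chunk.

```python
import io


def write_with_timestamps(src, dst, timestamp, buf_size=4096):
    """Copy src to dst, prefixing each logical line with a timestamp once."""
    at_line_start = True
    while True:
        # readline(n) returns at most n bytes and stops after a newline.
        chunk = src.readline(buf_size)
        if not chunk:
            break
        if at_line_start:
            dst.write(timestamp.encode("ascii") + b" ")
        dst.write(chunk)
        # Only a chunk that ends with a newline completes the logical line.
        at_line_start = chunk.endswith(b"\n")


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 10000 + b"\nshort\n")
    out = io.BytesIO()
    write_with_timestamps(source, out, "2023-01-01T00:00:00Z")
    print(out.getvalue().count(b"2023-01-01T00:00:00Z"))
```

A naive variant that prefixes every chunk would insert extra timestamps in the middle of the 10000-byte line, which is the corruption this issue reports.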

Description of the problem:

When upgrading SNO installed via assisted-installer to the EC version of 4.13, the local dnsmasq process in the node is not listening on all interfaces, and only listens for localhost loopback.

It makes kubelet and kube-apiserver unable to resolve the fqdn and api/api-int by locally requesting dns resolution from the dnsmasq process.

How reproducible:

100%

Steps to reproduce:

  1. Upgrade to registry.ci.openshift.org/rhcos-devel/ocp-4.13-9.0:4.13.0-ec.1

Actual results:

The SNO upgrade gets stuck and does not proceed.

Expected results:

Successful SNO upgrade.

This is a clone of issue OCPBUGS-10787. The following is the description of the original issue:

STATUS: We basically know the general shape of what we need to do, and PoC work exists to do it and is queued up in https://github.com/openshift/machine-config-operator/pull/3650 for testing.  However, uncertainty remains around:

 

  • Whether PoC code actually works e2e
  • The potential blast radius of the changes that could affect other scenarios (trying to minimize)

 

Description of problem:

Upgrades from OpenShift 4.12 to 4.13 will also upgrade the underlying RHCOS from 8.6 to 9.2. As part of that, the names of the network interfaces may change. For example, `eno1` may be renamed to `eno1np0`. If a host is using NetworkManager configuration files that rely on those names, then the host will fail to connect to the network when it boots after the upgrade. For example, if the host had static IP addresses assigned, it will instead boot using IP addresses assigned via DHCP.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always.

Steps to Reproduce:

1. Select hardware (or VMs) that will have different network interface names in RHCOS 8 and RHCOS 9, for example `eno1` in RHCOS 8 and `eno1np0` in RHCOS 9.

2. Install a 4.12 cluster with static network configuration using the `interface-name` field of NetworkManager interface configuration files to match the configuration to the network interface.

3. Upgrade the cluster to 4.13.

Actual results:

The NetworkManager configuration files are ignored because they no longer match the NIC names. Instead the NICs get new IP addresses from DHCP.

Expected results:

The NetworkManager configuration files are updated as part of the upgrade to use the new NIC names.

Additional info:

Note that this is a hypothetical scenario. We have detected this potential problem in a slightly different scenario where we install a 4.13 cluster with the assisted installer. During the discovery phase we use RHCOS 8 and we generate the NetworkManager configuration files. Then we reboot into RHCOS 9, and the configuration files are ignored due to the change in the NIC names. See MGMT-13970 for more details.
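The expected result above (configuration files updated to use the new NIC names) can be sketched as a text rewrite of the `interface-name` key. This is not the approach taken in the linked machine-config-operator PR, whose details are not given here; the function name and the `renames` mapping are assumptions, and how the old-to-new mapping would be discovered is left out.

```python
def rename_interfaces(keyfile_text, renames):
    """Rewrite 'interface-name=' entries according to a rename mapping.

    `renames` is a hypothetical mapping such as {"eno1": "eno1np0"}.
    All other lines are passed through unchanged.
    """
    out = []
    for line in keyfile_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "interface-name" and value.strip() in renames:
            line = "interface-name=" + renames[value.strip()]
        out.append(line)
    return "\n".join(out) + "\n"
```

Only the `interface-name` key is touched, so a connection whose `id` happens to equal the old NIC name keeps its identifier.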

Description of problem:

When using the agent-based installer to bring up an OpenShift cluster on bare metal, if the server's BMC supports LAN over USB, it will create a virtual interface on the operating system with an IP address in the 169.254.0.0/16 range, typically 169.254.0.2, and the iDRAC will be at 169.254.0.1.
For example:

# ip a s
...SNIP...
6: idrac: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether ec:2a:72:04:23:87 brd ff:ff:ff:ff:ff:ff
    inet 169.254.1.2/16 brd 169.254.255.255 scope global idrac
       valid_lft forever preferred_lft forever
...SNIP...

The assisted installer agent will then pick such an interface to check connectivity. Because a link-local address is valid only for communication within its subnetwork, it cannot reach the other node's eth0 (for example, with IP address 20.12.32.10). Therefore, the connectivity check fails.

The "LAN over USB" interface is only used for communication between the node and the BMC, so it should not be picked for the connectivity check.
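One way to express "do not pick this interface" is to filter candidate addresses by whether they are IPv4 link-local. The following is a minimal sketch using Python's standard `ipaddress` module, not the assisted installer agent's code; the function name is an assumption, and excluding loopback addresses as well is an extra choice made here for illustration.

```python
import ipaddress


def usable_for_connectivity_check(addresses):
    """Drop IPv4 link-local (169.254.0.0/16) and loopback addresses.

    `addresses` is a list of CIDR strings such as "169.254.1.2/16".
    """
    keep = []
    for cidr in addresses:
        ip = ipaddress.ip_interface(cidr).ip
        if ip.version == 4 and ip.is_link_local:
            continue  # e.g. the BMC "LAN over USB" interface
        if ip.is_loopback:
            continue
        keep.append(cidr)
    return keep
```

For the example in this report, `usable_for_connectivity_check(["169.254.1.2/16", "20.12.32.10/24"])` keeps only the routable `20.12.32.10/24` address.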

Version-Release number of selected component (if applicable):

4.11.0

How reproducible:

100%

Steps to Reproduce:

1. Mount agent.iso or discovery.iso to a server that supports LAN over USB.
Or
1. Mount agent.iso or discovery.iso to a VM.
2. Create a dummy interface with a link-local address.

Actual results:

connectivity check fails

Expected results:


connectivity check passes

Additional info:

 

Console should be using the v1 version of the ConsolePlugin model rather than the old v1alpha1.

CONSOLE-3077 was updating this version, but did not make the cut for the 4.12 release. Based on discussion with Samuel Padgett we should be backporting to 4.12.

 

The risk should be minimal since we are only updating the model itself + validation + Readme

Description of problem:

The statsPort is not correctly set for the HostNetwork endpointPublishingStrategy.

When we change the httpPort from 80 to 85 and the statsPort from 1936 to 1939 on the default router, like here:

# oc get IngressController default -n openshift-ingress-operator
...
 clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      statsPort: 1939
    type: HostNetwork
...
status:
...  
endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      protocol: TCP
      statsPort: 1939
 
We can see that the router pods get restarted:

# oc get pod -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-5b96855754-2wnrp   1/1     Running   0          1m
router-default-5b96855754-9c724   1/1     Running   0          2m

The pods are configured correctly:

# oc get pod router-default-5b96855754-2wnrp -o yaml
...
spec:
  containers:
  - env:
    - name: ROUTER_SERVICE_HTTPS_PORT
      value: "443"
    - name: ROUTER_SERVICE_HTTP_PORT
      value: "85"
    - name: STATS_PORT
      value: "1939"
...
    livenessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /healthz
        port: 1939
        scheme: HTTP
...
    ports:
    - containerPort: 85
      hostPort: 85
      name: http
      protocol: TCP
    - containerPort: 443
      hostPort: 443
      name: https
      protocol: TCP
    - containerPort: 1939
      hostPort: 1939
      name: metrics
      protocol: TCP

But the endpoint is incorrect:

# oc get ep router-internal-default -o yaml
...
apiVersion: v1
items:
- apiVersion: v1
  kind: Endpoints
  metadata:
    creationTimestamp: "2022-12-02T13:34:48Z"
    labels:
      ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
    name: router-internal-default
    namespace: openshift-ingress
    resourceVersion: "23216275"
    uid: 50c00fc0-08e5-4a6a-a7eb-7501fa1a7ba6
  subsets:
  - addresses:
    - ip: 10.74.211.203
      nodeName: worker-0.rhodain01.lab.psi.pnq2.redhat.com
      targetRef:
        kind: Pod
        name: router-default-5b96855754-2wnrp
        namespace: openshift-ingress
        uid: eda945b9-9061-4361-b11a-9d895fee0003
    - ip: 10.74.211.216
      nodeName: worker-1.rhodain01.lab.psi.pnq2.redhat.com
      targetRef:
        kind: Pod
        name: router-default-5b96855754-9c724
        namespace: openshift-ingress
        uid: 97a04c3e-ddea-43b7-ac70-673279057929
    ports:
    - name: metrics
      port: 1936
      protocol: TCP
    - name: https
      port: 443
      protocol: TCP
    - name: http
      port: 85
      protocol: TCP
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Notice that the http port is correctly set to 85, but the stats port is still set to 1936 and not to 1939. That is a problem because the metrics target endpoint is reported as down with an error message:

    Get "https://10.74.211.203:1936/metrics": dial tcp 10.74.211.203:1936: connect: connection refused

When the EP is corrected and the ports are changed to:
  ports:
  - name: metrics
    port: 1939
    protocol: TCP

the metrics target endpoint is picked up correctly and the metrics are scraped as expected.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

Set endpointPublishingStrategy and modify the port for statsPort:

endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      protocol: TCP
      statsPort: 1939

 

Actual results:

Stats are scraped from the standard port and not the one specified.

Expected results:

The endpoint object is pointing to the specified port.
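The mismatch described in this report can be detected by comparing the hostNetwork ports with the ports published on the Endpoints object. The following sketch works on plain dictionaries shaped like the YAML shown above; it is not operator code, the function names are assumptions, and the fallback values 80, 443 and 1936 are the defaults implied by this report.

```python
def expected_endpoint_ports(host_network):
    """Map endpoint port names to the ports configured under hostNetwork."""
    return {
        "http": host_network.get("httpPort", 80),
        "https": host_network.get("httpsPort", 443),
        "metrics": host_network.get("statsPort", 1936),
    }


def port_mismatches(host_network, endpoint_ports):
    """Return {name: (expected, actual)} for every port that differs."""
    expected = expected_endpoint_ports(host_network)
    actual = {p["name"]: p["port"] for p in endpoint_ports}
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }
```

Fed with the values from this report (statsPort 1939 in the IngressController, metrics port 1936 in the Endpoints object), `port_mismatches` returns `{"metrics": (1939, 1936)}`, while the http and https ports match.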

Additional info:

 

This is a clone of issue OCPBUGS-11442. The following is the description of the original issue:

Description of problem:

Currently: Hypershift is squashing any user configured proxy configuration based on this line: https://github.com/openshift/hypershift/blob/main/support/globalconfig/proxy.go#L21-L28, https://github.com/openshift/hypershift/blob/release-4.11/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L487-L493. Because of this any user changes to the cluster-wide proxy configuration documented here: https://docs.openshift.com/container-platform/4.12/networking/enable-cluster-wide-proxy.html are squashed and not valid for more than a few seconds. That blocks some functionality in the openshift cluster from working including application builds from the openshift samples provided in the cluster. 

 

Version-Release number of selected component (if applicable):

4.13 4.12 4.11

How reproducible:

100%

Steps to Reproduce:

1. Make a change to the Proxy object in the cluster with kubectl edit proxy cluster
2. Save the change
3. Wait a few seconds

Actual results:

HostedClusterConfig operator will go in and squash the value

Expected results:

The value the user provides remains in the configuration and is not squashed to an empty value

Additional info:

 

This is a clone of issue OCPBUGS-13314. The following is the description of the original issue:

Description of problem:

[vmware csi driver] vsphere-syncer does not retry populating the CSINodeTopology with topology information when registration fails

When the syncer starts, it watches for node events, but it does not retry if registration fails. In the meantime, any CSINodeTopology requests might not get served because the VM is not found.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-05-04-090524

How reproducible:

Randomly

Steps to Reproduce:

1. Install an OCP cluster by UPI with encryption
2. Check that the cluster storage operator is not degraded

Actual results:

The cluster storage operator is degraded with VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods

...
2023-05-09T06:06:22.146861934Z I0509 06:06:22.146850       1 main.go:183] ServeMux listening at "0.0.0.0:10300"
2023-05-09T06:07:00.283007138Z E0509 06:07:00.282912       1 main.go:64] failed to establish connection to CSI driver: context canceled
2023-05-09T06:07:07.283109412Z W0509 06:07:07.283061       1 connection.go:173] Still connecting to unix:///csi/csi.sock
...

# Many errors in the CSI driver logs report: timed out while waiting for topology labels to be updated in "compute-2" CSINodeTopology instance.

...
2023-05-09T06:19:16.499856730Z {"level":"error","time":"2023-05-09T06:19:16.499687071Z","caller":"k8sorchestrator/topology.go:837","msg":"timed out while waiting for topology labels to be updated in \"compute-2\" CSINodeTopology instance.","TraceId":"b8d9305e-9681-4eba-a8ac-330383227a23","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/common/commonco/k8sorchestrator.(*nodeVolumeTopology).GetNodeTopologyLabels\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/common/commonco/k8sorchestrator/topology.go:837\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).NodeGetInfo\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/node.go:429\ngithub.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6231\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/vendor/google.golang.org/grpc/server.go:922"}
...

Expected results:

Installing the vSphere OCP cluster succeeds and the cluster storage operator is healthy

Additional info:

 

Description of problem:

When a user changes the vSphere configuration or credentials (username, password, vCenter address, ...), vsphere-problem-detector should re-check them quickly and not wait for the periodic re-check, which can happen up to 1 hour later.

 

Version-Release number of selected component (if applicable): 4.13.0-0.nightly-2022-11-25-204445, but all previous versions are probably affected too.

How reproducible: Always

Steps to Reproduce:

  1. Install a cluster on vSphere (with valid credentials)
  2. Configure a bad username / password
  3. See that ClusterCSIDriver for vSphere CSI driver gets Degraded in ~2 minutes (that's vsphere-csi-driver-operator, it's quick)
  4. Wait until vsphere-problem-detector realizes it's a bad password (could take up to 1 hour):
    1. See that oc get storage -o yaml shows VSphereProblemDetectorControllerAvailable as "True" with message failed to connect to vcenter.XYZ: Cannot complete login due to an incorrect user name or password
    2. See that VSphereOpenshiftConnectionFailure alert is firing (or at least Pending)
  5. Configure correct username/password

Actual results:

It takes up to 1 hour for vsphere-problem-detector to re-check the password

Expected results:

vsphere-problem-detector re-checks the new password within a few minutes (due to leader election it can't be instant). The alert and the VSphereProblemDetectorControllerAvailable condition are cleared in 5 minutes at most.
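A minimal sketch of the expected scheduling, in Python for illustration only. The 1 hour interval comes from this report; the `CheckScheduler` class, the configuration hash and the 60 second quick re-check delay are hypothetical and not taken from vsphere-problem-detector.

```python
from dataclasses import dataclass

PERIODIC_INTERVAL = 60 * 60  # seconds; the 1 hour periodic re-check from the report
QUICK_RECHECK_DELAY = 60     # seconds; hypothetical short delay after a change


@dataclass
class CheckScheduler:
    last_config_hash: str
    next_check_at: float

    def observe(self, config_hash: str, now: float) -> None:
        """Pull the next check forward when the configuration has changed."""
        if config_hash != self.last_config_hash:
            self.last_config_hash = config_hash
            self.next_check_at = min(self.next_check_at, now + QUICK_RECHECK_DELAY)

    def check_done(self, now: float) -> None:
        """After a check, fall back to the periodic interval."""
        self.next_check_at = now + PERIODIC_INTERVAL
```

The point of the design is that a credentials change only ever moves the next check earlier, so the periodic behaviour stays unchanged when nothing is edited.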

Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/72

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

add_host day2 does not work when edit permission is granted to another user.

The Add host tab is available, but pressing Next (which is enabled) does nothing.

 

How reproducible:

always

Steps to reproduce:

Log in as user nshidlin-aiqe1 (org-admin role).

Create a cluster and install it successfully.

Grant edit permission to users nshidlin-aiqe1-u1 and nshidlin-aiqe1-u2.

Log out and log in as nshidlin-aiqe1-u1.

The Add host tab is available, but pressing Next (which is enabled) does nothing.

 

Actual results:

Next does nothing

Expected results:

The wizard moves to the next window, where the image can be downloaded

Description of problem:

'Filter by resource' drop-down menu items are in English.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

 

Steps to Reproduce:

1. Navigate to Developer -> Topology -> Filter by resource
2. 'DaemonSet', 'Deployment' are in English
3.

Actual results:

Content is in English

Expected results:

Content should be in target language.

Additional info:

 

Currently the ConsoleNotification CR that's responsible for notifying the user that the cluster is being upgraded has an accessibility issue: we are surfacing white text on a yellow background. We should use black text in this case.
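The accessibility problem can be quantified with the WCAG contrast ratio. The sketch below uses the WCAG 2.x relative luminance formula; the pure yellow value `(255, 255, 0)` is an assumption for illustration, since the exact background colour used by the console is not stated here.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.x)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background) -> float:
    """Return the WCAG contrast ratio, a value between 1 and 21."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
YELLOW = (255, 255, 0)  # assumed background colour, not the console's actual value
```

Under that assumption, `contrast_ratio(WHITE, YELLOW)` stays well below the 4.5 minimum WCAG AA requires for normal text, while `contrast_ratio(BLACK, YELLOW)` is comfortably above it.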

 

 

Description of problem:
In the OpenShift on OpenStack CI, we deploy an OCP cluster with an additional network on the workers in install-config.yaml for integration with OpenStack Manila.

compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['0eeae16f-bbc7-4e49-90b2-d96419b7c30d']
  replicas: 3

As a result, the egressIP annotation includes two interface definitions:

$ oc get node ostest-hp9ld-worker-0-gdp5k -o json | jq -r '.metadata.annotations["cloud.network.openshift.io/egress-ipconfig"]' | jq .                                 
[
  {
    "interface": "207beb76-5476-4a05-b412-d0cc53ab00a7",
    "ifaddr": {
      "ipv4": "10.46.44.64/26"
    },
    "capacity": {
      "ip": 8
    }
  },
  {
    "interface": "2baf2232-87f7-4ad5-bd80-b6586de08435",
    "ifaddr": {
      "ipv4": "172.17.5.0/24"
    },
    "capacity": {
      "ip": 10
    }
  }
]

According to Huiran Wang, egressIP only works for primary interface on the node.
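One way to restrict the annotation to the primary interface is sketched below in Python. The report does not say how the primary interface is identified, so the sketch assumes the caller knows the machine network CIDR and keeps only entries whose IPv4 subnet lies inside it; the function name and that selection rule are hypothetical.

```python
import ipaddress
import json


def primary_egress_config(annotation: str, machine_network: str) -> list:
    """Keep only entries whose IPv4 subnet lies inside machine_network.

    annotation is the JSON text of the
    cloud.network.openshift.io/egress-ipconfig annotation.
    """
    primary = ipaddress.ip_network(machine_network)
    kept = []
    for entry in json.loads(annotation):
        subnet = ipaddress.ip_network(entry["ifaddr"]["ipv4"], strict=False)
        if subnet.subnet_of(primary):
            kept.append(entry)
    return kept
```

Applied to the annotation above with the machine network of the cluster, this would drop the entry that belongs to the additional Manila network.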

Version-Release number of selected component (if applicable):

 4.12.0-0.nightly-2022-11-22-012345
RHOS-16.1-RHEL-8-20220804.n.1

How reproducible:

Always

Steps to Reproduce:

Deploy cluster with additional Network on the workers

Actual results:

It is possible to select an egressIP network for a secondary interface

Expected results:

Only primary subnet can be chosen for egressIP

Additional info:

https://issues.redhat.com/browse/OCPQE-12968

Description of problem:

Reported by @dollierp in OCPBUGS-7293: /etc/resolv.conf has mode 0600.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

100% 

Steps to Reproduce:

1. Install a 4.13 cluster using RHCOS 9.2 on a baremetal network runtime platform
2. 
3.

Actual results:

/etc/resolv.conf is 0600
/etc/resolv.conf is system_u:object_r:tmp_t:s0

Expected results:

/etc/resolv.conf is 0644
/etc/resolv.conf is system_u:object_r:net_conf_t:s0
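The permission part of this check can be scripted. The following Python sketch reads the mode bits of a file and reports whether it is world-readable; the helper names are illustrative, and the SELinux label is not covered.

```python
import os
import stat


def file_mode(path: str) -> int:
    """Return the permission bits of path, for example 0o644."""
    return stat.S_IMODE(os.stat(path).st_mode)


def is_world_readable(path: str) -> bool:
    """True when users other than the owner and group can read the file."""
    return bool(file_mode(path) & stat.S_IROTH)
```

On an affected node, `oct(file_mode("/etc/resolv.conf"))` would be expected to show `0o600` rather than `0o644`.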

Additional info:

Unclear why this happens only on 9.x

Right now there is a list of machine CIDRs in the UI. It would be nice to automatically select one of them,

or at least, when there is only one, to select it.

 

Eran Cohen

Description of problem:

This is the original bug: https://bugzilla.redhat.com/show_bug.cgi?id=2098054

It was fixed in https://github.com/openshift/kubernetes/pull/1340 but was reverted as it introduced a bug that meant we did not register instances on create for NLB services.

Need to fix the issue and reintroduce the fix

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Operand list page has strange layout when screen size is small. 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-20-104328

How reproducible:

Always

Steps to Reproduce:

1. Install Ansible Automation Platform operator
2. Go to one of its operand pages
3. Make the screen size smaller

Actual results:

The radio button is poorly aligned

Expected results:

Radio button and Title should be neatly aligned. 

Additional info:

On mobile devices, the Title will be missing >30%

Description of problem:

Installing a 4.12 IPv6 single-stack disconnected cluster leaves an etcd member in an abnormal state:

  1. oc get co|grep etcd
    etcd 4.12.0-0.nightly-2022-10-23-204408 False True True 15h EtcdMembersAvailable: 1 of 2 members are available, openshift-qe-057.arm.eng.rdu2.redhat.com is unhealthy

E1026 03:35:58.409977 1 etcdmemberscontroller.go:73] Unhealthy etcd member found: openshift-qe-057.arm.eng.rdu2.redhat.com, took=, err=create client failure: failed to make etcd client for endpoints https://[26xx:52:0:1eb:3xx3:5xx:fxxe:7550]:2379: context deadline exceeded

How reproducible:
Not always

Steps to Reproduce:
As in the description
Actual results:
As in the title
Expected results:
The etcd cluster operator status is normal

Description of problem:

When gathering Pod definitions, the operator should obfuscate particular environment variables (HTTP_PROXY and HTTPS_PROXY) in containers by default.

Pods from the control plane can have those variables injected from the cluster-wide proxy, and they may contain values such as "user:password@http://6.6.6.6:1234".

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. In order to change deployments, scale down:
  * cluster-version-operator
  * cluster-monitoring-operator
  * prometheus-operator
2. Introduce a new environment variable on the alertmanager-main StatefulSet with either or both of HTTP_PROXY and HTTPS_PROXY. Any non-empty value will do.
3. Run insights-operator to gather the pod definitions.
4. Check in the archive (usually config/pod/openshift-monitoring/alertmanager-main-0.json) that the value of the target environment variable(s) is obfuscated.

Actual results:

...
"spec": {
    ...
    "containers": {
        ...
        "env": [
            {
                "name": "HTTP_PROXY",
                "value": "jdow:1qa2wd@http://8.8.8.8:8080"
            }
        ]
    }
}
...

Expected results:

...
"spec": {
    ...
    "containers": {
        ...
        "env": [
            {
                "name": "HTTP_PROXY",
                "value": "xxxxxxxxx" // Where x char number is the length of the obfuscated string
            }
        ]
    }
}
...
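The expected obfuscation can be sketched as below. This is a Python illustration of the idea, not the operator's implementation; it assumes `spec.containers` is a list, as in a real Pod definition, and the function name is hypothetical.

```python
import copy

SENSITIVE_ENV_NAMES = {"HTTP_PROXY", "HTTPS_PROXY"}


def obfuscate_proxy_env(pod: dict) -> dict:
    """Return a copy of pod with proxy env values replaced by 'x' characters.

    Each obfuscated value keeps the length of the original string.
    """
    result = copy.deepcopy(pod)
    for container in result.get("spec", {}).get("containers", []):
        for env in container.get("env", []):
            if env.get("name") in SENSITIVE_ENV_NAMES and "value" in env:
                env["value"] = "x" * len(env["value"])
    return result
```

Working on a deep copy keeps the original object untouched, so only the archived definition loses the credentials.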

Additional info:

 

Description of the problem:
On the host discovery page, after selecting the LVMO operator, the warning beside the host shows:

Insufficient

    LVM requirements: ODF LVM requires at least one non-installation HDD/SSD disk on the host (minimum size: 0 GB). 

 
Since the additional disk is mandatory for LVMO, it is not clear to me why its size is allowed to be very small (say 4 MB).

If this is still correct and the minimum requirement is actually 0 GB, I suggest changing the warning to simply state that an additional disk is required (remove the "minimum size: 0 GB" part or change the text).

How reproducible:

 

Steps to reproduce:

1. Create an SNO cluster

2. Choose the LVMO operator

3. In host discovery, click the Insufficient link beside the host name

Note: after attaching a 4 MB disk the LVMO requirement is met and I am able to continue with the installation

Actual results:

The following text appears:
Warning alert: Insufficient

LVM requirements: ODF LVM requires at least one non-installation HDD/SSD disk on the host (minimum size: 0 GB).

Expected results:

If the minimum size is not 0 GB, show the correct size.
If the minimum size is 0 GB, change the text (remove the 0 GB part).
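The second option could look like the sketch below. It is a Python illustration only; the function name is hypothetical and the real message is produced by the assisted service, not by this code.

```python
def lvm_requirement_message(min_size_gb: int) -> str:
    """Build the requirement text, omitting a meaningless 0 GB minimum."""
    base = (
        "LVM requirements: ODF LVM requires at least one "
        "non-installation HDD/SSD disk on the host"
    )
    if min_size_gb > 0:
        return f"{base} (minimum size: {min_size_gb} GB)."
    return f"{base}."
```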

This is a clone of issue OCPBUGS-8683. The following is the description of the original issue:

Clone of OCPBUGS-7906, but for all CSI drivers and operators other than shared resource. All Pods / containers that are part of the OCP platform should run on dedicated "management" CPUs (if configured), i.e. they should have the annotation 'target.workload.openshift.io/management:{"effect": "PreferredDuringScheduling"}'.
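A quick way to verify a manifest is sketched below in Python. The annotation key and value are the ones quoted above; the helper name and the dict layout of the pod template (`metadata.annotations`) are assumptions for illustration.

```python
import json

ANNOTATION_KEY = "target.workload.openshift.io/management"
EXPECTED_VALUE = {"effect": "PreferredDuringScheduling"}


def has_management_annotation(pod_template: dict) -> bool:
    """True when the pod template carries the management annotation."""
    annotations = pod_template.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(ANNOTATION_KEY)
    if raw is None:
        return False
    try:
        # The annotation value is itself a small JSON document.
        return json.loads(raw) == EXPECTED_VALUE
    except (TypeError, ValueError):
        return False
```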

Enhancement: https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md

So far nobody has run our cloud CSI drivers with CPU pinning enabled, so this bug is low priority. I checked LSO; it already has correct CPU pinning in all Pods.

Description of problem:


We added a line to increase debugging verbosity to aid in debugging WRKLDS-540

Version-Release number of selected component (if applicable):

13

How reproducible:

very

Steps to Reproduce:

1. Just a revert
2.
3.

Actual results:

Extra debugging lines are present in the openshift-config-operator pod logs

Expected results:

Extra debugging lines no longer in the openshift-config-operator pod logs

Additional info:


Description of the problem:

Deployed Assisted Installer (AI) on minikube with the stable tag

{
  "release_tag": "stable",
  "versions": {
    "assisted-installer": "quay.io/edge-infrastructure/assisted-installer:latest-304a0efaa2680562c0f613df1b788287df904531",
    "assisted-installer-controller": "quay.io/edge-infrastructure/assisted-installer-controller:latest-304a0efaa2680562c0f613df1b788287df904531",
    "assisted-installer-service": "quay.io/edge-infrastructure/assisted-service:latest-be2c5764fd0f2d90750d650aff267cda579fa4ef",
    "discovery-agent": "quay.io/edge-infrastructure/assisted-installer-agent:latest-5fa61e2d7ec70aedc6db48b8ee65e1e1117c4b73"
  }
}
 

The drop-down list of OCP versions contains the entry "OCP 4.12.0-rc.4 dev preview release" twice.

According to Osher, this PR is the reason: https://github.com/openshift/assisted-service/pull/4795/files

 

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

Similar to OCPBUGS-11636, ccoctl needs to be updated to account for the S3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

These changes have rolled out to us-east-2 and China regions as of today and will roll out to additional regions in the near future.

See OCPBUGS-11636 for additional information

Version-Release number of selected component (if applicable):

 

How reproducible:

Reproducible in affected regions.

Steps to Reproduce:

1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.

Actual results:

./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs
        status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=

Expected results:

"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.

Additional info:

 

The AzureDisk CSI driver exposes metrics at 0.0.0.0:29604 over HTTP; we should expose them over HTTPS only.

The CSI driver already provides cmdline option --metrics-address to expose the metrics on loopback and we can use kube-rbac-proxy to add a public HTTPS endpoint in front of it.

This is a clone of issue OCPBUGS-8328. The following is the description of the original issue:

aws-ebs-csi-driver-operator ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.

Description of problem:

Egress router pod creation on OpenShift 4.11 fails with the error below.
~~~
Nov 15 21:51:29 pltocpwn03 hyperkube[3237]: E1115 21:51:29.467436    3237 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\\\": rpc error: code = Unknown desc = failed to create pod network sandbox k8s_stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy_c965a287-28aa-47b6-9e79-0cc0e209fcf2_0(72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344): error adding pod stage-wfe-proxy_stage-wfe-proxy-ext-qrhjw to CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (add): [stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw/c965a287-28aa-47b6-9e79-0cc0e209fcf2:openshift-sdn]: error adding container to network \\\"openshift-sdn\\\": CNI request failed with status 400: 'could not open netns \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": unknown FS magic on \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": 1021994\\n'\"" pod="stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw" podUID=c965a287-28aa-47b6-9e79-0cc0e209fcf2
~~~

I checked the SDN pod log on the node where the egress router pod is failing and found the error message below.

~~~
2022-11-15T21:51:29.283002590Z W1115 21:51:29.282954  181720 pod.go:296] CNI_ADD stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw failed: could not open netns "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": unknown FS magic on "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": 1021994
~~~

CRI-O logs the event below; judging from the log, the namespace has been created on the node.

~~~
Nov 15 21:51:29 pltocpwn03 crio[3150]: time="2022-11-15 21:51:29.307184956Z" level=info msg="Got pod network &{Name:stage-wfe-proxy-ext-qrhjw Namespace:stage-wfe-proxy ID:72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344 UID:c965a287-28aa-47b6-9e79-0cc0e209fcf2 NetNS:/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
~~~

Version-Release number of selected component (if applicable):

4.11.12

How reproducible:

Not Sure

Steps to Reproduce:

1.
2.
3.

Actual results:

The egress router pod fails to be created. A sample application can be created without any issue.

Expected results:

The egress router pod should be created

Additional info:

The egress router pod was created by following the document below, and it does contain the pod.network.openshift.io/assign-macvlan: "true" annotation.

https://docs.openshift.com/container-platform/4.11/networking/openshift_sdn/deploying-egress-router-layer3-redirection.html#nw-egress-router-pod_deploying-egress-router-layer3-redirection

Today we use telemeter-client to send telemetry home. Ideally we will migrate this to use Prometheus native remote_write. This can already be configured, but there are scalability concerns on the Observatorium side.
It would, however, be an improvement to at least test this functionality with a CI test.

spec.nodeName can be an FQDN whereas HOSTNAME can be the short name. For example, on AWS:

[akaris@linux whereabouts-clusterbot]$ oc rsh -n openshift-multus whereabouts-reconciler-nq49z
sh-4.4# env | grep NAME
HOSTNAME=ip-10-0-252-190
OPENSHIFT_BUILD_NAME=multus-whereabouts-ipam-cni
OPENSHIFT_BUILD_NAMESPACE=ci-ln-d4sindb-1
[akaris@linux whereabouts-clusterbot]$ oc get pods -n openshift-multus whereabouts-reconciler-nq49z -o yaml | grep nodeName
  nodeName: ip-10-0-252-190.us-west-2.compute.internal

This means that the following filter expression fails:

whereabouts-cni/cmd/controlloop/controlloop.go
    ipPoolInformerFactory := wbinformers.NewSharedInformerFactory(wbClientSet, noResyncPeriod)
    netAttachDefInformerFactory := nadinformers.NewSharedInformerFactory(nadK8sClientSet, noResyncPeriod)
    podInformerFactory := v1coreinformerfactory.NewSharedInformerFactoryWithOptions(
        k8sClientSet, noResyncPeriod, v1coreinformerfactory.WithTweakListOptions(
            func(options *v1.ListOptions) {
                const (
                    filterKey           = "spec.nodeName"
                    hostnameEnvVariable = "HOSTNAME"
                )
                options.FieldSelector = fields.OneTermEqualSelector(filterKey, os.Getenv(hostnameEnvVariable)).String()
            }))
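The mismatch is easy to reproduce outside the cluster. The Python sketch below mimics the exact-match comparison the field selector performs, using the two values from the AWS example above. The short-name comparison is shown only to illustrate the difference between the names; it is not presented as the fix.

```python
def matches_exact(node_name: str, hostname: str) -> bool:
    """Mimic an exact-match field selector on spec.nodeName."""
    return node_name == hostname


def matches_short_name(node_name: str, hostname: str) -> bool:
    """Compare only the first DNS label of both names (illustrative)."""
    return node_name.split(".", 1)[0] == hostname.split(".", 1)[0]


NODE_NAME = "ip-10-0-252-190.us-west-2.compute.internal"  # spec.nodeName
HOSTNAME = "ip-10-0-252-190"                              # HOSTNAME in the pod
```

Here `matches_exact(NODE_NAME, HOSTNAME)` is false, so a selector built from HOSTNAME selects no pods on that node.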

Description of problem:
I have a customer who created a clusterquota for one of their namespaces. It was created, but the values were not shown under limits and the namespace details were not displayed.
~~~
$ oc describe AppliedClusterResourceQuota
Name: test-clusterquota
Created: 19 minutes ago
Labels: size=custom
Annotations: <none>
Namespace Selector: []
Label Selector:
AnnotationSelector: map[openshift.io/requester:system:serviceaccount:application-service-accounts:test-sa]
Scopes: NotTerminating
Resource Used Hard
-------- ---- ----
~~~

WORKAROUND: They recreated the clusterquota object (saved a copy, deleted it, created it anew), after which it displayed values as expected.

In the past, they saw similar behavior on their test cluster. That cluster was heavily utilized: the etcd DB was much larger (>2.5Gi) and held many more objects (at that time, helm secrets were being cached for all deployments, keeping a history of 10, so etcd was being bombarded).

On this cluster the same "symptom" was noticed, but etcd was nowhere near that size, nor was the number of etcd objects and/or cached helm secrets.

Version-Release number of selected component (if applicable): OCP 4.9

How reproducible: Occurred only twice (once in the test cluster and once in the current cluster)

Steps to Reproduce:
1. Create ClusterQuota
2. Check AppliedClusterResourceQuota
3. The values and namespace are empty

Actual results: ClusterQuota does not display the values

Expected results: ClusterQuota should display the values

Description of the problem:

Attempting to install an SNO cluster with two day 2 workers fails when using ignitionEndpoint. After the initial spoke cluster installation, the AgentClusterInstall never reaches the adding-hosts state but remains in installed. Also, the admin kubeconfig secret is not generated on the hub cluster, so the spoke cluster's kube API cannot be accessed.

How reproducible:

100%

 

Steps to reproduce:

1. Install an SNO spoke cluster with ignitionEndpoint pointing to the API address

 

Actual results:

Cluster installs and remains in installed state but doesn't generate admin kubeconfig secret

Expected results:

Cluster reaches adding-hosts state and I can add day2 workers to the cluster

Description of problem:

The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, equal to the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2 minutes * (0..1 + 1), that is, 2 minutes scaled by a random factor between 1 and 2. We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test has only a 3 minute timeout. Additionally, the non-determinism and the longer throttling make the UX worse, because actions done in the cluster may have their observable effect delayed.
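The arithmetic can be written out as a short Python sketch. Only the 2 minute base and the 0..1 jitter come from the description; the function names are illustrative and this is not the CVO code.

```python
import random

BASE_PERIOD_SECONDS = 2 * 60  # the 2 minutes from the description


def throttle_period(jitter: float) -> float:
    """Return 2 minutes * (jitter + 1) for a jitter in [0, 1)."""
    if not 0.0 <= jitter < 1.0:
        raise ValueError("jitter must be in [0, 1)")
    return BASE_PERIOD_SECONDS * (jitter + 1.0)


def sample_throttle_period(rng: random.Random) -> float:
    """Draw one throttling period, as would happen once at start time."""
    return throttle_period(rng.random())
```

Any jitter above 0.5 yields a period longer than 180 seconds, which is longer than the 3 minute test timeout. If the jitter is uniformly distributed, that would affect about half of all CVO starts.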

Version-Release number of selected component (if applicable):

discovered in 4.10 -> 4.11 upgrade jobs

How reproducible:

The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.

Steps to Reproduce:

Inspect the CVO log and E2E logs from failing jobs with the provided check-cvo.py helper:

$ ./check-cvo.py cvo.log && echo PASS || echo FAIL

Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades make the original problematic behavior more likely to surface)

Actual results:

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
Upgradeability checks:           5
Upgradeability check cache hits: 12
FAIL

Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.

Expected result

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
Upgradeability checks:           12
Upgradeability check cache hits: 11
PASS

The actual numbers are not relevant; the lack of failures is. (If the upgradeability check count is zero, the test is not conclusive, and the script warns about that.)

Additional info:

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go
...
I1227 06:50:59.023190       1 upgradeable.go:122] Cluster current version=4.10.46
I1227 06:50:59.042735       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:51:14.024345       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:53:23.080768       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:56:59.366010       1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12'
Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ...
Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur later than a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.
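A log check along these lines can be sketched in Python as follows. This is not the check-cvo.py helper mentioned above, whose contents are not shown here. It assumes that a "Cluster current version=" line from upgradeable.go marks a real check, which is inferred from the log excerpt, and it ignores midnight rollover.

```python
import re
from datetime import datetime, timedelta

LINE_RE = re.compile(
    r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ upgradeable\.go:\d+\] (.*)$"
)
CHECK_MARKER = "Cluster current version="  # assumed to mark a real check
CACHE_HIT_MARKER = "Upgradeable conditions were recently checked"


def late_cache_hits(lines, limit=timedelta(minutes=1)):
    """Return (timestamp, age) pairs for cache hits older than limit."""
    last_check = None
    late = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        message = match.group(2)
        if message.startswith(CHECK_MARKER):
            last_check = stamp
        elif message.startswith(CACHE_HIT_MARKER) and last_check is not None:
            age = stamp - last_check
            if age > limit:
                late.append((match.group(1), age))
    return late
```

Run over the upgradeable.go lines quoted above with the proposed 1 minute limit, this flags the 06:53:23 cache hit, which comes more than two minutes after the 06:50:59 check.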

I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With a period of 15m, the test pass rate was 4 of 9. With a period of 1m, the test did not fail at all.

Some context in Slack thread

Description of problem:

"create manifests" without an existing "install-config.yaml" is missing 4 YAML files in "<install dir>/openshift", which leads to "create cluster" failure

Version-Release number of selected component (if applicable):

$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. "create manifests"
2. "create cluster" 

Actual results:

1. After "create manifests", 4 YAML files are missing in "<install dir>/openshift" compared with "create manifests" with an existing "install-config.yaml": "99_cloud-creds-secret.yaml", "99_kubeadmin-password-secret.yaml", "99_role-cloud-creds-secret-reader.yaml", and "openshift-install-manifests.yaml".
2. The installation failed without any worker nodes due to error getting credentials secret "gcp-cloud-credentials" in namespace "openshift-machine-api".

Expected results:

1. "create manifests" without an existing "install-config.yaml" should generate the same set of YAML files as "create manifests" with an existing "install-config.yaml".
2. Then the subsequent "create cluster" should succeed.
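The comparison between the two scenarios can be automated with a small Python sketch like the one below. The function name is illustrative, and the directory arguments are whatever install directories were used for the two runs.

```python
import os


def missing_files(reference_dir: str, candidate_dir: str) -> list:
    """List file names present in reference_dir but absent from candidate_dir."""
    reference = set(os.listdir(reference_dir))
    candidate = set(os.listdir(candidate_dir))
    return sorted(reference - candidate)
```

For example, passing the "openshift" subdirectory produced with an existing "install-config.yaml" as the reference and the one produced without it as the candidate lists the files that were not generated.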

Additional info:

The working scenario: "create manifests" with an existing "install-config.yaml"

$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64
$ 
$ mkdir test30
$ cp install-config.yaml test30
$ yq-3.3.0 r test30/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
$ yq-3.3.0 r test30/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0130a
$ ./openshift-install create manifests --dir test30
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
INFO Consuming Install Config from target directory 
WARNING Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated 
INFO Manifests created in: test30/manifests and test30/openshift 
$ 
$ tree test30
test30
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml  
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_cloud-creds-secret.yaml
    ├── 99_kubeadmin-password-secret.yaml
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    ├── 99_openshift-machineconfig_99-worker-ssh.yaml
    ├── 99_role-cloud-creds-secret-reader.yaml
    └── openshift-install-manifests.yaml

2 directories, 31 files
$ 

The problem scenario: "create manifests" without an existing "install-config.yaml", and then "create cluster"

$ ./openshift-install create manifests --dir test31
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform gcp
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
? Project ID OpenShift QE (openshift-qe)
? Region us-central1
? Base Domain qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-0130b
? Pull Secret [? for help] *******
INFO Manifests created in: test31/manifests and test31/openshift
$ 
$ tree test31
test31
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    └── 99_openshift-machineconfig_99-worker-ssh.yaml

2 directories, 27 files
$ 
$ ./openshift-install create cluster --dir test31
INFO Consuming Common Manifests from target directory
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Worker Machines from target directory
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:17PM) for the Kubernetes API at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.25.2+7dab57f up
INFO Waiting up to 30m0s (until 4:28PM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 4:59PM) for the cluster at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded:
ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
ERROR Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_ResourceNotFound::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found
ERROR OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
ERROR WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
ERROR Cluster operator cloud-credential Degraded is True with CredentialsFailing: 7 of 7 credentials requests are failing to sync.
INFO Cluster operator cloud-credential Progressing is True with Reconciling: 0 of 7 credentials requests provisioned, 7 reporting errors.
ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR RouteHealthDegraded: console route is not admitted
ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted 
ERROR Cluster operator control-plane-machine-set Available is False with UnavailableReplicas: Missing 3 available replica(s)
ERROR Cluster operator control-plane-machine-set Degraded is True with NoReadyMachines: No ready control plane machines found
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
ERROR Cluster operator image-registry Available is False with DeploymentNotFound: Available: The deployment does not exist
ERROR NodeCADaemonAvailable: The daemon set node-ca has available replicas
ERROR ImagePrunerAvailable: Pruner CronJob has been created
INFO Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials "openshift-image-registry/installer-cloud-credentials": secret "installer-cloud-credentials" not found
INFO NodeCADaemonProgressing: The daemon set node-ca is deployed
ERROR Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist
ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DNSReady=False (NoZones: The record isn't present in any zones.)
INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 2 updated replica(s) are available...
INFO ).
INFO Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-c68b5786c-prk7x" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Pod "router-default-c68b5786c-ssrv7" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.), DNSReady=False (NoZones: The record isn't present in any zones.), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host  
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107
ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107 because minimum worker replica count (2) not yet met: current running replicas 0, waiting for [jiwei-0130b-25fcm-worker-a-j6t42 jiwei-0130b-25fcm-worker-b-dpw9b jiwei-0130b-25fcm-worker-c-9cdms]
ERROR Cluster operator machine-api Available is False with Initializing: Operator is initializing
ERROR Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
INFO Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is waiting for other operators to become ready
INFO Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
ERROR Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, image-registry, ingress, machine-api, monitoring, storage are not available
$ export KUBECONFIG=test31/auth/kubeconfig 
$ ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          74m     Unable to apply 4.13.0-0.nightly-2023-01-27-165107: some cluster operators are not available
$ ./oc get nodes
NAME                                                 STATUS   ROLES                  AGE   VERSION
jiwei-0130b-25fcm-master-0.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
jiwei-0130b-25fcm-master-1.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
jiwei-0130b-25fcm-master-2.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
$ ./oc get machines -n openshift-machine-api
NAME                               PHASE   TYPE   REGION   ZONE   AGE
jiwei-0130b-25fcm-master-0                                        73m
jiwei-0130b-25fcm-master-1                                        73m
jiwei-0130b-25fcm-master-2                                        73m
jiwei-0130b-25fcm-worker-a-j6t42                                  65m
jiwei-0130b-25fcm-worker-b-dpw9b                                  65m
jiwei-0130b-25fcm-worker-c-9cdms                                  65m
$ ./oc get controlplanemachinesets -n openshift-machine-api
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3                           3             Active   74m
$ 

Please see the attached ".openshift_install.log", the install-config.yaml snippet, and the additional "oc" command outputs.

This is a clone of issue OCPBUGS-9685. The following is the description of the original issue:

The aggregated https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-rt-upgrade-4.14-minor-release-openshift-release-analysis-aggregator/1633554110798106624 job failed.  Digging into one of them:

 

The MCD log at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1633554106595414016/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-p2vf4_machine-config-daemon.log shows:

 

Deployments:
* ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                   Digest: sha256:4f28fbcd049025bab9719379492420f9eaab0426cdbbba43b395eb8421f10a17
                  Version: 413.86.202302230536-0 (2023-03-08T20:10:47Z)
      RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-372.43.1.el8_6
          LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                           kernel-rt-modules-extra
...
E0308 22:11:21.925030 74176 writer.go:200] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd299b2bf3cc98fb70907f152b4281633064fe33527b5d6a42ddc418ff00eec1: error: Importing: remote error: fetching blob: received unexpected HTTP status: 500 Internal Server Error
... 
I0308 22:11:36.959143   74176 update.go:2010] Running: rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra
...
E0308 22:12:35.525156   74176 writer.go:200] Marking Degraded due to: error running rpm-ostree override reset kernel kernel-core kernel-modules kernel-modules-extra --uninstall kernel-rt-core --uninstall kernel-rt-kvm --uninstall kernel-rt-modules --uninstall kernel-rt-modules-extra: error: Package/capability 'kernel-rt-core' is not currently requested
: exit status 1
  

 

Something is going wrong in our retry loop. I think it might be that we don't clear the pending deployment on failure. In other words, we need to run

rpm-ostree cleanup -p 

before we retry.
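
The idea can be sketched as a retry loop that discards the pending deployment after every failed attempt. This is a minimal Python illustration only, not the machine-config-daemon's Go code: `run_with_cleanup`, `run_command`, and the default attempt count are hypothetical names and values. Only the `rpm-ostree cleanup -p` command is taken from the report.

```python
import subprocess
from typing import Callable, Sequence

# Command named in the report: removes the pending deployment.
CLEANUP_PENDING = ["rpm-ostree", "cleanup", "-p"]


def run_command(argv: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(argv), check=True)


def run_with_cleanup(
    update_cmd: Sequence[str],
    attempts: int = 3,  # assumed value, for illustration
    run: Callable[[Sequence[str]], None] = run_command,
) -> None:
    """Try update_cmd up to `attempts` times (hypothetical helper).

    After each failure the pending deployment is cleared, so the next
    attempt does not start from a half-applied state.
    """
    last_error = None
    for _ in range(attempts):
        try:
            run(update_cmd)
            return
        except Exception as err:
            last_error = err
            # Clear the pending deployment before the next attempt.
            run(CLEANUP_PENDING)
    raise RuntimeError(f"update failed after {attempts} attempts") from last_error
```

Passing the command runner in as `run` keeps the loop testable without invoking rpm-ostree on the host.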

 

This is fallout from https://github.com/openshift/machine-config-operator/pull/3580, although I suspect it may have been an issue before too.

 

This is a clone of issue OCPBUGS-10950. The following is the description of the original issue:

Description of problem: 

The "pipelines-as-code-pipelinerun-go" ConfigMap is not used for a Go repository when creating a Pipeline Repository. The "pipelines-as-code-pipelinerun-generic" ConfigMap is used instead.

Prerequisites (if any, like setup, operators/versions):

Install Red Hat Pipeline operator

Steps to Reproduce

  1. Navigate to Create Repository form 
  2. Enter the Git URL `https://github.com/vikram-raj/hello-func-go`
  3. Click on Add

Actual results:

The `pipelines-as-code-pipelinerun-generic` PipelineRun template is shown on the overview page.

Expected results:

The `pipelines-as-code-pipelinerun-go` PipelineRun template should be shown on the overview page.

Reproducibility (Always/Intermittent/Only Once):

Build Details:

4.13

Workaround:

Additional info:

This is a clone of issue OCPBUGS-11389. The following is the description of the original issue:

Description of problem:

In certain cases, an AWS cluster running 4.12 doesn't automatically generate a controlplanemachineset when it's expected to.

It looks like CPMS is looking for `infrastructure.Spec.PlatformSpec.Type` (https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/2aeaaf9ec714ee75f933051c21a44f648d6ed42b/pkg/controllers/controlplanemachinesetgenerator/controller.go#L180). As a result, clusters installed before 4.5, when this field was introduced (https://github.com/openshift/installer/pull/3277), are not able to generate a CPMS.

I believe we should be looking at `infrastructure.Status.PlatformStatus.Type` instead
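
The suggested lookup can be illustrated with a small Python sketch over the JSON form of the Infrastructure object. It is not the operator's Go code: `platform_type` is a hypothetical helper, and `old_cluster` is a made-up object that mimics the case described above, where the spec carries no platform type.

```python
from typing import Any, Mapping, Optional


def platform_type(infrastructure: Mapping[str, Any]) -> Optional[str]:
    """Return status.platformStatus.type, or None when it is unset."""
    status = infrastructure.get("status") or {}
    platform_status = status.get("platformStatus") or {}
    return platform_status.get("type") or None


# Hypothetical object for a cluster installed before the spec field existed.
old_cluster = {
    "spec": {"platformSpec": {}},
    "status": {"platformStatus": {"type": "AWS"}},
}

print(platform_type(old_cluster))  # AWS
```

Reading from status rather than spec means the answer does not depend on which installer version originally wrote the object.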

Version-Release number of selected component (if applicable):

4.12.9

How reproducible:

Consistent

Steps to Reproduce:

1. Install a cluster on a version earlier than 4.5
2. Upgrade cluster through to 4.12
3. Observe "Unable to generate control plane machine set, unsupported platform" error message from the control-plane-machine-set-operator, as well as the missing CPMS object in the openshift-machine-api namespace

Actual results:

No generated CPMS is created, despite the platform being AWS

Expected results:

A generated CPMS existing in the openshift-machine-api namespace

Additional info:


Description of problem:

The customer needs the "IfNotPresent" ImagePullPolicy set for bundle unpacker images that reference images by digest. Currently, the policy is set to "Always" no matter what.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install an operator via bundle referencing an image by digest
2. Check the bundle unpacker pod

Actual results:

Image pull policy will be set to "Always"

Expected results:

Image pull policy will be set to "IfNotPresent" when pulling via digest
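
The expected behavior amounts to choosing the policy from the shape of the image reference. The following is a minimal Python sketch under that assumption, not the actual unpacker code: `pull_policy_for` and the regular expression are hypothetical, and the image names in the usage lines are invented.

```python
import re

# A digest reference ends in "@<algorithm>:<hex>", for example "@sha256:<64 hex chars>".
_DIGEST_RE = re.compile(r"@[a-z0-9]+(?:[+._-][a-z0-9]+)*:[0-9a-fA-F]{32,}$")


def pull_policy_for(image: str) -> str:
    """Pick a pull policy for a bundle image reference (hypothetical helper).

    A digest pins the exact image content, so a locally cached copy is
    always valid; a tag can move, so it is pulled every time.
    """
    if _DIGEST_RE.search(image):
        return "IfNotPresent"
    return "Always"


by_digest = "quay.io/example/bundle@sha256:" + "a" * 64
by_tag = "quay.io/example/bundle:v1.0"
print(pull_policy_for(by_digest))  # IfNotPresent
print(pull_policy_for(by_tag))     # Always
```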

Additional info:

 

Description of problem:

Version-Release number of selected component (if applicable):

  1. 4.12.0-0.nightly-2023-01-31-232828: Works fine
  2. 4.13.0-ec.1 from 2022-12-08 18:31:25: There was a crash because console.pinnedResource was null, this is fixed as part of OCPBUGS-6831
  3. 4.13.0-ec.2 from 2023-01-20 13:19:32: Works fine
  4. 4.13.0-0.nightly-2023-01-24-061922: Never stops loading
  5. 4.13.0-0.nightly-2023-01-31-072358: Never stops loading

How reproducible:
Always

Steps to Reproduce:

  1. Setup a new cluster (via cluster bot)
  2. Create a user with limited access
  3. Login with this user

Actual results:
Console navigation loads and the content area shows a loading indicator that doesn't disappear.

Expected results:
The console should also load and work for normal users.

Additional info:

Provide a migration path to allow users to switch the Image Registry backend from a persistent volume to Swift. Swift allows for concurrent access, thus redundancy of image-registry, and is the recommended IR backend.

In case Swift becomes available in existing clusters, or in case the installation for incidental reasons failed over to Cinder, users must be able to start using Swift as a day-2 operation, while recovering images they may have pushed to the volume-backed registry up to that point.

Description of the problem:

In Staging, BE v2.12: new cluster with static IP (dual stack). On the networking page, I get the "Belongs to majority connected group" validation error on all hosts. I checked the hosts and verified that, from one of them, I can ping the other hosts over both IPv4 and IPv6.

How reproducible:

100%

Steps to reproduce:

1. Create new static IP Dual stack cluster 

2. On the networking page, get the "Belongs to majority connected group" validation error

3.

Actual results:

 

Expected results:

 

Description of the problem:
When a user overrides the discovery ignition, all input is accepted even if the image download will later fail.
 

How reproducible:
100%
 

Steps to reproduce:

1. Submit a large ignition override

2. Download ISO

Actual results:
ISO download fails
 

Expected results:
Ignition override fails

The reserved area for the discovery ISO ignition is 256 KiB.
The ignition is embedded as a gzipped cpio archive created here https://github.com/openshift/assisted-image-service/blob/568b59ec570d2b2ed15f5070bb29697fc01c9525/pkg/isoeditor/ignition.go#L15-L41.

Assisted service should create the archive (by importing the referenced function from the image-service) and check the resulting size to validate that it's not too large. It's likely that the reserved area for the ignition is not going to change so we can just put that as a constant somewhere in assisted-service.

Note that creating the archive will increase memory usage, but will probably not be a large problem as users can already submit an arbitrarily large archive so even if this validation ends up using 2-3x of the override size we shouldn't be too concerned.
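
The proposed validation can be sketched as follows. This is a Python illustration under stated assumptions, not the assisted-service implementation, which would import the Go function from the image-service. The helper names, the `config.ign` entry name, and the hand-written cpio "newc" writer are all assumptions made for the example. Only the 256 KiB limit and the gzipped cpio format come from the text above.

```python
import gzip

# Reserved area for the discovery ISO ignition, as stated above.
IGNITION_AREA_BYTES = 256 * 1024


def _pad4(buf: bytes) -> bytes:
    """Pad with NUL bytes to a multiple of four, as the newc format requires."""
    return buf + b"\0" * (-len(buf) % 4)


def _cpio_entry(name: str, data: bytes, mode: int) -> bytes:
    """Build one cpio "newc" entry: 110-byte header, name, then data."""
    name_bytes = name.encode() + b"\0"
    fields = [
        0,                # inode
        mode,             # file mode
        0, 0,             # uid, gid
        1,                # nlink
        0,                # mtime
        len(data),        # file size
        0, 0, 0, 0,       # dev major/minor, rdev major/minor
        len(name_bytes),  # name size, including the trailing NUL
        0,                # checksum (unused by newc)
    ]
    header = b"070701" + "".join("%08X" % f for f in fields).encode()
    return _pad4(header + name_bytes) + _pad4(data)


def ignition_archive(ignition: bytes) -> bytes:
    """Return the gzipped cpio archive holding the ignition (assumed name config.ign)."""
    cpio = _cpio_entry("config.ign", ignition, 0o100644)
    cpio += _cpio_entry("TRAILER!!!", b"", 0)
    return gzip.compress(cpio)


def validate_ignition_override(ignition: bytes) -> None:
    """Reject an override whose archive would not fit in the reserved area."""
    size = len(ignition_archive(ignition))
    if size > IGNITION_AREA_BYTES:
        raise ValueError(
            f"ignition archive is {size} bytes, limit is {IGNITION_AREA_BYTES}"
        )
```

Checking the compressed archive rather than the raw override matters because a highly compressible override larger than 256 KiB can still fit, while an incompressible one just under the limit may not.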

Description of problem:

OCP 4.12 deployments making use of secondary bridge br-ex1 for CNI fail to start ovs-configuration service, with multiple failures.

Version-Release number of selected component (if applicable):

Openshift 4.12.0-rc.0 (2022-11-10)

How reproducible:

So far, at least one of the four worker nodes has failed every time. It is not always the same node, and sometimes several nodes fail.

Steps to Reproduce:

1. Preparing to configure ipi on the provisioning node
   - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..)

2. configuring the install-config.yaml (attached)
   - provisioningNetwork: enabled
   - machine network: single stack ipv4
   - disconnected installation
   - ovn-kubernetes with hybrid-networking setup
   - LACP bonding setup using MC manifests at day1
     * bond0 -> baremetal 192.168.32.0/24 (br-ex)
     * bond0.662  -> interface for secondary bridge (br-ex1) 192.168.66.128/26
   - secondary bridge defined in /etc/ovnk/extra_bridge using MC Manifest
   
3. deploy the cluster
- Usually the deployment is completed
- Nodes show Ready status, but in some nodes ovs-configuration fails
- Subsequent MC changes fail because the MCP cannot roll out configurations to the nodes with the failure.

NOTE: This impacts testing of our partners Verizon and F5, because we are validating their CNFs before OCP 4.12 release and we need a secondary bridge for CNI.

Actual results:

br-ex1 and all its related ovs-ports and interfaces fail to activate, ovs-configuration service fails. 

Expected results:

br-ex1 and all its related ovs-ports and interfaces activate successfully, and the ovs-configuration service starts successfully. 

Additional info:
1. Nodes and MCP info

$ oc get nodes
NAME       STATUS   ROLES                  AGE     VERSION
master-0   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-1   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-2   Ready    control-plane,master   8h      v1.25.2+f33d98e
worker-0   Ready    worker                 7h26m   v1.25.2+f33d98e
worker-1   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-2   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-3   Ready    worker                 7h25m   v1.25.2+f33d98e
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE                         
master   rendered-master-210a69a0b40162b2f349ea3a5b5819e5   True      False      False      3              3                   3                     0                      7h57m                       
worker   rendered-worker-e8a62c86ce16e98e45e3166847484cf0   False     True       True       4              2                   2                     1                      7h57m 

2. When logging in to the nodes via SSH, we see that ovs-configuration has failed, and the ovs-configuration service logs show the following error (full log attached as worker-0-ovs-configuration.log):

$ ssh core@worker-0
---
Last login: Sat Nov 12 21:33:58 2022 from 192.168.62.10
[systemd]
Failed Units: 3
  NetworkManager-wait-online.service
  ovs-configuration.service
  stalld.service

[core@worker-0 ~]$ sudo journalctl -u ovs-configuration | less
...
Nov 12 15:27:54 worker-0 configure-ovs.sh[8237]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == vlan ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 178: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == bond ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 191: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == team ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 203: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + iface_type=802-3-ethernet
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' '!' '' = 0 ']'

3. We observe that the failed node (worker-0) has the ovs-if-phys1 connection as type ethernet, whereas a working node (worker-1) shows type vlan for the same connection, together with the vlan info.

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection                                                                                                            
[connection]
id=ovs-if-phys1
uuid=aea14dc9-2d0c-4320-9c13-ddf3e64747bf
type=ethernet
autoconnect=false
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=e61c56f7-f3ba-40f7-a1c1-37921fc6c815
slave-type=ovs-port

[ethernet]
cloned-mac-address=B8:83:03:91:C5:2C
mtu=1500

[ovs-interface]
type=system

[core@worker-1 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection
[connection]
id=ovs-if-phys1
uuid=9a019885-3cc1-4961-9dfa-6b7f996556c4
type=vlan
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=877acf53-87d7-4cdf-a078-000af4f962c3
slave-type=ovs-port
timestamp=1668265640

[ethernet]
cloned-mac-address=B8:83:03:91:C5:E8
mtu=9000

[ovs-interface]
type=system

[vlan]
flags=1
id=662
parent=bond0

4. Another problem we observe is that we specifically disable IPv6 in the bond0.662 connection, but the generated connection for br-ex1 has ipv6 method=auto, and it should be disabled.

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/bond0.662.nmconnection 
[connection]
id=bond0.662
type=vlan
interface-name=bond0.662
autoconnect=true
autoconnect-priority=99

[vlan]
parent=bond0
id=662

[ethernet]
mtu=9000

[ipv4]
method=auto
dhcp-timeout=2147483647
never-default=true

[ipv6]
method=disabled
never-default=true

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/br-ex1.nmconnection
[connection]
id=br-ex1
uuid=df67dcd9-4263-4707-9abc-eda16e75ea0d
type=ovs-bridge
autoconnect=false
autoconnect-slaves=1
interface-name=br-ex1

[ethernet]
mtu=1500

[ovs-bridge]

[ipv4]
method=auto

[ipv6]
addr-gen-mode=stable-privacy
method=auto

[proxy]
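
The mismatch in item 4 can be checked mechanically by parsing both keyfiles and comparing their `[ipv6]` `method` values. The sketch below uses Python's standard `configparser`. The helper name `ipv6_method` is hypothetical, and the two inline strings are trimmed copies of the listings above.

```python
import configparser


def ipv6_method(keyfile_text: str) -> str:
    """Return the [ipv6] method of a NetworkManager keyfile, or "" if unset."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(keyfile_text)
    return parser.get("ipv6", "method", fallback="")


BOND_VLAN = """\
[connection]
id=bond0.662
type=vlan

[ipv6]
method=disabled
never-default=true
"""

BR_EX1 = """\
[connection]
id=br-ex1
type=ovs-bridge

[ipv6]
addr-gen-mode=stable-privacy
method=auto
"""

print(ipv6_method(BOND_VLAN), ipv6_method(BR_EX1))  # disabled auto
```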

5. All journals, the must-gather, and some deployment files can be found in our CI console (log in with Red Hat SSO): https://www.distributed-ci.io/jobs/46459571-900f-43df-8798-d36b322d26f4/files
Some of the logs are also attached to make this easier. The worker-0 files are from the node with the ovs issues, and the worker-1 files are from a healthy worker, in case you want to compare.

11_master-bonding.yaml
11_worker-bonding.yaml
install-config.yaml
journal-worker-0.log
journal-worker-1.log
must_gather.tar.gz
sosreport-worker-0-2022-11-12-csbyqfe.tar.xz
sosreport-worker-1-2022-11-12-ubltjdn.tar.xz
worker-0-ip-nmcli-info.log
worker-0-ovs-configuration.log
worker-1-ip-nmcli-info.log
worker-1-ovs-configuration.log

Please let us know if you need any additional information.
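
For point 4 above, the mismatch can be spotted mechanically by comparing the `[ipv6]` section of the two keyfiles. The following is a minimal sketch, not part of the ovs-configuration tooling: it assumes the keyfiles parse as INI with Python's `configparser`, and the two sample strings are shortened excerpts of the dumps shown above. The helper names `ipv6_method` and `ipv6_mismatch` are hypothetical.

```python
import configparser
from typing import Optional

# Shortened excerpts of the keyfiles shown above.
BOND_VLAN = """\
[connection]
id=bond0.662
type=vlan

[ipv6]
method=disabled
never-default=true
"""

BR_EX1 = """\
[connection]
id=br-ex1
type=ovs-bridge

[ipv6]
addr-gen-mode=stable-privacy
method=auto
"""


def ipv6_method(keyfile_text: str) -> Optional[str]:
    """Return the [ipv6] method of a keyfile, or None if it is not set."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(keyfile_text)
    return parser.get("ipv6", "method", fallback=None)


def ipv6_mismatch(source_text: str, generated_text: str) -> bool:
    """True when the generated connection does not keep the source IPv6 method."""
    return ipv6_method(source_text) != ipv6_method(generated_text)


if __name__ == "__main__":
    print(ipv6_method(BOND_VLAN), ipv6_method(BR_EX1))
    print("mismatch:", ipv6_mismatch(BOND_VLAN, BR_EX1))
```

On a node, the same functions could be fed the contents of the files under /etc/NetworkManager/system-connections/ instead of the inline samples.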

Description of problem:

`/etc/hostname` may exist, but be empty. `vsphere-hostname` service should check that the file is not empty instead of just that it exists.

OKD's machine-os-content starting from F37 has an empty /etc/hostname file, which breaks joining workers in vsphere IPI
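
The check being asked for (non-empty rather than merely present) can be sketched as follows. This is a Python illustration of the logic only, not the `vsphere-hostname` service's own code; the function name `has_static_hostname` is hypothetical.

```python
from pathlib import Path


def has_static_hostname(path: str = "/etc/hostname") -> bool:
    """Return True only if the file exists and holds a non-blank hostname."""
    p = Path(path)
    try:
        # An existing file that is empty or whitespace-only does not count.
        return p.is_file() and p.read_text().strip() != ""
    except OSError:
        # Unreadable file: treat it the same as a missing hostname.
        return False


if __name__ == "__main__":
    print(has_static_hostname())
```

With this check, an empty /etc/hostname is treated like a missing one, so the hostname lookup via vmtoolsd would still run.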

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Install OKD w/ workers on vsphere
2.
3.

Actual results:


Workers get hostname resolved using NM

Expected results:


Workers get hostname resolved using vmtoolsd

Additional info:


This is a clone of issue OCPBUGS-10351. The following is the description of the original issue:

Forked off from OCPBUGS-8038

From the must-gather in https://issues.redhat.com/browse/OCPBUGS-8038?focusedId=21912866&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-21912866 we could find the following logs:

2023-03-14T21:17:11.465797715Z I0314 21:17:11.465697       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379"  ID:13492066955251100765 name:"ip-10-0-1-207.ec2.internal" peerURLs:"https://10.0.1.207:2380" clientURLs:"https://10.0.1.207:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.207:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:11.465797715Z I0314 21:17:11.465758       1 machinedeletionhooks.go:151] skip removing the deletion hook from machine mdtest-d7vwd-master-0 since its member is still present with any of: [{InternalIP 10.0.1.207} {InternalDNS ip-10-0-1-207.ec2.internal} {Hostname ip-10-0-1-207.ec2.internal}]
2023-03-14T21:17:13.859516308Z I0314 21:17:13.859419       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:23.870877844Z I0314 21:17:23.870837       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine mdtest-d7vwd-master-0
2023-03-14T21:17:23.875474696Z I0314 21:17:23.875400       1 machinedeletionhooks.go:137] current members [ID:3518515843762260966 name:"etcd-bootstrap" peerURLs:"https://10.0.101.19:2380" clientURLs:"https://10.0.101.19:2379"  ID:8589017757839934213 name:"ip-10-0-1-84.ec2.internal" peerURLs:"https://10.0.1.84:2380" clientURLs:"https://10.0.1.84:2379"  ID:10724356314977705432 name:"ip-10-0-1-133.ec2.internal" peerURLs:"https://10.0.1.133:2380" clientURLs:"https://10.0.1.133:2379" ] with IPSet: map[10.0.1.133:{} 10.0.1.84:{} 10.0.101.19:{}]
2023-03-14T21:17:27.426349565Z I0314 21:17:27.424701       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0
2023-03-14T21:17:31.431703982Z I0314 21:17:31.431615       1 machinedeletionhooks.go:222] successfully removed the guard pod from machine mdtest-d7vwd-master-0

At the same time, we were roughly finishing the bootstrap process:

2023-03-14T21:17:11.510890775Z W0314 21:17:11.510850       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found
2023-03-14T21:17:12.736741689Z W0314 21:17:12.736697       1 bootstrap_teardown_controller.go:140] cluster-bootstrap is not yet finished - ConfigMap 'kube-system/bootstrap' not found

This ended with two nodes (bootstrap + master-0) being torn down, leaving just barely a quorum of two.
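
To make the "barely a quorum" remark concrete: etcd needs a majority of its voting members. The sketch below is plain arithmetic, not code from the etcd operator. It reads the logs above as showing four members at 21:17:11 and three after master-0 was removed, and it assumes the bootstrap member then becomes unreachable when the bootstrap node is torn down.

```python
def quorum(member_count: int) -> int:
    """Smallest majority of a cluster with member_count voting members."""
    return member_count // 2 + 1


def has_quorum(member_count: int, healthy: int) -> bool:
    """True when enough members are reachable to form a majority."""
    return healthy >= quorum(member_count)


if __name__ == "__main__":
    # Three registered members (etcd-bootstrap plus two masters), two reachable.
    print(quorum(3), has_quorum(3, 2), has_quorum(3, 1))
```

With three registered members and two reachable, the cluster still has quorum, but losing one more member would break it.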

The CPMSO was trying to re-create master-0 during installation due to a label change, which left the CEO fairly confused about what it should be doing during the installation process. Helpful timeline from Trevor: https://docs.google.com/document/d/1o9hJT-M4HSbGbHMm5n-LjQlwVKeKfF3Ln5_ruAXWr5I/edit#heading=h.hfowu6nlc7em

Description of problem:

Updating ose-installer-artifacts, os-baremetal-installer and ose-installer images to be consistent with ART. Not a bug per se, just updating the installer images to use Golang-1.19.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

This bug is needed so the changes can be cherry-picked to 4.12 as intended by
https://github.com/openshift/installer/pull/6449
https://github.com/openshift/installer/pull/6448
https://github.com/openshift/installer/pull/6447

This is a clone of issue OCPBUGS-6727. The following is the description of the original issue:

Description of problem:

When creating an OCP cluster with Nutanix infrastructure and using DHCP instead of IPAM network config, the hostname of the VM is not set by DHCP. In this case we need to inject the desired hostname through cloud-init for both control-plane and worker nodes.

Version-Release number of selected component (if applicable):

 

How reproducible:

Reproducible when creating an OCP cluster with Nutanix infrastructure and using DHCP instead of IPAM network config.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-9907. The following is the description of the original issue:

Description of problem:

The alerts table displays incorrect values (Prometheus) in the source column 

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install LokiOperator, Cluster Logging operator and enable the logging view plugin with the alerts feature toggle enabled
2. Add a log-based alert
3. Check the alerts table source in the observe -> alerts section

Actual results:

Incorrect "Prometheus" value is displayed for non log-based alerts

Expected results:

"Platform" or "User" value is displayed for non log-based alerts

Additional info:

 

Description of problem:

Since coreos-installer writes to stdout, its logs are not available for us.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

In DeploymentConfig both the Form view and Yaml view are not in sync

Version-Release number of selected component (if applicable):

4.11.13

How reproducible:

Always

Steps to Reproduce:

1. Create a DC with selector and labels as given below
spec:
  replicas: 1
  selector:
    app: apigateway
    deploymentconfig: qa-apigateway
    environment: qa
  strategy:
    activeDeadlineSeconds: 21600
    resources: {}
    rollingParams:
      intervalSeconds: 1
      maxSurge: 25%
      maxUnavailable: 25%
      timeoutSeconds: 600
      updatePeriodSeconds: 1
    type: Rolling
  template:
    metadata:
      labels:
        app: apigateway
        deploymentconfig: qa-apigateway
        environment: qa

2. Now go to GUI --> Workloads --> DeploymentConfig --> Actions --> Edit DeploymentConfig. Open the Form view first, then switch to the YAML view: the selector and labels show app: ubi8 when they should display app: apigateway

  selector:
    app: ubi8
    deploymentconfig: qa-apigateway
    environment: qa
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ubi8
        deploymentconfig: qa-apigateway
        environment: qa

3. Now in the YAML view, click Reload; the value is displayed as it was when the DC was created (app: apigateway).

Actual results:

 

Expected results:

 

Additional info:

 

Description of the problem:

Users with /18 subnets with VIPs chosen in the second half of those subnets get a validation error saying the VIP is already in use

How reproducible:

Unknown, one partner is hitting this (cluster ID 9f1a0b8d-b1b9-4821-9272-c3cf81728896)
Steps to reproduce:

IDK

Actual results:

Validation fails

Expected results:

Validation should succeed

More information:

It seems like nmap scans only the first half of the subnet (8192 addresses) and ignores the second half. As a result, the report of "free addresses" contains only half the addresses, which leads the service to conclude that the second half is in use when it is not. If the VIP lands in that second half, the service will complain that the VIP is in use and block the installation from proceeding.

Currently, the service asks the agent for all free addresses in a list of CIDRs, up to 8000.  The service stores all of these addresses in the DB and uses it to check if the VIPs are in use.

This is good because the validation can run almost immediately after the user enters the VIPs.  However, it doesn't scale well with big subnets.  We have recently seen a case with more than 8000 addresses, such that the validation doesn't work and we assume the VIPs are not in use.  Additionally, big subnets involve transferring and storing large amounts of data.

We can improve the scale by having the service provide the agent with a list of VIPs to check, and the agent will report on whether or not each one is free.  This is much more scalable but has the drawback of needing to wait around 1 minute for the validation to run due to polling.  We can consider a hybrid approach where we use the current approach for small subnets.
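
The size problem and the hybrid idea can be illustrated with Python's `ipaddress` module. This is a rough sketch under stated assumptions, not the assisted-service implementation: the 8000 cap is the figure quoted above, the 192.168.0.0/18 network is a made-up example, and the names `second_half` and `choose_vip_check` are hypothetical.

```python
import ipaddress

# Cap on the number of free addresses the agent reports, as quoted above.
FREE_ADDRESS_CAP = 8000


def second_half(cidr: str) -> ipaddress.IPv4Network:
    """Return the upper half of a network, e.g. the second /19 of a /18."""
    net = ipaddress.ip_network(cidr)
    return list(net.subnets(prefixlen_diff=1))[1]


def choose_vip_check(cidr: str) -> str:
    """Hybrid approach: free-address list for small subnets, per-VIP check otherwise."""
    net = ipaddress.ip_network(cidr)
    return "free-list" if net.num_addresses <= FREE_ADDRESS_CAP else "per-vip"


if __name__ == "__main__":
    cidr = "192.168.0.0/18"  # hypothetical example network
    vip = ipaddress.ip_address("192.168.40.10")  # hypothetical VIP
    print(ipaddress.ip_network(cidr).num_addresses)  # a /18 holds 16384 addresses
    print(vip in second_half(cidr), choose_vip_check(cidr))
```

A /18 holds 16384 addresses, so a free-address list capped at 8000 cannot cover it, and a VIP in the upper /19 is the case that needs the per-VIP check.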

This is a clone of issue OCPBUGS-103. The following is the description of the original issue:

Description of problem:
When "Service Binding Operator" is successfully installed in the cluster for the first time, the page will automatically redirect to Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX" " 

Note: This issue only happens when the user installs "Service Binding Operator" for the first time. If the user uninstalls and re-installs the operator, the issue goes away.

Version-Release number of selected components (if applicable):
4.12.0-0.nightly-2022-08-12-053438

How reproducible:
Always

Steps to Reproduce:

  1. Login to OCP web console. Go to Operators -> OperatorHub page
  2. Install "Service Binding Operator", wait until finish, check the page
  3.  

Actual results:
The page will redirect to Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX" " 
 
Expected results:
The page should stay on the install page, with the message "Installed operator- ready for use"

Additional info:

Please find the attached snap for more details 

Description of problem:

Customer has noticed that object count quotas ("count/*") do not work for certain objects in ClusterResourceQuotas. For example, the following ResourceQuota works as expected:

~~~
apiVersion: v1
kind: ResourceQuota
metadata:
[..]
spec:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
status:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
  used:
    count/routes.route.openshift.io: "0"
    count/servicemonitors.monitoring.coreos.com: "1"
    pods: "4"
~~~

However when using "count/servicemonitors.monitoring.coreos.com" in ClusterResourceQuotas, this does not work (note the missing "used"):

~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
[..]
spec:
  quota:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
  selector:
    annotations:
      openshift.io/requester: kube:admin
status:
  namespaces:
[..]
  total:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
    used:
      count/routes.route.openshift.io: "0"
      pods: "4"
~~~
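
A quick way to list the affected keys is to compare `hard` against `used` in the quota status. The sketch below works on the `status.total` block shown above, loaded as a plain dict; the helper name `untracked_quota_keys` is hypothetical and this is not part of any OpenShift tooling.

```python
def untracked_quota_keys(status_total: dict) -> list:
    """Return the quota keys that have a hard limit but no usage reported."""
    hard = status_total.get("hard", {})
    used = status_total.get("used", {})
    return sorted(key for key in hard if key not in used)


if __name__ == "__main__":
    # Values copied from the ClusterResourceQuota status above.
    total = {
        "hard": {
            "count/routes.route.openshift.io": "900",
            "count/servicemonitors.monitoring.coreos.com": "100",
            "count/simon.krenger.ch": "100",
            "pods": "100",
        },
        "used": {
            "count/routes.route.openshift.io": "0",
            "pods": "4",
        },
    }
    for key in untracked_quota_keys(total):
        print(key)
```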

This behaviour does not only apply to "servicemonitors.monitoring.coreos.com" objects, but also to other objects, such as:

- count/kafkas.kafka.strimzi.io: '0'
- count/prometheusrules.monitoring.coreos.com: '100'
- count/servicemonitors.monitoring.coreos.com: '100'

The debug output for kube-controller-manager shows the following entries, which may or may not be related:

~~~
$ oc logs kube-controller-manager-ip-10-0-132-228.eu-west-1.compute.internal | grep "servicemonitor"
I0511 15:07:17.297620 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.297630 1 resource_quota_monitor.go:181] QuotaMonitor using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors"
I0511 15:07:17.297642 1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for servicemonitors.monitoring.coreos.com
[..]
I0511 15:07:17.486279 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.486297 1 graph_builder.go:176] using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors", kind "monitoring.coreos.com/v1, Kind=ServiceMonitor"
~~~

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.15

How reproducible:

Always

Steps to Reproduce:

1. On an OCP 4.12 cluster, create the following ClusterResourceQuota:

~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: case-03509174
spec:
  quota: 
    hard:
      count/servicemonitors.monitoring.coreos.com: "100"
      pods: "100"
  selector:
    annotations: 
      openshift.io/requester: "kube:admin"
~~~

2. As "kubeadmin", create a new project and deploy one new ServiceMonitor, for example: 

~~~
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: simon-servicemon-2
  namespace: simon-1
spec:
  endpoints:
    - path: /metrics
      port: http
      scheme: http
  jobLabel: component
  selector:
    matchLabels:
      deployment: echoenv-1
~~~

Actual results:

The "used" field for ServiceMonitors is not populated in the ClusterResourceQuota for certain objects. It is unclear if these quotas are enforced or not

Expected results:

ClusterResourceQuota for ServiceMonitors is updated and enforced

Additional info:

* Must-gather for a cluster showing this behaviour (added debug for kube-controller-manager) is available here: https://drive.google.com/file/d/1ioEEHZQVHG46vIzDdNm6pwiTjkL9QQRE/view?usp=share_link
* Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1683876047243989

Description of problem:

Cluster ingress operator creates router deployments with affinity rules when running in a cluster with non-HA infrastructure plane (InfrastructureTopology=="SingleReplica") and "NodePortService" endpoint publishing strategy. With only one worker node available, rolling update of router-default stalls.

Version-Release number of selected component (if applicable):

All

How reproducible:

Create a single worker node cluster with "NodePortService" endpoint publishing strategy and try to restart the default router. Restart will not go through.

Steps to Reproduce:

1. Create an OCP cluster with an HA control plane (ControlPlaneTopology=="HighlyAvailable"/"External") and a single worker node (InfrastructureTopology=="SingleReplica") using the "NodePortService" endpoint publishing strategy. The operator will create the "router-default" deployment with a "podAntiAffinity" block, even though ingress pods can be scheduled on only one node:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  ...
  name: router-default
  namespace: openshift-ingress
  ...
spec:
  ...
  replicas: 1
  ...
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 50%
    type: RollingUpdate
  template:
    ...
    spec:
      affinity:
        ...
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: ingresscontroller.operator.openshift.io/deployment-ingresscontroller
                operator: In
                values:
                - default
              - key: ingresscontroller.operator.openshift.io/hash
                operator: In
                values:
                - 559d6c97f4
            topologyKey: kubernetes.io/hostname
...
```

2. Restart the default router

```
oc rollout restart deployment router-default -n openshift-ingress
```
 

Actual results:

Deployment restart does not complete and hangs forever:

```
oc get po -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-58d88f8bf6-cxnjk   0/1     Pending   0          2s
router-default-5bb8c8985b-kdg92   1/1     Running   0          2d23h
```
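
The stall follows from how the rolling update percentages resolve for a single replica. Kubernetes rounds maxSurge up and maxUnavailable down, so with replicas=1 the deployment may add one pod but may not remove the old one first. The sketch below is a simplified model of that reasoning, not scheduler or operator code; it assumes, as this report describes, that the required anti-affinity keeps a new router pod off any node that already runs one. The function names are hypothetical.

```python
import math


def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int):
    """Resolve percentage settings the way a Deployment does: surge up, unavailable down."""
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return surge, unavailable


def rollout_can_progress(replicas: int, schedulable_nodes: int,
                         max_surge_pct: int, max_unavailable_pct: int) -> bool:
    """Simplified model: at most one router pod per node (hostname anti-affinity)."""
    surge, unavailable = rollout_bounds(replicas, max_surge_pct, max_unavailable_pct)
    if unavailable > 0:
        # An old pod may be removed first, which frees its node for the new pod.
        return True
    # Otherwise a surge pod must be scheduled, and it needs a node without a router pod.
    return surge > 0 and schedulable_nodes > replicas


if __name__ == "__main__":
    print(rollout_bounds(1, 25, 50))
    print(rollout_can_progress(1, 1, 25, 50), rollout_can_progress(1, 2, 25, 50))
```

For the deployment above (replicas: 1, maxSurge: 25%, maxUnavailable: 50%) the bounds resolve to one surge pod and zero unavailable pods, which matches the Pending pod in the output.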

Expected results:

Deployment restart completes

Additional info:

 

Description of problem:

Pods in the openshift-marketplace namespace cause PodSecurityViolation alerts in a vanilla OpenShift cluster

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-04-203333

How reproducible:

100%

Steps to Reproduce:

1. Install a fresh cluster
2. check the alerts in the console

Actual results:

PodSecurityViolation alert is present

Expected results:

No alerts

Additional info:

I'll provide a filtered version of the audit logs containing the violations

This is a clone of issue OCPBUGS-10910. The following is the description of the original issue:

Description of problem:

The network-tools image stream is missing in the cluster samples. It is needed for CI tests.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

E2E test cases for the knative and pipeline packages have been disabled in CI due to installation issues with the respective operators.
The tests have to be re-enabled once a new operator version is available or the issue is resolved.

References:
https://coreos.slack.com/archives/C6A3NV5J9/p1664545970777239

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

The current "description" annotation of the PrometheusRuleFailures alert doesn't provide much context about what's happening and what to do next.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

always

Steps to Reproduce:

1. Deploy a PrometheusRule object in the openshift-monitoring namespace that fails to evaluate (annotation with invalid Go template for instance).
2.
3.

Actual results:

The PrometheusRuleFailures alert fires, but its "description" annotation doesn't provide much information about what to do next.

Expected results:

Detailed instructions.

Additional info:

https://coreos.slack.com/archives/C0VMT03S5/p1669658109764129

 

 

Description of problem:

We shouldn't enforce PSa in 4.13, neither by label sync nor by global cluster config.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

100%

Steps to Reproduce:

As a cluster admin:
1. create two new namespaces/projects: pokus, openshift-pokus
2. as a cluster-admin, attempt to create a privileged pod in both the namespaces from 1.

Actual results:

pod creation is blocked by pod security admission

Expected results:

only a warning about pod violating the namespace pod security level should be emitted

Additional info:

This is currently a noop for 4.14

Description of problem:

When users resize their browser to a small size, the deployment details page on the Topology page covers the drop-down list component, which prevents the user from using the drop-down list. All content of the dropdown list is covered.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-24-103753

How reproducible:

Always

Steps to Reproduce:

1. Log in to OCP, go to the developer perspective -> Topology page
2. Click and open one resource (e.g. a deployment), and make sure the resource sidebar is open
3. Resize the browser window to a small size
4. Check whether the dropdown list component is covered

Actual results:

The whole dropdown list component is covered by the deployment details page (see attachment for more details)

Expected results:

The dropdown list component should be displayed on top, and it should work even when the window is small

Additional info:

 

Description of the problem:
Assisted service fails to create ignition with mcs cert because configmaps "root-ca" is not found in hypershift spoke cluster.
Installed a 4.12 hypershift spoke cluster using MCE 2.2, and the assisted service logs show this repeating error: 

time="2022-11-29T13:30:59Z" level=error msg="failed to create ignition with mcs cert" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).ensureMCSCert" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:1177" bare_metal_host=hyper-worker-0-0-bmh bare_metal_host_namespace=hyper-0 error="configmaps \"root-ca\" not found" go-id=800 request_id=32d5a8d4-ca1c-4a5f-9a3f-9a89b1f2c0d9 

MCE version: 2.2.0-DOWNANDBACK-2022-11-28-09-21-11
 
$ oc version
Client Version: 4.12.0-0.nightly-2022-11-25-185455
Kustomize Version: v4.5.7
Kubernetes Version: v1.25.2+4bd0702
 
How reproducible:
100%

Steps to reproduce:

1. 

2.

3.

Actual results:

The assisted service is looking for "root-ca" configmap on the spoke, but the hypershift spoke doesn't have such configmap.

Expected results:

Add as a fallback a configmap that exists on the spoke when "root-ca" is not there 

 

 

 

Description of problem:

If you try to deploy with the Internal publishing strategy, and you either already have a public gateway or have already permitted the VPC subnet to the DNS service, the deploy will always fail.

Version-Release number of selected component (if applicable):

 

How reproducible:

Easily

Steps to Reproduce:

1. Add a public gateway to VPC network and/or add VPC subnet to permitted DNS networks
2. Set publish strategy to Internal
3. Deploy

Actual results:

Deploy fails

Expected results:

If the resources exist simply skip trying to create them.

Additional info:

Fix here https://github.com/openshift/installer/pull/6481

Description of problem:

After the cluster install finished, the wait-for install-complete command didn't exit as expected.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Get the latest agent-installer and build image
git clone https://github.com/openshift/installer.git
cd installer/
hack/build.sh
Edit agent-config and install-config yaml file
Create the agent.iso image:
OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift-release-dev/ocp-release:4.12.0-ec.3-x86_64 bin/openshift-install agent create image --log-level debug

2. Install SNO cluster
virt-install --connect qemu:///system -n control-0 -r 33000 --vcpus 8 --cdrom ./agent.iso --disk pool=installer,size=120 --boot uefi,hd,cdrom --os-variant=rhel8.5 --network network=default,mac=52:54:00:aa:aa:aa --wait=-1 

3. Run 'bin/openshift-install agent wait-for bootstrap-complete --log-level debug'; the command finished as expected.

4. After 'bootstrap' completion, run 'bin/openshift-install agent wait-for install-complete --log-level debug'; the command didn't finish as expected.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The test:

test=[sig-storage] Volume limits should verify that all nodes have volume limits [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]

Is hard failing on aws and gcp techpreview clusters:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=%5Bsig-storage%5D%20Volume%20limits%20should%20verify%20that%20all%20nodes%20have%20volume%20limits%20%5BSkipped%3ANoOptionalCapabilities%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D%20%5BSuite%3Ak8s%5D

The failure message is consistently:

fail [github.com/onsi/ginkgo/v2@v2.1.5-0.20220909190140-b488ab12695a/internal/suite.go:612]: Dec 15 09:07:51.278: Expected volume limits to be set
Ginkgo exit error 1: exit with code 1

Sample failure:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.12-ocp-e2e-aws-ovn-arm64-techpreview/1603313676431921152

A fix for this will bring several jobs back to life, but they do span 4.12 and 4.13.

job=periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.13-e2e-gcp-sdn-techpreview=all
job=periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-aws-ovn-arm64-techpreview=all
job=periodic-ci-openshift-multiarch-master-nightly-4.12-ocp-e2e-aws-ovn-arm64-techpreview=all

Description of problem:

When adding an app via 'Add from git repo' my repo, which works with StoneSoup, throws an error around the contents of the devfile

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to Dev viewpoint
2. Click +Add
3. Choose 'Import from Git'
4. Enter 'https://github.com/utherp0/bootcampapp'

Actual results:

"Import is not possible. o.components is not iterable"

Expected results:

The Devfile works with StoneSoup

Additional info:

Devfile at https://github.com/utherp0/hacexample/blob/main/devfile.yaml

Currently, when the agent encounters seemingly irrecoverable errors, it sleeps forever.

This is not ideal because we're not confident that those errors are truly irrecoverable, and retrying might save the day. To avoid generating too much noise from such agents, the retry delay algorithm should use exponential backoff.
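
A capped exponential backoff of the kind proposed here could look like the following. This is a generic Python sketch, not the agent's code; the base delay, factor and cap are illustrative values, and the names `backoff_delays` and `retry` are hypothetical.

```python
import itertools
import time


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 300.0):
    """Yield an endless sequence of delays that grows by factor up to cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def retry(operation, delays, sleep=time.sleep):
    """Call operation until it succeeds, sleeping for the next delay after each failure."""
    for delay in delays:
        try:
            return operation()
        except Exception:
            # The error may be temporary (proxy returning 404/403, degraded DB, ...).
            sleep(delay)


if __name__ == "__main__":
    print(list(itertools.islice(backoff_delays(1, 2, 10), 6)))
```

Because the delays are capped rather than unbounded, an agent keeps retrying at a low, steady rate instead of stopping permanently.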

 

Slack that sparked this ticket:

TL;DR a variety of API-side errors or configurations could cause a temporary condition where the API or some proxy in front of it returns a 404, 403, etc. Or the service itself may temporarily have failed/degraded/incorrect DB access, failed auth system, etc. that results in a similar response. We do not want such a situation to cause every agent to permanently stop.

https://coreos.slack.com/archives/CUPJTHQ5P/p1659975670378779 

https://coreos.slack.com/archives/CUPJTHQ5P/p1659970512620689?thread_ts=1659969753.223169&cid=CUPJTHQ5P 

Description of the problem:

In stage, trying to set the "integrate with platform" toggle to off, after it was on, returns the following:
Failed to update the cluster - Got invalid platform (baremetal) and/or user-managed-networking (<nil>)
It seems that the PATCH request sent to switch off the platform integration (trying to set the platform to baremetal) gets a 400 response.

How reproducible:

100%

Steps to reproduce:

1. Create cluster and boot 3 or more vmware hosts

2. switch on "Integrate with platform" toggle

3. Switch off the toggle

Actual results:

 

Expected results:

This is a clone of issue OCPBUGS-7836. The following is the description of the original issue:

Description of problem:

The MCDaemon has a codepath for "pivot" used in older versions, and then as part of solutions articles to initiate a direct pivot to an ostree version, mostly used when things fail.

As of 4.12 this codepath should no longer work due to us switching to new format OSImage, so we should fully deprecate it.

This is likely where it fails:
https://github.com/openshift/machine-config-operator/blob/ecc6bf3dc21eb33baf56692ba7d54f9a3b9be1d1/pkg/daemon/rpm-ostree.go#L248

Version-Release number of selected component (if applicable):

4.12+

How reproducible:

Not sure but should be 100%

Steps to Reproduce:

1. Follow https://access.redhat.com/solutions/5598401
2.
3.

Actual results:

fails

Expected results:

MCD telling you pivot is deprecated

Additional info:

 

Description of problem:

Liveness probes of ipsec pods fail in large clusters. Currently the command that is executed in the ipsec container is
ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
The problem is with the "ipsec/status" subcommand. In clusters with a high node count, this command returns a list of all the node daemons of the cluster, so as the node count rises, the completion time of the command rises too.

This makes the main command

ovs-appctl -t ovs-monitor-ipsec

hang until the subcommand has finished.

The liveness and readiness probe values are hardcoded in the manifest of the ipsec container [https://github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml], so the container's liveness probe timeout of 60 seconds becomes insufficient as the node list grows. As a result, a cluster with 170+ nodes had 15+ ipsec pods in a CrashLoopBackOff state.
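
The failure mode is that of any exec-style probe with a fixed timeout: once the command takes longer than the timeout, the probe counts as failed regardless of the command's eventual result. The sketch below mimics that behaviour with `subprocess.run`; it is an illustration only, not how the kubelet implements probes, and the function name `probe_succeeds` is hypothetical.

```python
import subprocess
import sys


def probe_succeeds(command, timeout_seconds: float) -> bool:
    """Run a probe command; a timeout or a non-zero exit code counts as failure."""
    try:
        completed = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The command was still running when the timeout expired.
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-ins for a status command whose runtime grows with the node count.
    fast = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(3)"]
    print(probe_succeeds(fast, 10), probe_succeeds(slow, 0.5))
```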

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.10, but I think the same will be visible in other versions too.

How reproducible:

I was not able to reproduce this because an extremely high amount of resources is needed, and I think there is no point as we have already spotted the issue.

Steps to Reproduce:

1. Install an OpenShift cluster with IPsec enabled
2. Scale to 170 or more nodes
3. Notice that the ipsec pods start entering a CrashLoopBackOff state with failed liveness/readiness probes.

Actual results:

IPsec pods are stuck in a CrashLoopBackOff state

Expected results:

IPsec pods work normally

Additional info:

We provided a workaround in which the CVO and CNO operators are scaled to 0 replicas so that the liveness probe timeout can be raised to 600, which recovered the cluster.
As a next step, the customer will try to reduce the node count, restore the default liveness timeout value and bring the operators back, to see whether the cluster stabilizes.

 

Description of problem:

https://github.com/openshift/cluster-authentication-operator/pull/587 addresses an issue in which the auth operator goes degraded when the console capability is not enabled. The result is that the console assetPublicURL is not configured when the console is disabled. However, if the console capability is later enabled on the cluster, there is no logic in place to ensure the auth operator detects this and performs the configuration.

Manually restarting the auth operator will address this, but we should have a solution that handles it automatically.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Install a cluster w/o the console cap
2. Inspect the auth configmap, see that assetPublicURL is empty
3. Enable the console capability, wait for console to start up
4. Inspect the auth configmap and see it is still empty

Actual results:

assetPublicURL does not get populated

Expected results:

assetPublicURL is populated once the console is enabled

Additional info:


Description of problem:

The openshift-apiserver, openshift-oauth-apiserver and kube-apiserver pods cannot validate the certificate when trying to reach etcd and report certificate validation errors:

}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
W1018 11:36:43.523673      15 logging.go:59] [core] [Channel #186 SubChannel #187] grpc: addrConn.createTransport failed to connect to {
  "Addr": "[2620:52:0:198::10]:2379",
  "ServerName": "2620:52:0:198::10",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
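The handshake error above says the address being dialed is not among the IP addresses the certificate is valid for. The sketch below reproduces only that comparison, using the addresses from the log. It is an illustration of the check, not the Go TLS implementation, and the function name is hypothetical.

```python
import ipaddress


def ip_covered_by_sans(target, ip_sans):
    """Return True if `target` equals one of the certificate's IP SANs.

    Addresses are parsed first, so different textual spellings of the
    same address still compare equal.
    """
    wanted = ipaddress.ip_address(target)
    return any(ipaddress.ip_address(san) == wanted for san in ip_sans)


# SAN list and dialed address copied from the log line above.
sans = ["::1", "127.0.0.1", "::1", "fd69::2"]
print(ip_covered_by_sans("2620:52:0:198::10", sans))  # False
print(ip_covered_by_sans("fd69::2", sans))            # True
```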

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-18-041406

How reproducible:

100%

Steps to Reproduce:

1. Deploy SNO with single stack IPv6 via ZTP procedure

Actual results:

Deployment times out and some of the operators aren't deployed successfully.

NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-10-18-041406   False       False         True       124m    APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
baremetal                                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      112m    
cloud-controller-manager                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
cloud-credential                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
cluster-autoscaler                         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
config-operator                            4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
console                                                                                                                      
control-plane-machine-set                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
dns                                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
etcd                                       4.12.0-0.nightly-2022-10-18-041406   True        False         True       121m    ClusterMemberControllerDegraded: could not get list of unhealthy members: giving up getting a cached client after 3 tries
image-registry                             4.12.0-0.nightly-2022-10-18-041406   False       True          True       104m    Available: The registry is removed...
ingress                                    4.12.0-0.nightly-2022-10-18-041406   True        True          True       111m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
insights                                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      118s    
kube-apiserver                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      102m    
kube-controller-manager                    4.12.0-0.nightly-2022-10-18-041406   True        False         True       107m    GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp [fd02::3c5f]:9091: connect: connection refused
kube-scheduler                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
kube-storage-version-migrator              4.12.0-0.nightly-2022-10-18-041406   True        False         False      117m    
machine-api                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-approver                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-config                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
marketplace                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      116m    
monitoring                                                                      False       True          True       98m     deleting Thanos Ruler Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, deleting UserWorkload federate Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, reconciling Alertmanager Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io alertmanager-main), reconciling Thanos Querier Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus API Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io prometheus-k8s), prometheuses.monitoring.coreos.com "k8s" not found
network                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
node-tuning                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
openshift-apiserver                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      104m    
openshift-controller-manager               4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
openshift-samples                                                               False       True          False      103m    The error the server was unable to return a response in the time allotted, but may still be processing the request (get imagestreams.image.openshift.io) during openshift namespace cleanup has left the samples in an unknown state
operator-lifecycle-manager                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-10-18-041406   True        False         False      106m    
service-ca                                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
storage                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m  

Expected results:

Deployment succeeds without issues.

Additional info:

I was unable to run must-gather so attaching the pods logs copied from the host file system.

Sample archive with both resources:

archives/compressed/3c/3cc4318d-e564-450b-b16e-51ef279b87fa/202209/30/200617.tar.gz

Sample query to find more archives:

with t as (
  select
    cluster_id,
    file_path,
    json_extract_scalar(content, '$.kind') as kind
  from raw_io_archives
  where date = '2022-09-30' and file_path like 'config/storage/%'
)
select cluster_id, count(*) as cnt
from t
group by cluster_id
order by cnt desc;

Description of problem:

Since way back in 4.8, we've had a banner with "To request update recommendations, configure a channel that supports your version" when ClusterVersion has RetrievedUpdates=False. But that's only one of several reasons we could be RetrievedUpdates=False. Can we pivot to passing through the ClusterVersion condition message?

Version-Release number of selected component (if applicable):

4.8 and later.

How reproducible:

100%

Steps to Reproduce:

1. Launch a cluster-bot cluster like 4.11.12.
2. Set a channel with oc adm upgrade channel stable-4.11.
3. Scale down the CVO with oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator.
4. Patch in a RetrievedUpdates condition with:

$ CONDITIONS="$(oc get -o json clusterversion version | jq -c '[.status.conditions[] | if .type == "RetrievedUpdates" then .status = "False" | .message = "Testing" else . end]')"
$ oc patch --subresource status clusterversion version --type json -p "[{\"op\": \"add\", \"path\": \"/status/conditions\", \"value\": ${CONDITIONS}}]"

5. View the admin console at /settings/cluster.

Actual results:

Advice about configuring the channel (but it's already configured).

Expected results:

See the message you patched into the RetrievedUpdates condition.
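The jq filter and JSON patch from step 4 can be sketched in Python to make the transformation explicit: every condition is kept, and only the one whose type is RetrievedUpdates gets a new status and message. The sample conditions below are hypothetical and the function names are illustrative.

```python
import json


def override_condition(conditions, cond_type, status, message):
    """Return a copy of `conditions` in which entries of `cond_type` have
    their status and message replaced; all other entries are unchanged."""
    updated = []
    for cond in conditions:
        if cond.get("type") == cond_type:
            cond = {**cond, "status": status, "message": message}
        updated.append(cond)
    return updated


def status_conditions_patch(conditions):
    """Build the JSON patch document that is passed to `oc patch --type json`."""
    return [{"op": "add", "path": "/status/conditions", "value": conditions}]


if __name__ == "__main__":
    # Hypothetical sample data standing in for `oc get -o json clusterversion version`.
    sample = [
        {"type": "Available", "status": "True", "message": "Done applying"},
        {"type": "RetrievedUpdates", "status": "True", "message": ""},
    ]
    patched = override_condition(sample, "RetrievedUpdates", "False", "Testing")
    print(json.dumps(status_conditions_patch(patched)))
```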

Description of problem:

Prometheus continuously restarts due to slow WAL replay

Version-Release number of selected component (if applicable):

openshift - 4.11.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-8282. The following is the description of the original issue:

Description of problem:

We should disable the netlink mode of the netclass collector in Node Exporter. The netlink mode of the netclass collector was introduced into Node Exporter in 4.13. When the netlink mode is used, several metrics become unavailable. Disabling it avoids confusing users who upgrade their OCP cluster to a new version and then find several NIC metrics missing.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Using default config of CMO, Node Exporter's netclass collector is running in netlink mode.
The argument `--collector.netclass.netlink` is present in the `node-exporter` container in `node-exporter` daemonset.

Expected results:

Using default config of CMO, Node Exporter's netclass collector is running in classic mode. 
The argument `--collector.netclass.netlink` is absent in the `node-exporter` container in `node-exporter` daemonset.
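The actual and expected results differ only in whether one argument is present on the `node-exporter` container. A minimal sketch of that check over a daemonset manifest loaded as a dictionary is shown below; the function name and the sample manifest are hypothetical.

```python
NETLINK_FLAG = "--collector.netclass.netlink"


def uses_netlink_netclass(daemonset, container_name="node-exporter"):
    """Return True if the named container in the daemonset manifest
    carries the netlink flag among its args."""
    containers = (
        daemonset.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for container in containers:
        if container.get("name") == container_name:
            return NETLINK_FLAG in container.get("args", [])
    return False


if __name__ == "__main__":
    # Hypothetical, heavily trimmed manifest for illustration only.
    manifest = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "node-exporter", "args": ["--collector.netclass.netlink"]}
                    ]
                }
            }
        }
    }
    print(uses_netlink_netclass(manifest))
```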

Additional info:

 

This is a clone of issue OCPBUGS-10864. The following is the description of the original issue:

Description of problem:

APIServer service not selected correctly for PublicAndPrivate when external-dns isn't configured. 
Image: 4.14 Hypershift operator + OCP 4.14.0-0.nightly-2023-03-23-050449

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}'
PublicAndPrivate

    - lastTransitionTime: "2023-03-24T15:13:15Z"
      message: Cluster operators console, dns, image-registry, ingress, insights,
        kube-storage-version-migrator, monitoring, openshift-samples, service-ca are
        not available
      observedGeneration: 3
      reason: ClusterOperatorsNotAvailable
      status: "False"
      type: ClusterVersionSucceeding

services:
  - service: APIServer
    servicePublishingStrategy:
      type: LoadBalancer
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
  - service: Konnectivity
    servicePublishingStrategy:
      type: Route
  - service: Ignition
    servicePublishingStrategy:
      type: Route
  - service: OVNSbDb
    servicePublishingStrategy:
      type: Route

jiezhao-mac:hypershift jiezhao$ oc get service -n clusters-jz-test | grep kube-apiserver
kube-apiserver            LoadBalancer  172.30.211.131  aa029c422933444139fb738257aedb86-9e9709e3fa1b594e.elb.us-east-2.amazonaws.com  6443:32562/TCP         34m
kube-apiserver-private        LoadBalancer  172.30.161.79  ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com  6443:32100/TCP         34m
jiezhao-mac:hypershift jiezhao$

jiezhao-mac:hypershift jiezhao$ cat hostedcluster.kubeconfig | grep server
  server: https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443
jiezhao-mac:hypershift jiezhao$

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
E0324 11:17:44.003589   95300 memcache.go:238] couldn't get current server API group list: Get "https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443/api?timeout=32s": dial tcp 10.0.129.24:6443: i/o timeout

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a PublicAndPrivate cluster without external-dns
2. Access the guest cluster (it should fail)

Actual results:

Unable to access the guest cluster via 'oc get node --kubeconfig=<guest cluster kubeconfig>'; some guest cluster operators are not available

Expected results:

The cluster is up and running, the guest cluster can be accessed via 'oc get node --kubeconfig=<guest cluster kubeconfig>'

Additional info:

 

 

Description of problem:

0000_50_installer_coreos-bootimages.yaml in the OKD payload contains RHCOS image

Version-Release number of selected component (if applicable):

4.13.0

How reproducible:

Always

Steps to Reproduce:

1. oc adm release extract quay.io/openshift/okd:4.12.0-0.okd-2023-02-18-033438
2. Check `0000_50_installer_coreos-bootimages.yaml` contents

Actual results:

RHCOS images are listed

Expected results:

FCOS images are listed

Additional info:

This file is generated by the installer in hack/build-coreos-manifest.go and included in the payload. Currently the script ignores TAGS, where an OKD-specific build can be specified.

Description of problem:

opm alpha render-veneer basic does not accept input piped through stdin

Version-Release number of selected component (if applicable):

zhaoxia@xzha-mac OCP-53869 % opm version
Version: version.Version{OpmVersion:"7bc5831fd", GitCommit:"7bc5831fd6bd1c4f3494a29470f103de3f8f14f3", BuildDate:"2022-09-13T00:29:33Z", GoOs:"darwin", GoArch:"amd64"}

How reproducible:

always

Steps to Reproduce:

1. create catalog-basic-veneer.yaml 
zhaoxia@xzha-mac OCP-53869 % cat catalog-basic-veneer.yaml 
---
schema: olm.package
name: nginx-operator
defaultChannel: stable
---
schema: olm.channel
package: nginx-operator
name: stable
entries:
- name: nginx-operator.v0.0.1
- name: nginx-operator.v1.0.1
  replaces: nginx-operator.v0.0.1
---
schema: olm.bundle
image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
---
schema: olm.bundle
image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1 

2. run "cat catalog-basic-veneer.yaml| opm alpha render-veneer basic"

zhaoxia@xzha-mac OCP-53869 % cat catalog-basic-veneer.yaml| opm alpha render-veneer basic
Error: accepts 1 arg(s), received 0
Usage:
  opm alpha render-veneer basic basic-veneer-file [flags]


Flags:
  -h, --help            help for basic
  -o, --output string   Output format (json|yaml) (default "json")


Global Flags:
      --skip-tls-verify   skip TLS certificate verification for container image registries while pulling bundles
      --use-http          use plain HTTP for container image registries while pulling bundles 

3.

Actual results:

opm alpha render-veneer basic does not accept input piped through stdin

Expected results:

"opm alpha render-veneer basic" should accept input piped through stdin, like "opm alpha render-veneer semver"

Additional info:

zhaoxia@xzha-mac OCP-53869 % opm alpha render-veneer -h
Render a veneer type


Usage:
  opm alpha render-veneer [command]


Available Commands:
  basic       Generate a declarative config blob from a single 'basic veneer' file
  semver      Generate a file-based catalog from a single 'semver veneer' file 
When FILE is '-' or not provided, the veneer is read from standard input
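The behavior requested for the basic veneer, reading from standard input when FILE is '-' or not provided, is a common CLI convention. The sketch below shows that convention in isolation. It is not opm's implementation (opm is written in Go), and the function name is hypothetical.

```python
import io
import sys


def read_veneer(args, stdin=None):
    """Return the veneer text from the file named by args[0], or from
    standard input when the argument is missing or is '-'."""
    stdin = sys.stdin if stdin is None else stdin
    if len(args) > 1:
        raise ValueError("accepts at most 1 arg(s), received %d" % len(args))
    if not args or args[0] == "-":
        return stdin.read()
    with open(args[0], encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Simulates `cat catalog-basic-veneer.yaml | <command>` with no FILE argument.
    piped = io.StringIO("---\nschema: olm.package\nname: nginx-operator\n")
    print(read_veneer([], stdin=piped))
```

Accepting both the missing argument and '-' keeps the piped form and the explicit form interchangeable in scripts.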

Description of problem:

Reviewing https://console-openshift-console.apps.example.com/monitoring/dashboards/grafana-dashboard-api-performance?apiserver=kube-apiserver shows only OpenShift related resources in "etcd Object Count".

This is based on the fact that `etcd_object_counts` was deprecated in `kubernetes` version 1.21 and replaced with `apiserver_storage_objects` (see https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#other-cleanup-or-flake-3).

When running the query `topk(25, max(etcd_object_counts) by (resource))` from the above-mentioned dashboard, only OpenShift resources are shown. Running `topk(25, max(apiserver_storage_objects) by (resource))` instead shows all the expected resources.

This means the "etcd Object Count" panel of the dashboard at "/monitoring/dashboards/grafana-dashboard-api-performance?apiserver=kube-apiserver" should be fixed to use `apiserver_storage_objects` instead.
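For readers unfamiliar with the query shape, `topk(25, max(<metric>) by (resource))` first takes the highest sample per resource across all reporting instances and then keeps the largest entries. The sketch below mimics that aggregation on in-memory data. The sample values are hypothetical, and this illustrates the semantics rather than how Prometheus evaluates the query.

```python
def topk_max_by_resource(samples, k):
    """Group (resource, value) samples by resource, keep the maximum per
    resource, and return the k largest as (resource, value) pairs in
    descending order of value."""
    maxima = {}
    for resource, value in samples:
        if resource not in maxima or value > maxima[resource]:
            maxima[resource] = value
    return sorted(maxima.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical samples: the same resource reported by several apiserver instances.
    samples = [
        ("pods", 120),
        ("pods", 150),
        ("secrets", 90),
        ("routes.route.openshift.io", 30),
    ]
    for resource, count in topk_max_by_resource(samples, 2):
        print(resource, count)
```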

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.13.0-ec.1

How reproducible:

- Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.10
2. Access https://console-openshift-console.apps.example.com/monitoring/dashboards/grafana-dashboard-api-performance?apiserver=kube-apiserver and view "etcd Object Count" 

Actual results:

Only resources from `openshift-apiserver` are being reported. Core `kubernetes` resources and others are missing.

Expected results:

All resources within `etcd` are reported and shown, so that it is clear how many objects of each resource exist in `etcd`

Additional info:

This was raised a while ago in https://bugzilla.redhat.com/show_bug.cgi?id=2110365 and is still a problem in early OpenShift Container Platform 4.13 releases.

 

Description of problem:

The original issue is SDB-3484. When a customer wants to update their pull-secret, the Insights Operator sometimes does not execute the cluster transfer process and reports the message 'no available accepted cluster transfer'. The root cause is that the Insights Operator runs the cluster transfer process every 24 hours, and telemetry runs the registration process every 24 hours. On the AMS side, both call /cluster_registration, which performs the same process, so telemetry may complete the cluster transfer before the Insights Operator does.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Create two OCP clusters.
2. Create a PSR that will help create two 'pending' CTs. The pending CTs will be accepted after ~6 hours.
3. Wait for ~24 hours, check the PSR, and check the logs in IO, and also check the pull-secrets in the clusters.

Actual results:

The PSR is completed, but there are no successful transfer logs in IO, and the pull-secrets in the clusters are not updated.

Expected results:

The transfer process is executed successfully, and the pull-secrets are updated on the clusters.

Additional info:


This is a clone of issue OCPBUGS-11020. The following is the description of the original issue:

Description of problem:

Viewing the OperatorHub details page returns an error page

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-28-180259

How reproducible:

Always on Hypershift Guest cluster

Steps to Reproduce:

1. Visit OperatorHub details page via Administration -> Cluster Settings -> Configuration -> OperatorHub 
2.
3.

Actual results:

Cannot read properties of undefined (reading 'sources')

Expected results:

The page loads successfully

Additional info:

screenshot one: https://drive.google.com/file/d/12cgpChKYuen2v6DWvmMrir273wONo5oY/view?usp=share_link
screenshot two: https://drive.google.com/file/d/1vVsczu7ScIqznoKNsR8V0w4k9bF1xWhB/view?usp=share_link 

The cluster-kube-apiserver-operator CI has been constantly failing for the past week and more specifically the e2e-gcp-operator job because the test cluster ends in a state where a lot of requests start failing with "Unauthorized" errors.

This caused multiple operators to become degraded and tests to fail.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1450/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-gcp-operator/1631333936435040256

Looking at the failures and a must-gather we were able to capture inside of a test cluster, it turned out that the service account issuer could be the culprit here. Because of that we opened https://issues.redhat.com/browse/API-1549.

However, it turned out that disabling TestServiceAccountIssuer didn't resolve the issue and the cluster was still too unstable for the tests to pass.

In a separate attempt we also tried disabling TestBoundTokenSignerController and this time the tests were passing. However, the cluster was still very unstable during the e2e run and the kube-apiserver-operator went degraded a couple of times: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1455/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-gcp-operator/1632871645171421184/artifacts/e2e-gcp-operator/gather-extra/artifacts/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-5cf9d4569-m2spq_kube-apiserver-operator.log.

On top of that, instead of seeing Unauthorized errors, we are now seeing a lot of connection refused errors.

AITRIAGE-3085 uses none-platform (user-managed networking) on a multi-node cluster. This is rather extraordinary and deserves to be noted in the "Configured features:" section by triggering feature usage flags in the database.

Description of problem:

opm serve fails with message:

Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

(The easiest reproducer involves serving an empty catalog)

1. mkdir /tmp/catalog

2. Use the Dockerfile /tmp/catalog.Dockerfile based on the 4.12 docs (https://access.redhat.com/documentation/en-us/openshift_container_platform/4.12/html-single/operators/index#olm-creating-fb-catalog-image_olm-managing-custom-catalogs):
# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM registry.redhat.io/openshift4/ose-operator-registry:v4.12

# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

# Copy declarative config root into image at /configs
ADD catalog /configs

# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs

3. build the image `cd /tmp/ && docker build -f catalog.Dockerfile .`

4. execute an instance of the container in docker/podman `docker run --name cat-run [image-file]`

5. error

Using a dockerfile generated from opm (`opm generate dockerfile [dir]`) works, but includes precache and cachedir options to opm.

 

Actual results:

Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root

Expected results:

opm generates cache in default /tmp/cache location and serves without error

Additional info:

 

 

Description of problem:

The OnDelete update strategy creates two replacement machines when a master machine is deleted

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-11-29-035943

How reproducible:

Not sure; I hit this twice on this template cluster
https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_13/ipi-on-vsphere/versioned-installer-vmc7_techpreview

Steps to Reproduce:

1. Launch a 4.13 cluster on vSphere with TechPreview enabled; we use the automated template: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_13/ipi-on-vsphere/versioned-installer-vmc7_techpreview
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-11-29-035943   True        False         56m     Cluster version is 4.13.0-0.nightly-2022-11-29-035943 

2. Replace the master machines one by one, with indexes 3, 4 and 5
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs13d-rcr7z-master-3       Running                          57m
huliu-vs13d-rcr7z-master-4       Running                          35m
huliu-vs13d-rcr7z-master-5       Running                          12m
huliu-vs13d-rcr7z-worker-ngw2j   Running                          7h12m
huliu-vs13d-rcr7z-worker-p2xd7   Running                          7h12m
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      29m     
baremetal                                  4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h33m   
cloud-controller-manager                   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
cloud-credential                           4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h37m   
cluster-api                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h33m   
cluster-autoscaler                         4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h32m   
config-operator                            4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
console                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      28m     
control-plane-machine-set                  4.13.0-0.nightly-2022-11-29-035943   True        False         False      5h12m   
csi-snapshot-controller                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h33m   
dns                                        4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h32m   
etcd                                       4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h31m   
image-registry                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      74m     
ingress                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h21m   
insights                                   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h26m   
kube-apiserver                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h22m   
kube-controller-manager                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h31m   
kube-scheduler                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h30m   
kube-storage-version-migrator              4.13.0-0.nightly-2022-11-29-035943   True        False         False      74m     
machine-api                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h23m   
machine-approver                           4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h33m   
machine-config                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      27m     
marketplace                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h32m   
monitoring                                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h19m   
network                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
node-tuning                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h32m   
openshift-apiserver                        4.13.0-0.nightly-2022-11-29-035943   True        False         False      30m     
openshift-controller-manager               4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h26m   
openshift-samples                          4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h25m   
operator-lifecycle-manager                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h33m   
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h33m   
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h26m   
platform-operators-aggregated              4.13.0-0.nightly-2022-11-29-035943   True        False         False      20m     
service-ca                                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
storage                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      5h16m   
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs13d-rcr7z-master-3       Running                          77m
huliu-vs13d-rcr7z-master-4       Running                          55m
huliu-vs13d-rcr7z-master-5       Running                          32m
huliu-vs13d-rcr7z-worker-ngw2j   Running                          7h32m
huliu-vs13d-rcr7z-worker-p2xd7   Running                          7h32m 

3. Create the CPMS with the following YAML:
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  replicas: 3
  state: Active
  strategy:
    type: OnDelete
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      metadata: 
        labels:
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
          machine.openshift.io/cluster-api-cluster: huliu-vs13d-rcr7z
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            credentialsSecret:
              name: vsphere-cloud-credentials
            diskGiB: 120
            kind: VSphereMachineProviderSpec
            memoryMiB: 16384
            metadata:
              creationTimestamp: null
            network:
              devices:
              - networkName: qe-segment
            numCPUs: 4
            numCoresPerSocket: 4
            snapshot: ""
            template: huliu-vs13d-rcr7z-rhcos
            userDataSecret:
              name: master-user-data
            workspace:
              datacenter: SDDC-Datacenter
              datastore: WorkloadDatastore
              folder: /SDDC-Datacenter/vm/huliu-vs13d-rcr7z
              resourcePool: /SDDC-Datacenter/host/Cluster-1/Resources
              server: vcenter.sddc-44-236-21-251.vmwarevmc.com

liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlplanemachineset_vsphere.yaml
controlplanemachineset.machine.openshift.io/cluster created
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3         3       3                       Active   9s 

4. Edit the CPMS and change numCPUs to 8 to trigger an update
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      31m     
baremetal                                  4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
cloud-controller-manager                   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h37m   
cloud-credential                           4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h38m   
cluster-api                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
cluster-autoscaler                         4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
config-operator                            4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
console                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      29m     
control-plane-machine-set                  4.13.0-0.nightly-2022-11-29-035943   True        True          False      5h14m   Observed 3 replica(s) in need of update
csi-snapshot-controller                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
dns                                        4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
etcd                                       4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h33m   
image-registry                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      75m     
ingress                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h23m   
insights                                   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h27m   
kube-apiserver                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h24m   
kube-controller-manager                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h32m   
kube-scheduler                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h32m   
kube-storage-version-migrator              4.13.0-0.nightly-2022-11-29-035943   True        False         False      75m     
machine-api                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h24m   
machine-approver                           4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
machine-config                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      28m     
marketplace                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
monitoring                                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h21m   
network                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
node-tuning                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h34m   
openshift-apiserver                        4.13.0-0.nightly-2022-11-29-035943   True        False         False      31m     
openshift-controller-manager               4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h27m   
openshift-samples                          4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h26m   
operator-lifecycle-manager                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h27m   
platform-operators-aggregated              4.13.0-0.nightly-2022-11-29-035943   True        False         False      21m     
service-ca                                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h35m   
storage                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      5h18m   
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                             PHASE     TYPE   REGION   ZONE   AGE
huliu-vs13d-rcr7z-master-3       Running                          79m
huliu-vs13d-rcr7z-master-4       Running                          57m
huliu-vs13d-rcr7z-master-5       Running                          33m
huliu-vs13d-rcr7z-worker-ngw2j   Running                          7h34m
huliu-vs13d-rcr7z-worker-p2xd7   Running                          7h34m

5. Delete the master machines one by one. Deleting huliu-vs13d-rcr7z-master-4 caused two replacement master machines to be created.

liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs13d-rcr7z-master-5
machine.machine.openshift.io "huliu-vs13d-rcr7z-master-5" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs13d-rcr7z-master-3         Running                               79m
huliu-vs13d-rcr7z-master-4         Running                               57m
huliu-vs13d-rcr7z-master-5         Deleting                              33m
huliu-vs13d-rcr7z-master-6b9x7-5   Provisioning                          5s
huliu-vs13d-rcr7z-worker-ngw2j     Running                               7h34m
huliu-vs13d-rcr7z-worker-p2xd7     Running                               7h34m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs13d-rcr7z-master-3         Running                          91m
huliu-vs13d-rcr7z-master-4         Running                          69m
huliu-vs13d-rcr7z-master-6b9x7-5   Running                          12m
huliu-vs13d-rcr7z-worker-ngw2j     Running                          7h46m
huliu-vs13d-rcr7z-worker-p2xd7     Running                          7h46m
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      53m     
baremetal                                  4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h56m   
cloud-controller-manager                   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h59m   
cloud-credential                           4.13.0-0.nightly-2022-11-29-035943   True        False         False      8h      
cluster-api                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h56m   
cluster-autoscaler                         4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h56m   
config-operator                            4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h57m   
console                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      18m     
control-plane-machine-set                  4.13.0-0.nightly-2022-11-29-035943   True        True          False      18m     Observed 2 replica(s) in need of update
csi-snapshot-controller                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h57m   
dns                                        4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h56m   
etcd                                       4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h55m   
image-registry                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      97m     
ingress                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h45m   
insights                                   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h49m   
kube-apiserver                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h46m   
kube-controller-manager                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h54m   
kube-scheduler                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h54m   
kube-storage-version-migrator              4.13.0-0.nightly-2022-11-29-035943   True        False         False      97m     
machine-api                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h46m   
machine-approver                           4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h56m   
machine-config                             4.13.0-0.nightly-2022-11-29-035943   True        False         False      50m     
marketplace                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h56m   
monitoring                                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h42m   
network                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h57m   
node-tuning                                4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h56m   
openshift-apiserver                        4.13.0-0.nightly-2022-11-29-035943   True        False         False      53m     
openshift-controller-manager               4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h49m   
openshift-samples                          4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h48m   
operator-lifecycle-manager                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h57m   
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h57m   
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h49m   
platform-operators-aggregated              4.13.0-0.nightly-2022-11-29-035943   True        False         False      10m     
service-ca                                 4.13.0-0.nightly-2022-11-29-035943   True        False         False      7h57m   
storage                                    4.13.0-0.nightly-2022-11-29-035943   True        False         False      5h40m   
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs13d-rcr7z-master-4
machine.machine.openshift.io "huliu-vs13d-rcr7z-master-4" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs13d-rcr7z-master-3         Running                               101m
huliu-vs13d-rcr7z-master-4         Deleting                              79m
huliu-vs13d-rcr7z-master-6b9x7-5   Running                               22m
huliu-vs13d-rcr7z-master-8h9p9-4   Provisioning                          6s
huliu-vs13d-rcr7z-master-df78v-4   Provisioning                          6s
huliu-vs13d-rcr7z-worker-ngw2j     Running                               7h56m
huliu-vs13d-rcr7z-worker-p2xd7     Running                               7h56m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs13d-rcr7z-master-3         Running                          115m
huliu-vs13d-rcr7z-master-6b9x7-5   Running                          36m
huliu-vs13d-rcr7z-master-8h9p9-4   Running                          14m
huliu-vs13d-rcr7z-master-df78v-4   Running                          14m
huliu-vs13d-rcr7z-worker-ngw2j     Running                          8h
huliu-vs13d-rcr7z-worker-p2xd7     Running                          8h

Actual results:

When a master machine is deleted, two replacement machines are created.

Expected results:

When a master machine is deleted, only one replacement machine is created.
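
The duplicate replacement can be spotted mechanically from the machine names in the listings above. The following is a minimal sketch, assuming replacement control plane machines are named `<cluster>-master-<random>-<index>` as they appear in this report. The helper `replacements_per_index` is hypothetical and not part of any OpenShift tool.

```python
import re
from collections import Counter

# Assumption: replacement machines are named "<prefix>-master-<suffix>-<index>",
# as seen in the listings of this report. Original machines
# ("<prefix>-master-<index>") and workers do not match and are ignored.
REPLACEMENT_NAME = re.compile(r"-master-[a-z0-9]+-(\d+)$")


def replacements_per_index(machine_names):
    """Count replacement control plane machines per index."""
    counts = Counter()
    for name in machine_names:
        match = REPLACEMENT_NAME.search(name)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


# Machine names copied from the final listing in this report.
names = [
    "huliu-vs13d-rcr7z-master-3",
    "huliu-vs13d-rcr7z-master-6b9x7-5",
    "huliu-vs13d-rcr7z-master-8h9p9-4",
    "huliu-vs13d-rcr7z-master-df78v-4",
    "huliu-vs13d-rcr7z-worker-ngw2j",
    "huliu-vs13d-rcr7z-worker-p2xd7",
]
duplicates = {i: n for i, n in replacements_per_index(names).items() if n > 1}
print(duplicates)  # {4: 2}
```

An index that maps to more than one replacement, as index 4 does here, is the symptom described under "Actual results".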

Additional info:

Must-gather 
https://drive.google.com/file/d/1VVxGPW0WNDc3CxiJIg90dAQckEWhYy8i/view?usp=sharing

Description of problem:

Deploying OCP 4.12 on RHV fails when applying Terraform.

 

How reproducible:

Always on iSCSI storage

 

Steps to Reproduce:

Install OCP 4.12 on RHV with iSCSI storage

 

Actual results:

Installation fails when uploading the image, e.g. with the following error:

time="2023-01-18T04:24:03Z" level=debug msg="Changes to Outputs:"
time="2023-01-18T04:24:03Z" level=debug msg="  + tmp_import_vm_id = (known after apply)"
time="2023-01-18T04:24:03Z" level=debug msg="ovirt_disk_from_image.releaseimage[0]: Creating..."
time="2023-01-18T04:24:13Z" level=debug msg="ovirt_disk_from_image.releaseimage[0]: Still creating... [10s elapsed]"
time="2023-01-18T04:24:23Z" level=debug msg="ovirt_disk_from_image.releaseimage[0]: Still creating... [20s elapsed]"
time="2023-01-18T04:24:33Z" level=debug msg="ovirt_disk_from_image.releaseimage[0]: Still creating... [30s elapsed]"
time="2023-01-18T04:24:43Z" level=debug msg="ovirt_disk_from_image.releaseimage[0]: Still creating... [40s elapsed]"
time="2023-01-18T04:24:52Z" level=error
time="2023-01-18T04:24:52Z" level=error msg="Error: Failed to create disk."
time="2023-01-18T04:24:52Z" level=error
time="2023-01-18T04:24:52Z" level=error msg="  with ovirt_disk_from_image.releaseimage[0],"
time="2023-01-18T04:24:52Z" level=error msg="  on main.tf line 16, in resource \"ovirt_disk_from_image\" \"releaseimage\":"
time="2023-01-18T04:24:52Z" level=error msg="  16: resource \"ovirt_disk_from_image\" \"releaseimage\" {"
time="2023-01-18T04:24:52Z" level=error
time="2023-01-18T04:24:52Z" level=error msg="permanent_http_error: non-retryable error encountered while transferring"
time="2023-01-18T04:24:52Z" level=error msg="image for disk 03bac7db-9be0-403b-9d35-ea2f43de4e56 via HTTP request to"
time="2023-01-18T04:24:52Z" level=error msg="https://rhv-node03.cicd.red-chesterfield.com:54322/images/222826ae-b249-402e-937c-74f5ab34bbe8,"
time="2023-01-18T04:24:52Z" level=error msg="giving up (permanent_http_error: unexpected client status code (416) received"
time="2023-01-18T04:24:52Z" level=error msg="for image upload)"
time="2023-01-18T04:24:52Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failure applying terraform for \"image\" stage: failed to create cluster: failed to apply Terraform: exit status 1\n\nError: Failed to create disk.\n\n  with ovirt_disk_from_image.releaseimage[0],\n  on main.tf line 16, in resource \"ovirt_disk_from_image\" \"releaseimage\":\n  16: resource \"ovirt_disk_from_image\" \"releaseimage\" {\n\npermanent_http_error: non-retryable error encountered while transferring\nimage for disk 03bac7db-9be0-403b-9d35-ea2f43de4e56 via HTTP request to\nhttps://rhv-node03.cicd.red-chesterfield.com:54322/images/222826ae-b249-402e-937c-74f5ab34bbe8,\ngiving up (permanent_http_error: unexpected client status code (416) received\nfor image upload)\n"
time="2023-01-18T04:24:53Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=error msg="error provisioning cluster" error="exit status 4" installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=debug msg="Unable to find log storage actuator. Disabling gathering logs." installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=info msg="saving installer output" installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=debug msg="installer console log: level=info msg=Consuming Install Config from target directory\nlevel=info msg=Manifests created in: manifests and openshift\nlevel=warning msg=Found override for release image (quay.io/openshift-release-dev/ocp-release:4.12.0-multi). Please be warned, this is not advised\nlevel=info msg=Consuming OpenShift Install (Manifests) from target directory\nlevel=info msg=Consuming Common Manifests from target directory\nlevel=info msg=Consuming Openshift Manifests from target directory\nlevel=info msg=Consuming Worker Machines from target directory\nlevel=info msg=Consuming Master Machines from target directory\nlevel=info msg=Ignition-Configs created in: . and auth\nlevel=info msg=Consuming Master Ignition Config from target directory\nlevel=info msg=Consuming Worker Ignition Config from target directory\nlevel=info msg=Consuming Bootstrap Ignition Config from target directory\nlevel=info msg=Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.12/builds/412.86.202212081411-0/x86_64/rhcos-412.86.202212081411-0-openstack.x86_64.qcow2.gz?sha256=4304c8f0f7429cfdb534c28aec7ca72e360e608a5df4ddf50f9b4816ff499277'\nlevel=info msg=Creating infrastructure resources...\nlevel=error\nlevel=error msg=Error: Failed to create disk.\nlevel=error\nlevel=error msg=  with ovirt_disk_from_image.releaseimage[0],\nlevel=error msg=  on main.tf line 16, in resource \"ovirt_disk_from_image\" \"releaseimage\":\nlevel=error msg=  16: resource \"ovirt_disk_from_image\" \"releaseimage\" {\nlevel=error\nlevel=error msg=permanent_http_error: non-retryable error encountered while transferring\nlevel=error msg=image for disk 03bac7db-9be0-403b-9d35-ea2f43de4e56 via HTTP request to\nlevel=error msg=https://rhv-node03.cicd.red-chesterfield.com:54322/images/222826ae-b249-402e-937c-74f5ab34bbe8,\nlevel=error msg=giving up (permanent_http_error: unexpected client status code (416) 
received\nlevel=error msg=for image upload)\nlevel=error msg=failed to fetch Cluster: failed to generate asset \"Cluster\": failure applying terraform for \"image\" stage: failed to create cluster: failed to apply Terraform: exit status 1\nlevel=error\nlevel=error msg=Error: Failed to create disk.\nlevel=error\nlevel=error msg=  with ovirt_disk_from_image.releaseimage[0],\nlevel=error msg=  on main.tf line 16, in resource \"ovirt_disk_from_image\" \"releaseimage\":\nlevel=error msg=  16: resource \"ovirt_disk_from_image\" \"releaseimage\" {\nlevel=error\nlevel=error msg=permanent_http_error: non-retryable error encountered while transferring\nlevel=error msg=image for disk 03bac7db-9be0-403b-9d35-ea2f43de4e56 via HTTP request to\nlevel=error msg=https://rhv-node03.cicd.red-chesterfield.com:54322/images/222826ae-b249-402e-937c-74f5ab34bbe8,\nlevel=error msg=giving up (permanent_http_error: unexpected client status code (416) received\nlevel=error msg=for image upload)\nlevel=error\n" installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=debug msg="no additional log fields found" installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=error msg="failed due to install error" error="exit status 4" installID=c9k5l26q
time="2023-01-18T04:24:53Z" level=fatal msg="runtime error" error="exit status 4"

Expected results:

A successfully deployed cluster 

 

Additional info:

Description of problem:

Customer is running machine learning (ML) tasks on OpenShift Container Platform, for which large models need to be embedded in the container image. When building a new container image with large container image layers (>=10GB) and pushing it to the internal image registry, this fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

In the image registry Pod we can see the following error message:

time="2023-01-30T14:12:22.315726147Z" level=error msg="upload resumed at wrong offest: 10485760000 != 10738341637" [..]
time="2023-01-30T14:12:22.338264863Z" level=error msg="response completed with error" err.code="blob upload invalid" err.message="blob upload invalid" [..]

Backend storage is AWS S3. We suspect that this could be the following upstream bug: https://github.com/distribution/distribution/issues/1698
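
The two offsets in the registry log can be checked with some quick arithmetic. The sketch below uses the offsets verbatim from the log line. The 10 MiB chunk size is an assumption that the report does not state. If it holds, the registry resumed after exactly 1000 full chunks, which would be consistent with a truncated part listing. That is a hypothesis, not a confirmed root cause.

```python
# Offsets taken verbatim from the registry log line above.
resumed_offset = 10485760000   # offset the registry resumed at
expected_offset = 10738341637  # offset the upload had actually reached

# Assumption (not stated in the report): uploads are split into 10 MiB chunks.
chunk_size = 10 * 1024 * 1024

full_chunks, remainder = divmod(resumed_offset, chunk_size)
unaccounted_bytes = expected_offset - resumed_offset

print(full_chunks, remainder)  # 1000 0
print(unaccounted_bytes)       # 252581637
```

The resumed offset being an exact multiple of the assumed chunk size, with roughly 240 MiB unaccounted for, suggests the tail of the upload was not seen when the upload was resumed.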

Version-Release number of selected component (if applicable):

Customer encountered the issue on OCP 4.11.20. We reproduced the issue on OCP 4.11.21:

$  oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.11.21
Kubernetes Version: v1.24.6+5658434

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform cluster 4.11.21 on AWS
2. Confirm registry storage is on AWS S3
3. Create a new build including a 10GB file using the following command: `printf "FROM registry.fedoraproject.org/fedora:37\nRUN dd if=/dev/urandom of=/bigfile bs=1M count=10240" | oc new-build -D -`
4. Wait for some time for the build to run

Actual results:

Pushing the new build fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

Expected results:

Push of large container image layers succeeds

Additional info:

BZ reference https://bugzilla.redhat.com/show_bug.cgi?id=2081597

Created attachment 1876869 [details]
Install disk role is not set in host disk list

Description of the problem:
The install disk does not have a Role populated in host expanded detail in disks list. (See attached screenshot)

Release version:

Operator snapshot version:

OCP version:

Browser Info:

Steps to reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

This is a clone of issue OCPBUGS-10333. The following is the description of the original issue:

Description of problem:

Workload annotations are missing from the deployments in the openshift/platform-operator repo.

Missing annotations:

On the Namespace: `workload.openshift.io/allowed: management`

On the workload: `target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'`. This annotation is required for the admission webhook to modify the resource for workload pinning.
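
A check for the two annotations could look roughly like the sketch below. It works on manifests that are already loaded as plain dictionaries, and it assumes the `target.workload.openshift.io/management` annotation belongs on the Deployment's pod template. The function name and the example manifests are hypothetical.

```python
import json

NAMESPACE_ANNOTATION = "workload.openshift.io/allowed"
POD_ANNOTATION = "target.workload.openshift.io/management"


def missing_workload_annotations(namespace, deployment):
    """Return a list of problems; an empty list means both annotations are set."""
    problems = []

    ns_annotations = namespace.get("metadata", {}).get("annotations", {})
    if ns_annotations.get(NAMESPACE_ANNOTATION) != "management":
        problems.append(f"namespace lacks {NAMESPACE_ANNOTATION}: management")

    # Assumption: the annotation is expected on the pod template.
    pod_annotations = (
        deployment.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("annotations", {})
    )
    raw = pod_annotations.get(POD_ANNOTATION)
    try:
        effect = json.loads(raw)["effect"] if raw is not None else None
    except (ValueError, KeyError, TypeError):
        effect = None
    if effect != "PreferredDuringScheduling":
        problems.append(f"pod template lacks {POD_ANNOTATION}")
    return problems


# Hypothetical manifests: the namespace is annotated, the deployment is not.
example_namespace = {
    "metadata": {"annotations": {NAMESPACE_ANNOTATION: "management"}}
}
example_deployment = {"spec": {"template": {"metadata": {}}}}
print(missing_workload_annotations(example_namespace, example_deployment))
```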

Related Enhancements: 
https://github.com/openshift/enhancements/pull/703 
https://github.com/openshift/enhancements/pull/1213

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-2153. The following is the description of the original issue:

When ProjectID is not set, TenantID might be ignored in MAPO.

Context: When setting additional networks in Machine templates, networks can be identified by means of a filter. The network filter has both TenantID and ProjectID as fields. TenantID was ignored.

Steps to reproduce:
Create a Machine or a MachineSet with a template containing a Network filter that sets a TenantID.

```
networks:
- filter:
    id: 'the-network-id'
    tenantId: '123-123-123'
```

One cheap way of testing this could be to pass a valid network ID and set a bogus tenantID. If the machine gets associated with the network, then tenantID has been ignored and the bug is present. If instead MAPO errors, then it means that it has taken tenantID into consideration.
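
The intended behaviour can be sketched as a small matching function. This is not MAPO's code. It assumes that `tenantId` and `projectId` identify the same owner and that `projectId` takes precedence when both are set. The dictionary keys and the function name are hypothetical.

```python
def network_matches(network_filter, network):
    """Return True only if every field set in the filter matches the network."""
    if "id" in network_filter and network_filter["id"] != network["id"]:
        return False
    # Assumption: tenantId and projectId name the same owner, and projectId
    # wins when both are set. The point is that tenantId is still honoured
    # when projectId is absent.
    owner = network_filter.get("projectId") or network_filter.get("tenantId")
    if owner and owner != network["project_id"]:
        return False
    return True


# The test described above: valid network ID, bogus tenant ID.
network = {"id": "the-network-id", "project_id": "real-project"}
bogus = {"id": "the-network-id", "tenantId": "123-123-123"}
print(network_matches(bogus, network))  # False
```

With the bug present, the equivalent of this check would return `True` for the bogus filter because the tenant ID is never compared.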

Description of problem:

Current validation will not accept Resource Groups or DiskEncryptionSets which have upper-case letters.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Attempt to create a cluster/machineset using a DiskEncryptionSet with an RG or Name with upper-case letters

Steps to Reproduce:

1. Create cluster with DiskEncryptionSet with upper-case letters in DES name or in Resource Group name

Actual results:

See error message:

encountered error: [controlPlane.platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format, compute[0].platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format]
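
The symptom can be reproduced with two regular expressions. The installer's actual expression is not quoted in this report, so the lower-case-only pattern below is a hypothetical stand-in. The second pattern follows the Azure naming rule for resource groups: alphanumerics, underscores, parentheses, hyphens and periods, up to 90 characters, not ending with a period.

```python
import re

# Hypothetical lower-case-only pattern, shown only to reproduce the symptom.
LOWERCASE_ONLY = re.compile(r"^[a-z0-9_.()-]{1,90}$")

# Accepts upper- and lower-case letters; the final character may not be a period.
CASE_INSENSITIVE = re.compile(r"^[A-Za-z0-9_.()-]{0,89}[A-Za-z0-9_()-]$")

# Resource group name taken from the error message above.
name = "v4-e2e-V62447568-eastus"
print(bool(LOWERCASE_ONLY.match(name)), bool(CASE_INSENSITIVE.match(name)))
# False True
```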

Expected results:

Create a cluster/machineset using the existing and valid DiskEncryptionSet

Additional info:

I have submitted a PR for this already, but it needs to be reviewed and backported to 4.11: https://github.com/openshift/installer/pull/6513

Description of problem:

The mapi_instance_create_failed metric is not reported when acceleratedNetworking: true is set on Azure

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-12-23-223710

How reproducible:

Always

Steps to Reproduce:

1. Create a machineset with an invalid vmSize and acceleratedNetworking: true
Copy a default machineset, change the name, and change vmSize to an invalid value

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
huliu-az29-bbl8d-master-0                Running   Standard_D8s_v3   westus          176m
huliu-az29-bbl8d-master-1                Running   Standard_D8s_v3   westus          176m
huliu-az29-bbl8d-master-2                Running   Standard_D8s_v3   westus          176m
huliu-az29-bbl8d-worker-invalid1-5j8zb   Failed                                      39m
huliu-az29-bbl8d-worker-invalid2-cntwc   Failed                                      33m
huliu-az29-bbl8d-worker-westus-2jzvd     Running   Standard_D4s_v3   westus          171m
huliu-az29-bbl8d-worker-westus-fb4pz     Running   Standard_D4s_v3   westus          171m
huliu-az29-bbl8d-worker-westus-vw7pw     Running   Standard_D4s_v3   westus          171m
liuhuali@Lius-MacBook-Pro huali-test % 

huliu-az29-bbl8d-worker-invalid1: this machineset does not set acceleratedNetworking: true
huliu-az29-bbl8d-worker-invalid2: this machineset sets acceleratedNetworking: true

2. Check the mapi_instance_create_failed metric in the UI
Only one mapi_instance_create_failed metric is present, for huliu-az29-bbl8d-worker-invalid1-5j8zb

Actual results:

A machine that fails to be created with acceleratedNetworking: true has no mapi_instance_create_failed metric

Expected results:

A machine that fails to be created with acceleratedNetworking: true should have a mapi_instance_create_failed metric
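
The comparison in step 2 can be sketched as a small script: list the machines in the `Failed` phase from `oc get machine` output and subtract the machines that expose the metric. The listing is abridged from the report. The set of machines with a metric is entered by hand from the observation in step 2, and the helper name is hypothetical.

```python
def failed_machines(oc_get_machine_output):
    """Return the names of machines whose PHASE column is 'Failed'."""
    names = []
    for line in oc_get_machine_output.strip().splitlines()[1:]:  # skip header
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "Failed":
            names.append(fields[0])
    return names


# Abridged from the `oc get machine` output in step 1.
listing = """\
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
huliu-az29-bbl8d-master-0                Running   Standard_D8s_v3   westus          176m
huliu-az29-bbl8d-worker-invalid1-5j8zb   Failed                                      39m
huliu-az29-bbl8d-worker-invalid2-cntwc   Failed                                      33m
huliu-az29-bbl8d-worker-westus-2jzvd     Running   Standard_D4s_v3   westus          171m
"""

# Machines that expose mapi_instance_create_failed, as observed in step 2.
with_metric = {"huliu-az29-bbl8d-worker-invalid1-5j8zb"}

missing = sorted(set(failed_machines(listing)) - with_metric)
print(missing)  # ['huliu-az29-bbl8d-worker-invalid2-cntwc']
```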

Additional info:

Test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-36989

This is a clone of issue OCPBUGS-7485. The following is the description of the original issue:

Description of problem:

When creating a Sample Devfile app from the Samples page, the corresponding Topology icon for the app is not set. This issue is not observed when we create a BuilderImage sample from the Samples page.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Create a Sample Devfile App from the Samples Page
2. Go to the Topology Page and check the icon of the app created.

Actual results:

The generic OpenShift logo is displayed

Expected results:

Need to show the corresponding app icon (Golang, Quarkus, etc.)

Additional info:

When creating a BuilderImage sample, the icon is set correctly according to the BuilderImage used.

Current label: app.openshift.io/runtime=dotnet-basic
Change to: app.openshift.io/runtime=dotnet
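
The effect of the label change can be illustrated with a lookup that falls back to a generic icon. This sketch is not the console's implementation. The set of runtime names with a dedicated icon and the fallback name are hypothetical.

```python
# Hypothetical subset of runtime names that have a dedicated icon.
KNOWN_RUNTIME_ICONS = {"dotnet", "golang", "quarkus", "nodejs"}
GENERIC_ICON = "openshift"


def topology_icon(labels):
    """Pick an icon from the app.openshift.io/runtime label, else the generic one."""
    runtime = labels.get("app.openshift.io/runtime")
    return runtime if runtime in KNOWN_RUNTIME_ICONS else GENERIC_ICON


print(topology_icon({"app.openshift.io/runtime": "dotnet-basic"}))  # openshift
print(topology_icon({"app.openshift.io/runtime": "dotnet"}))        # dotnet
```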

Description of problem:

Configure both IPv4 and IPv6 addresses for API/Ingress in install-config.yaml and install the cluster using the agent-based installer. The provisioned cluster has only an IPv4 stack for API/Ingress.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. As description
2.
3.

Actual results:

The cluster provisioned has only IPv4 stack for API/Ingress

Expected results:

The cluster provisioned has both IPv4 and IPv6 for API/Ingress
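
Whether a list of API or Ingress addresses is dual-stack can be checked with the standard `ipaddress` module. The sketch below is a minimal check. The helper names are hypothetical and the addresses are documentation-range placeholders, not values from the affected cluster.

```python
import ipaddress


def ip_families(vips):
    """Return the set of IP versions (4 and/or 6) present in a list of addresses."""
    return {ipaddress.ip_address(vip).version for vip in vips}


def is_dual_stack(vips):
    """True when the list contains at least one IPv4 and one IPv6 address."""
    return ip_families(vips) == {4, 6}


# Placeholder addresses from the documentation ranges.
configured = ["192.0.2.10", "2001:db8::10"]   # what install-config.yaml asks for
provisioned = ["192.0.2.10"]                  # what the bug leaves in place

print(is_dual_stack(configured), is_dual_stack(provisioned))  # True False
```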

Additional info:

 

Tracker issue for bootimage bump in 4.13. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-5959.

Description of problem:

On 4.13, the installer fails to parse the client certificate when using a certificate-based Service Principal with a passphrase. The error is as follows:

[fedora@preserve-jima 4.13.0-0.nightly-2023-02-13-235211]$ ./openshift-install create install-config --dir test             
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub          
? Platform azure
WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. 
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" 
FATAL failed to fetch Install Config: failed to fetch dependency of "Install Config": failed to fetch dependency of "Base Domain": failed to generate asset "Platform": failed to parse client certificate: pkcs12: decryption password incorrect 

The content of osServicePrincipal.json:
[fedora@preserve-jima 4.13.0-0.nightly-2023-02-13-235211]$ cat ~/.azure/osServicePrincipal.json 
{"subscriptionId":"xxxxx-xxx-xxx-xxx-xxx","clientId":"xxxxx-xxx-xxx-xxx-xxx","tenantId":"xxxxx-xxx-xxx-xxx-xxx","clientCertificate":"/home/fedora/azure/client-certs/cert.pfx","clientCertificatePassword":"PASSWORD"}

When the PEM certificate and pfx file are created without a passphrase, the installer can parse the certs correctly and continue the installation.

The issue also does not reproduce on 4.12 when using a certificate-based SP with or without a passphrase.

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-13-235211

How reproducible:

Always on 4.13

Steps to Reproduce:

1. Generate the certificate pem and pfx files with a passphrase
2. Add the public cert to an existing Service Principal on the Azure portal, and configure ~/.azure/osServicePrincipal.json
3. Trigger installation

Actual results:

The installer fails to parse the certificate

Expected results:

Installation is successful.

Additional info:

The issue only happens on 4.13 with a certificate-based SP that uses a passphrase

 

 

 

 

 

This is a clone of issue OCPBUGS-10622. The following is the description of the original issue:

Description of problem:

Unit test failing 

=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
    --- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)


Version-Release number of selected component (if applicable):

 

How reproducible:

100

Steps to Reproduce:

see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648 

Actual results:

unit tests fail

Expected results:

TestNewAppRunAll unit test should pass

Additional info:

 

Description of problem:

A build which works on 4.12 errored out on 4.13.

Version-Release number of selected component (if applicable):

oc --context build02 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-ec.3   True        False         4d2h    Cluster version is 4.13.0-ec.3

How reproducible:

Always

Steps to Reproduce:

1. oc new-project hongkliu-test
2. oc create is test-is --as system:admin
3. oc apply -f test-bc.yaml # the file is in the attachment

Actual results:

oc --context build02 logs test-bc-5-build
Defaulted container "docker-build" out of: docker-build, manage-dockerfile (init)
time="2023-02-20T19:13:38Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
I0220 19:13:38.405163       1 defaults.go:112] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].
Caching blobs under "/var/cache/blobs".Pulling image image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08 ...
Trying to pull image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08...
Getting image source signatures
Copying blob sha256:aa8ae8202b42d1c70c3a7f65680eabc1c562a29227549b9a1b33dc03943b20d2
Copying blob sha256:31326f32ac37d5657248df0a6aa251ec6a416dab712ca1236ea40ca14322a22c
Copying blob sha256:b21786fe7c0d7561a5b89ca15d8a1c3e4ea673820cd79f1308bdfd8eb3cb7142
Copying blob sha256:68296e6645b26c3af42fa29b6eb7f5befa3d8131ef710c25ec082d6a8606080d
Copying blob sha256:6b1c37303e2d886834dab68eb5a42257daeca973bbef3c5d04c4868f7613c3d3
Copying blob sha256:cbdbe7a5bc2a134ca8ec91be58565ec07d037386d1f1d8385412d224deafca08
Copying blob sha256:46cf6a1965a3b9810a80236b62c42d8cdcd6fb75f9b58d1b438db5736bcf2669
Copying config sha256:9aefe4e59d3204741583c5b585d4d984573df8ff751c879c8a69379c168cb592
Writing manifest to image destination
Storing signatures
Adding transient rw bind mount for /run/secrets/rhsm
STEP 1/4: FROM image-registry.openshift-image-registry.svc:5000/ci/html-proofer@sha256:684aae4e929e596f7042c34a3604c81137860187305f775c2380774bda4b6b08
STEP 2/4: RUN apk add --no-cache bash
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.11/community/x86_64/APKINDEX.tar.gz
(1/1) Installing bash (5.0.11-r1)
Executing bash-5.0.11-r1.post-install
ERROR: bash-5.0.11-r1.post-install: script exited with error 127
Executing busybox-1.31.1-r9.trigger
ERROR: busybox-1.31.1-r9.trigger: script exited with error 127
1 error; 21 MiB in 40 packages
error: build error: building at STEP "RUN apk add --no-cache bash": while running runtime: exit status 1

Expected results:

 

Additional info:

Run the build on build01 (4.12.4) and it works fine.

oc --context build01 get clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.4    True        False         2d11h   Cluster version is 4.12.4

Description of problem:

After configuring a webhook receiver in alertmanager to send alerts to an external tool, a customer noticed that received alerts have "https:///<console-url>" as their source (note the three slashes).

Version-Release number of selected component (if applicable):

OCP 4.10

How reproducible:

Always

Actual results:

https:///<console-url>

Expected results:

https://<console-url>

Additional info:

After investigating I discovered that the problem might be in the CMO code:

→ oc get Alertmanager main -o yaml | grep externalUrl
  externalUrl: https:/console-openshift-console.apps.jakumar-2022-11-27-224014.devcluster.openshift.com/monitoring
→ oc get Prometheus k8s -o yaml | grep externalUrl
  externalUrl: https:/console-openshift-console.apps.jakumar-2022-11-27-224014.devcluster.openshift.com/monitoring
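
The `https:/` prefix in both `externalUrl` values is what a filesystem-style path join produces when it is applied to a full URL: repeated slashes are collapsed, including the `//` after the scheme. The Python sketch below reproduces that effect and contrasts it with a URL-aware join. It is an illustration only; the host name is made up, and whether CMO builds the URL this way is an assumption that the report does not confirm.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_like_a_path(base_url: str, suffix: str) -> str:
    # Treats the whole URL as a plain path; normpath collapses "//" into "/".
    return posixpath.normpath(posixpath.join(base_url, suffix))


def join_like_a_url(base_url: str, suffix: str) -> str:
    # Joins only the path component; scheme and host are left untouched.
    parts = urlsplit(base_url)
    path = posixpath.join(parts.path or "/", suffix)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


base = "https://console.example.test"  # hypothetical console URL
print(join_like_a_path(base, "monitoring"))
print(join_like_a_url(base, "monitoring"))
```

The first call loses one slash after the scheme, matching the shape of the values shown above; the second keeps the URL intact.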

Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1415

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-7921. The following is the description of the original issue:

Description of problem:

Tested on GCP with four failureDomains (a, b, c, f) in the CPMS. After removing a, a new master is created in f. If a is then re-added to the CPMS, the instance is moved back from f to a.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

Before update cpms.
      failureDomains:
        gcp:
        - zone: us-central1-a
        - zone: us-central1-b
        - zone: us-central1-c
        - zone: us-central1-f
$ oc get machine                  
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp22-4glmq-master-2         Running   n2-standard-4   us-central1   us-central1-c   3h4m
zhsungcp22-4glmq-master-hzsf2-0   Running   n2-standard-4   us-central1   us-central1-b   90m
zhsungcp22-4glmq-master-plch8-1   Running   n2-standard-4   us-central1   us-central1-a   11m
zhsungcp22-4glmq-worker-a-cxf5w   Running   n2-standard-4   us-central1   us-central1-a   3h
zhsungcp22-4glmq-worker-b-d5vzm   Running   n2-standard-4   us-central1   us-central1-b   3h
zhsungcp22-4glmq-worker-c-4d897   Running   n2-standard-4   us-central1   us-central1-c   3h

1. Delete failureDomain "zone: us-central1-a" in cpms, new machine Running in zone f.
      failureDomains:
        gcp:
        - zone: us-central1-b
        - zone: us-central1-c
        - zone: us-central1-f 
$ oc get machine              
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp22-4glmq-master-2         Running   n2-standard-4   us-central1   us-central1-c   3h19m
zhsungcp22-4glmq-master-b7pdl-1   Running   n2-standard-4   us-central1   us-central1-f   13m
zhsungcp22-4glmq-master-hzsf2-0   Running   n2-standard-4   us-central1   us-central1-b   106m
zhsungcp22-4glmq-worker-a-cxf5w   Running   n2-standard-4   us-central1   us-central1-a   3h16m
zhsungcp22-4glmq-worker-b-d5vzm   Running   n2-standard-4   us-central1   us-central1-b   3h16m
zhsungcp22-4glmq-worker-c-4d897   Running   n2-standard-4   us-central1   us-central1-c   3h16m
2. Add failureDomain "zone: us-central1-a" again, new machine running in zone a, the machine in zone f will be deleted.
      failureDomains:
        gcp:
        - zone: us-central1-a
        - zone: us-central1-f
        - zone: us-central1-c
        - zone: us-central1-b
$ oc get machine                          
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp22-4glmq-master-2         Running   n2-standard-4   us-central1   us-central1-c   3h35m
zhsungcp22-4glmq-master-5kltp-1   Running   n2-standard-4   us-central1   us-central1-a   12m
zhsungcp22-4glmq-master-hzsf2-0   Running   n2-standard-4   us-central1   us-central1-b   121m
zhsungcp22-4glmq-worker-a-cxf5w   Running   n2-standard-4   us-central1   us-central1-a   3h32m
zhsungcp22-4glmq-worker-b-d5vzm   Running   n2-standard-4   us-central1   us-central1-b   3h32m
zhsungcp22-4glmq-worker-c-4d897   Running   n2-standard-4   us-central1   us-central1-c   3h32m  

Actual results:

Instance is moved back from f to a

Expected results:

Instance shouldn't be moved back from f to a

Additional info:

https://issues.redhat.com//browse/OCPBUGS-7366

Description of problem:

A new master is created if a duplicated failure domain is added to the controlplanemachineset.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-06-054655

How reproducible:

Always

Steps to Reproduce:

1. Update the controlplanemachineset and add a duplicate of failure domain us-east-2a in the first position of failureDomains

      failureDomains:
        aws:
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - zhsun117-x6jjt-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - zhsun117-x6jjt-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2b
          subnet:
            filters:
            - name: tag:Name
              values:
              - zhsun117-x6jjt-private-us-east-2b
            type: Filters
        - placement:
            availabilityZone: us-east-2c
          subnet:
            filters:
            - name: tag:Name
              values:
              - zhsun117-x6jjt-private-us-east-2c
            type: Filters
        platform: AWS

Actual results:

A new master will be created in duplicated zone us-east-2a and the old master in zone us-east-2c will be removed.
$ oc get machine                 
NAME                                     PHASE     TYPE         REGION      ZONE         AGE
zhsun117-x6jjt-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   5h37m
zhsun117-x6jjt-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   5h37m
zhsun117-x6jjt-master-w8785-2            Running   m6i.xlarge   us-east-2   us-east-2a   15m
zhsun117-x6jjt-worker-us-east-2a-nxn6j   Running   m6i.xlarge   us-east-2   us-east-2a   5h34m
zhsun117-x6jjt-worker-us-east-2b-7vmr8   Running   m6i.xlarge   us-east-2   us-east-2b   5h34m
zhsun117-x6jjt-worker-us-east-2c-2zwwv   Running   m6i.xlarge   us-east-2   us-east-2c   5h34m

I1107 08:28:56.243804       1 provider.go:416]  "msg"="Created machine" "controller"="controlplanemachineset" "failureDomain"="AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:Filters, Value:&[{Name:tag:Name Values:[zhsun117-x6jjt-private-us-east-2a]}]}}" "index"=2 "machineName"="zhsun117-x6jjt-master-lzs4c-2" "name"="zhsun117-x6jjt-master-4v8wl-2" "namespace"="openshift-machine-api" "reconcileID"="eec9a27c-4b7e-467a-b28c-6470c3068ab2" "updateStrategy"="RollingUpdate"

Expected results:

The cluster is not updated.

Additional info:

If the duplicate failure domain us-east-2a is added at the end of failureDomains, it does not trigger an update.
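
One way to make the outcome independent of where a duplicate appears is to drop repeated failure domains before they are used for placement. The following is a minimal Python sketch of that idea, assuming a failure domain can be compared by its availability zone and subnet name. The helper name and the tuple representation are hypothetical, and this is not the control plane machine set operator's code.

```python
from typing import Hashable, Iterable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def dedupe_preserving_order(domains: Iterable[T]) -> List[T]:
    """Return the domains with later duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for domain in domains:
        if domain not in seen:
            seen.add(domain)
            unique.append(domain)
    return unique


# (availability zone, subnet tag) pairs taken from the reproduction step above.
failure_domains = [
    ("us-east-2a", "zhsun117-x6jjt-private-us-east-2a"),
    ("us-east-2a", "zhsun117-x6jjt-private-us-east-2a"),
    ("us-east-2b", "zhsun117-x6jjt-private-us-east-2b"),
    ("us-east-2c", "zhsun117-x6jjt-private-us-east-2c"),
]
for zone, subnet in dedupe_preserving_order(failure_domains):
    print(zone, subnet)
```

With the duplicate removed, the list is the same whether the extra us-east-2a entry was added first or last.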

Description of problem:

Currently we are not gathering Machine objects. We received a nomination for a rule that will use this resource.

Description of problem:

etcd failed to add the third node, so the installation fails.

How reproducible:

sometimes

Additional info:

This started to happen in multi-node installations after this PR merged:
https://github.com/openshift/assisted-installer/pull/615

 

Description of problem:

PipelineRun templates are currently fetched from the `openshift-pipelines` namespace. They have to be fetched from the `openshift` namespace.

Version-Release number of selected component (if applicable):
4.11 and 1.8.1 OSP

To align with the operator changes in 1.8.1 (https://issues.redhat.com/browse/SRVKP-2413), the UI has to update its code to fetch PipelineRun templates from the openshift namespace.

openshift-4 tracking bug for ose-baremetal-installer-container: see the bugs linked in the "Blocks" field of this bug for full details of the security issue(s).

This bug is never intended to be made public, please put any public notes in the blocked bugs.

Impact: Important
Public Date: 24-May-2022
Resolve Bug By: 14-Jun-2022

In case the dates above are already past, please evaluate this bug in your next prioritization review and make a decision then.

Please see the Security Errata Policy for further details: https://docs.engineering.redhat.com/x/9RBqB

Description of problem:

When we create a MachineConfig (MC) that deploys a unit, and this unit has both contents and mask=true, the node becomes degraded because of a config drift error like this one:

E1118 16:41:42.485314    1900 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e701d8c471184e3a66756b26b4b7dd33: mode mismatch for file: "/etc/systemd/system/maks-and-contents.service"; expected: -rw-r--r--/420/0644; received: Lrwxrwxrwx/134218239/01000000777

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-19-191518

How reproducible:

Always

Steps to Reproduce:

1. Create this machine config resource

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mask-and-content
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - name: maks-and-contents.service
        mask: true
        contents: |
          [Unit]
          Description=Just random content

Actual results:

The worker MCP becomes degraded, and this error is reported in the MCD:

E1118 16:41:42.485314    1900 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-e701d8c471184e3a66756b26b4b7dd33: mode mismatch for file: "/etc/systemd/system/maks-and-contents.service"; expected: -rw-r--r--/420/0644; received: Lrwxrwxrwx/134218239/01000000777
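
The two numeric modes in this message can be decoded by hand. Assuming the daemon prints Go `os.FileMode` values (where the symlink flag is bit 27), the arithmetic below shows that the expected mode is a regular file with `0644` permissions, while the received mode is a symbolic link with `0777` permissions. That is consistent with a masked unit, which systemd represents as a link to `/dev/null`.

```python
GO_MODE_SYMLINK = 1 << 27  # value of os.ModeSymlink in Go's file mode bits


def describe(mode: int) -> str:
    # Only the symlink flag is checked; other file types are out of scope here.
    kind = "symlink" if mode & GO_MODE_SYMLINK else "regular file"
    return f"{kind}, permissions {mode & 0o777:04o}"


print(describe(420))        # expected mode from the log line
print(describe(134218239))  # received mode from the log line
```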
 

Expected results:

Until config drift functionality was added, if a unit was masked, then the content was ignored.

If this configuration is not allowed, the error should report a more descriptive message.
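
If the combination is to be rejected rather than ignored, a check along these lines could produce the more descriptive message. This is a hypothetical Python sketch that operates on the parsed MachineConfig as a plain dict; the function name and the message wording are assumptions and do not describe MCO behavior.

```python
def masked_units_with_contents(machine_config: dict) -> list:
    """Return names of systemd units that set mask: true and also carry contents."""
    units = (
        machine_config.get("spec", {})
        .get("config", {})
        .get("systemd", {})
        .get("units", [])
    )
    return [u["name"] for u in units if u.get("mask") and u.get("contents")]


# Same structure as the MachineConfig in the reproduction step.
mc = {
    "spec": {
        "config": {
            "systemd": {
                "units": [
                    {
                        "name": "maks-and-contents.service",
                        "mask": True,
                        "contents": "[Unit]\nDescription=Just random content\n",
                    }
                ]
            }
        }
    }
}
for name in masked_units_with_contents(mc):
    print(f'unit "{name}" sets mask: true and also has contents')
```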
 

Additional info:

It is not enough to restore the desiredConfig value in the degraded nodes. These are the steps to recover the node:

1. Edit the node's annotations and set desiredConfig = currentConfig
2. Remove the file /etc/machine-config-daemon/currentconfig on the node
3. Flush the journal on the node
$ journalctl --rotate; journalctl --vacuum-time=1s

4. Create the force file on the node
$ touch /run/machine-config-daemon-force

 

 

We should avoid errors like:

$ oc get -o json clusterversion version | jq -r '.status.history[0].acceptedRisks'
Forced through blocking failures: Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=True (), so the update from 4.13.0-0.okd-2022-12-11-064650 to 4.13.0-0.okd-2022-12-13-052859 is probably neither recommended nor supported.

Instead, tweak the logic from OCPBUGS-2727, and only append the Forced through blocking failures: prefix when the forcing was required.
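
The intended behavior can be sketched as follows, in Python rather than the cluster-version operator's Go. The function and parameter names are hypothetical, and joining failures with "; " is an assumption. The only logic taken from the description is that the prefix is added only when forcing was actually required.

```python
from typing import List

PREFIX = "Forced through blocking failures: "


def describe_accepted_risks(failures: List[str], forcing_was_required: bool) -> str:
    """Join precondition failures; add the prefix only when force was needed."""
    message = "; ".join(failures)
    if failures and forcing_was_required:
        return PREFIX + message
    return message


risk = 'Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate"'
print(describe_accepted_risks([risk], forcing_was_required=False))
print(describe_accepted_risks([risk], forcing_was_required=True))
```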

Description of problem:

The current version of openshift/router vendors Kubernetes 1.25 packages. OpenShift 4.13 is based on Kubernetes 1.26.   

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/router/blob/release-4.13/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.25

Expected results:

Kubernetes packages are at version v0.26.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.
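
The check in the reproduction step can be scripted. The sketch below assumes the usual `module version` layout of `go.mod` require lines and reports the three Kubernetes modules named above when they are older than v0.26.0. The sample input is a shortened, made-up excerpt, not the actual contents of the router's `go.mod`.

```python
import re

WATCHED = ("k8s.io/api", "k8s.io/apimachinery", "k8s.io/client-go")
# Matches "<module> vX.Y.Z", optionally preceded by the "require" keyword.
LINE = re.compile(r"^\s*(?:require\s+)?(\S+)\s+v(\d+)\.(\d+)\.(\d+)")


def outdated_modules(go_mod_text: str, minimum=(0, 26, 0)) -> dict:
    """Map each watched module below the minimum version to its found version."""
    outdated = {}
    for line in go_mod_text.splitlines():
        match = LINE.match(line)
        if not match or match.group(1) not in WATCHED:
            continue
        version = tuple(int(part) for part in match.group(2, 3, 4))
        if version < minimum:
            outdated[match.group(1)] = "v" + ".".join(str(p) for p in version)
    return outdated


# Hypothetical excerpt; the patch versions are illustrative only.
sample = """
require (
    k8s.io/api v0.25.2
    k8s.io/apimachinery v0.25.2
    k8s.io/client-go v0.26.0
)
"""
print(outdated_modules(sample))
```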

Description of problem:

keepalived should not be part of a vSphere UPI installation.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always. Install vSphere UPI and check the keepalived pod; it will be in Pending status.

Actual results:

The keepalived pod is in Pending status.

Expected results:

The keepalived pod should not be part of a vSphere UPI installation.
 

Additional info:

 

When a cluster has only one failure domain, the nodes are not tagged, so it doesn't make sense to deploy storage with a topological configuration in that case.

After disabling the helm and import-from-samples actions in customization, the Helm Charts and Samples options are still enabled in the topology add actions.

Under 

spec:
    customization:
        addPage:
           disabledActions:

Insert snippet of Add page actions. (attached screenshot for reference)

Actual result:

Helm Charts and Samples options are still enabled in topology add actions even after disabling them in customization

Expected result:

Helm Charts and Samples options should be disabled (hidden)

Description of problem:

The redhat-operator pod of the marketplace fails regularly because its startup probe times out connecting to the registry-server container in the same pod within 1 second, which in turn increases CPU and memory usage on master nodes:

62m         Normal    Scheduled                pod/redhat-operators-zb4j7                         Successfully assigned openshift-marketplace/redhat-operators-zb4j7 to ip-10-0-163-212.us-west-2.compute.internal by ip-10-0-149-93
62m         Normal    AddedInterface           pod/redhat-operators-zb4j7                         Add eth0 [10.129.1.112/23] from ovn-kubernetes
62m         Normal    Pulling                  pod/redhat-operators-zb4j7                         Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.11"
62m         Normal    Pulled                   pod/redhat-operators-zb4j7                         Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.11" in 498.834447ms
62m         Normal    Created                  pod/redhat-operators-zb4j7                         Created container registry-server
62m         Normal    Started                  pod/redhat-operators-zb4j7                         Started container registry-server
62m         Warning   Unhealthy                pod/redhat-operators-zb4j7                         Startup probe failed: timeout: failed to connect service ":50051" within 1s
62m         Normal    Killing                  pod/redhat-operators-zb4j7                         Stopping container registry-server


Increasing the threshold of the probe might fix the problem:
  livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install OSD cluster using 4.11.0-0.nightly-2022-08-26-162248 payload
2. Inspect redhat-operator pod in openshift-marketplace namespace
3. Observe the resource usage ( CPU and Memory ) of the pod 

Actual results:

The redhat-operator pod fails regularly during startup, leading to increased CPU and memory usage on master nodes.

Expected results:

The redhat-operator startup probe succeeds and there are no resource spikes on master nodes.

Additional info:

Attached cpu, memory and event traces.

 

Description of problem:

When creating a pod controller (e.g. a deployment) with a pod spec that will be mutated by SCCs, users might still get a warning that the pod does not meet the namespace's pod security level.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

100%

Steps to Reproduce:

1. create a namespace with restricted PSa warning level (the default)
2. create a deployment with a pod with an empty security context

Actual results:

You get a warning about the deployment's pod not meeting the NS's pod security admission requirements.

Expected results:

No warning if the pod for the deployment would be properly mutated by SCCs in order to fulfill the NS's pod security requirements.

Additional info:

originally implemented as a part of https://issues.redhat.com/browse/AUTH-337

 

This is a clone of issue OCPBUGS-7559. The following is the description of the original issue:

Description of problem:

When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.

Version-Release number of selected component (if applicable):

4.12.3

How reproducible:

Consistent

Steps to Reproduce:

1. On a long lived cluster, add a new machineset

Actual results:

Machines reach "Provisioned" but don't join the cluster

Expected results:

Machines join cluster as nodes

Additional info:


This is a clone of issue OCPBUGS-13017. The following is the description of the original issue:

aws-ebs-csi-driver-controller-ca ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.

Currently our default for journal logs is the last 5 hours, but we need to collect the full journal logs because the relevant issue sometimes occurs the day before.

Description of problem:

The installer gets stuck at the beginning of the installation if a BYO private hosted zone is configured in install-config. According to the CI logs, the installer took no action for 2 hours.

Errors:
level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred"
{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:164","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2023-03-05T16:44:27Z"}


Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-23-000343

How reproducible:

Always

Steps to Reproduce:

1. Create an install-config.yaml and configure a BYO private hosted zone
2. Create the cluster

Actual results:

The installer shows the following message and then gets stuck; the cluster cannot be created.

level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred"

Expected results:

The cluster is created successfully.

Additional info:

 

Description of problem:

Nodes are taking more than 5m0s to stage OSUpdate

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-upgrade-from-nightly-4.12-ocp-ovn-remote-libvirt-s390x/1641274324021153792

Actual results:

{  4 nodes took over 5m0s to stage OSUpdate:

node/libvirt-s390x-1-1-f88-7mbq9-worker-0-2hrgh OSUpdateStarted at 2023-03-30T05:34:03Z, OSUpdateStaged at 2023-03-30T05:45:40Z: 11m37s
node/libvirt-s390x-1-1-f88-7mbq9-master-2 OSUpdateStarted at 2023-03-30T05:34:15Z, OSUpdateStaged at 2023-03-30T05:45:32Z: 11m17s
node/libvirt-s390x-1-1-f88-7mbq9-worker-0-hzqzf OSUpdateStarted at 2023-03-30T05:55:50Z, OSUpdateStaged at 2023-03-30T06:01:31Z: 5m41s
node/libvirt-s390x-1-1-f88-7mbq9-master-0 OSUpdateStarted at 2023-03-30T05:54:46Z, OSUpdateStaged at 2023-03-30T06:02:24Z: 7m38s}
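
The durations reported by the test follow directly from the two timestamps on each line. As a small Python sketch of that arithmetic, using the node names and timestamps from the output above and the 5m0s limit from the description (the function name is illustrative):

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def staging_time(started: str, staged: str) -> timedelta:
    """Time between OSUpdateStarted and OSUpdateStaged."""
    return datetime.strptime(staged, TIMESTAMP_FORMAT) - datetime.strptime(
        started, TIMESTAMP_FORMAT
    )


events = {
    "libvirt-s390x-1-1-f88-7mbq9-worker-0-2hrgh": ("2023-03-30T05:34:03Z", "2023-03-30T05:45:40Z"),
    "libvirt-s390x-1-1-f88-7mbq9-master-2": ("2023-03-30T05:34:15Z", "2023-03-30T05:45:32Z"),
    "libvirt-s390x-1-1-f88-7mbq9-worker-0-hzqzf": ("2023-03-30T05:55:50Z", "2023-03-30T06:01:31Z"),
    "libvirt-s390x-1-1-f88-7mbq9-master-0": ("2023-03-30T05:54:46Z", "2023-03-30T06:02:24Z"),
}
limit = timedelta(minutes=5)
for node, (started, staged) in events.items():
    took = staging_time(started, staged)
    if took > limit:
        print(f"{node}: {took} (over the limit by {took - limit})")
```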

Expected results:

 

Additional info:

 

Description of problem:

Configure a single zone in the install-config.yaml file:
$ cat install-config-1.yaml 
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    vsphere:
      zones:
      - us-east-1
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere:
      zones:
      - us-east-1
  replicas: 3
metadata:
  name: jimavmc
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 172.31.248.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    apiVIP: 172.31.248.137
    cluster: Cluster-1
    datacenter: SDDC-Datacenter
    defaultDatastore: WorkloadDatastore
    ingressVIP: 172.31.248.141
    network: qe-segment
    password: xxx
    username: xxx
    vCenter: xxx
    vcenters:
    - server: xxx
      user: xxx
      password: xxx
      datacenters:
      - SDDC-Datacenter
    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      topology:
        computeCluster: /SDDC-Datacenter/host/Cluster-1
        networks:
        - qe-segment
        datacenter: SDDC-Datacenter
        datastore: WorkloadDatastore
      server: xxx
publish: External
pullSecret: xxx
sshKey: xxx

Continuing with the cluster installation, the installer fails because it cannot find the folder when importing the OVA image.
$ ./openshift-install create cluster --dir ipi-single-node-1 --log-level debug
DEBUG OpenShift Installer 4.12.0-0.nightly-2022-09-20-095559 
DEBUG Built from commit 64257bdd65a293f6b0c8d748fe8c51b0f17b8b2d  
...
DEBUG vsphere_tag_category.category: Creating...   
DEBUG vsphere_tag_category.category: Creation complete after 0s [id=urn:vmomi:InventoryServiceCategory:b2a23d3f-979f-4d7a-a204-792992108468:GLOBAL] 
DEBUG vsphere_tag.tag: Creating...                 
DEBUG vsphere_tag.tag: Creation complete after 0s [id=urn:vmomi:InventoryServiceTag:f128ab99-b4d1-4543-8e91-d7c929abf8a5:GLOBAL] 
DEBUG vsphere_folder.folder["SDDC-Datacenter-jimavmc-plnsq"]: Creating... 
DEBUG vsphereprivate_import_ova.import[0]: Creating... 
DEBUG vsphere_folder.folder["SDDC-Datacenter-jimavmc-plnsq"]: Creation complete after 1s [id=group-v951730] 
ERROR                                              
ERROR Error: failed to find provided vSphere objects: folder '/SDDC-Datacenter/vm/jimavmc-plnsq' not found 
ERROR                                              
ERROR   with vsphereprivate_import_ova.import[0],  
ERROR   on main.tf line 70, in resource "vsphereprivate_import_ova" "import": 
ERROR   70: resource "vsphereprivate_import_ova" "import" { 
ERROR                                              
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "pre-bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1 
ERROR                                              
ERROR Error: failed to find provided vSphere objects: folder '/SDDC-Datacenter/vm/jimavmc-plnsq' not found 
ERROR                                              
ERROR   with vsphereprivate_import_ova.import[0],  
ERROR   on main.tf line 70, in resource "vsphereprivate_import_ova" "import": 
ERROR   70: resource "vsphereprivate_import_ova" "import" { 
ERROR                                              
ERROR 

However, the folder has actually been created:
$ govc ls /SDDC-Datacenter/vm/ | grep jimavmc
/SDDC-Datacenter/vm/jimavmc-plnsq                                          

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-20-095559

How reproducible:

always

Steps to Reproduce:

1. Prepare install-config.yaml file and configure single zone
2. openshift-install create cluster  
3. 

Actual results:

Folder could not be found when importing ova image

Expected results:

installation is successful

Additional info:

 

 

Description of problem:
There is an endless re-render loop, and the browser feels slow or stuck, when opening the add page or the topology.

Saw also endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds

Version-Release number of selected component (if applicable):
1. Console UI 4.12-4.13 (master)
2. Service Binding Operator (tested with 1.3.1)

How reproducible:
Always with installed SBO

But the "stuck feeling" depends on the browser (Firefox feels more stuck) and the power of your local machine

Steps to Reproduce:
1. Install Service Binding Operator
2. Create or update the BindableKinds resource "bindable-kinds"

apiVersion: binding.operators.coreos.com/v1alpha1
kind: BindableKinds
metadata:
  name: bindable-kinds

3. Open the browser console log
4. Open the console UI and navigate to the add page

Actual results:
1. Saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
2. Browser feels slow and gets stuck after some time
3. The page crashes after some time

Expected results:
1. The API call should be called just once
2. The add page should just work without feeling laggy
3. No crash

Additional info:
This was introduced when we started watching the bindable-kinds resource with https://github.com/openshift/console/pull/11161

It looks like this happens only if the SBO is installed and the bindable-kinds resource exists but doesn't contain any status.

The status lists all available bindable resource types. I could not reproduce this by installing and uninstalling an operator, but you can manually create or update this resource as mentioned above.

Description of problem:

Virtual media provisioning fails when the iLO Ironic driver is used.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. attempt virtual media provisioning on a node configured with ilo-virtualmedia:// drivers
2.
3.

Actual results:

Provisioning fails with "An auth plugin is required to determine endpoint URL" error

Expected results:

Provisioning succeeds

Additional info:

Relevant log snippet:

2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector [None req-e58ac1f2-fac6-4d28-be9e-983fa900a19b - - - - - -] Unable to start managed inspection for node e4445d43-3458-4cee-9cbe-6da1de7578cd: An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last):
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     task.driver.boot.prepare_ramdisk(task, ramdisk_params=params)
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic_lib/metrics.py", line 59, in wrapped
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     result = f(*args, **kwargs)
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/ilo/boot.py", line 408, in prepare_ramdisk
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     iso = image_utils.prepare_deploy_iso(task, ramdisk_params,
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 624, in prepare_deploy_iso
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     return prepare_iso_image(inject_files=inject_files)
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 537, in _prepare_iso_image
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     image_url = img_handler.publish_image(
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 193, in publish_image
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     swift_api = swift.SwiftAPI()
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/common/swift.py", line 66, in __init__
2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     endpoint = keystone.get_endpoint('swift', session=session)

Description of problem:

A multi-arch payload carries two macOS targets, mac-arm64 and mac-amd64, so there are now two oc clients for macOS. The command `oc adm release extract --command=oc --command-os='*'` extracts both macOS clients, but `oc adm release extract --command=oc --command-os='mac'` extracts only mac-amd64. The `--command-os=mac` option needs to be updated.

Version-Release number of selected component (if applicable):

oc version --client
Client Version: 4.12.0-0.nightly-2022-12-01-184212

Steps:

1) Use `oc adm release extract --command=oc --command-os='*'  --to=/tmp/macs  quay.io/openshift-release-dev/ocp-release-nightly@sha256:8c0aefc2e2ad7f4feaf8382c6f9dbf7eada45b22dc341c8eee30bffb07c79852` 
2) Use `oc adm release extract --command=oc --command-os='mac'  --to=/tmp/mac  quay.io/openshift-release-dev/ocp-release-nightly@sha256:8c0aefc2e2ad7f4feaf8382c6f9dbf7eada45b22dc341c8eee30bffb07c79852`

Actual result :

1) All available versions of oc are extracted.
2) Only mac-amd64 is extracted for mac.

Expected result:

2) Either document the option's behavior, or extract both the mac-amd64 and mac-arm64 versions.
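
The second expected behavior can be sketched as follows. This is a hypothetical illustration in Python, not the actual `oc` implementation: the function name `match_command_os` and the `linux-amd64` entry are assumptions made for the example, and only the target names `mac-amd64` and `mac-arm64` come from the report.

```python
def match_command_os(requested, targets):
    """Return the targets selected by a --command-os style value.

    Hypothetical helper: '*' selects everything, an exact target name
    selects only itself, and a bare OS name such as 'mac' selects every
    architecture available for that OS.
    """
    if requested == "*":
        return list(targets)
    if requested in targets:
        return [requested]
    prefix = requested + "-"
    return [t for t in targets if t.startswith(prefix)]


# 'linux-amd64' is a placeholder; the two mac targets come from the report.
TARGETS = ["linux-amd64", "mac-amd64", "mac-arm64"]

print(match_command_os("mac", TARGETS))  # ['mac-amd64', 'mac-arm64']
```

With a rule like this, a bare `mac` would no longer silently resolve to a single architecture, while an explicit `mac-arm64` would still select exactly one client.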

Due to the removal of the in-tree AWS provider (https://github.com/kubernetes/kubernetes/pull/115838), we need to ensure that KCM sets the --external-cloud-volume-plugin flag accordingly, especially since CSI migration went GA in 4.12/1.25.

In `oc adm release new` only two components are allowed to be specified:

  • kubernetes
  • machine-os

These are extracted from operator labels and displayed on the release controller's release page. The RHCOS version is set in machine-os and links to the release browser, which works well for OCP.

However, for OKD we base our machine-os version on FCOS. Now that layering has been implemented, we specify the base FCOS version in machine-os. OKD also installs cri-o, and its version is not recorded. In some cases the OKD machine-os also overrides the kernel version (some versions may break e2e tests), so it would be useful to allow `oc` to set cri-o and kernel version labels and have them displayed on the release controller.

Description of problem:

When a ClusterVersion's `status.availableUpdates` is `null` and `Upgradeable=False`, a runtime error occurs on the Cluster Settings page because the UpdatesGraph component expects `status.availableUpdates` to be non-empty.

Steps to Reproduce:

1.  Add the following overrides to ClusterVersion config (/k8s/cluster/config.openshift.io~v1~ClusterVersion/version)

spec:
  overrides:
    - group: apps
      kind: Deployment
      name: console-operator
      namespace: openshift-console-operator
      unmanaged: true    
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      name: console-operator
      namespace: ''
      unmanaged: true
2.  Visit /settings/cluster and note the run-time error (see attached screenshot) 

Actual results:

An error occurs.

Expected results:

The contents of the Cluster Settings page render.
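
The guard that this report calls for can be sketched as follows. The console itself is written in TypeScript, so this Python version only illustrates the idea of treating a `null` or missing `status.availableUpdates` as an empty list. The function names are assumptions for the example and do not reflect the actual fix.

```python
def available_updates(cluster_version):
    """Return status.availableUpdates as a list, treating null or missing as empty."""
    status = cluster_version.get("status") or {}
    return status.get("availableUpdates") or []


def can_render_updates_graph(cluster_version):
    # Only draw the graph when at least one update is present;
    # otherwise the page should fall back to rendering without it.
    return len(available_updates(cluster_version)) > 0


cv = {"status": {"availableUpdates": None}}
print(can_render_updates_graph(cv))  # False
```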

Description of problem:

The cluster-ingress-operator's updateIngressClass function logs "updated IngressClass" on a failure, when it should log that message on a success.

Version-Release number of selected component (if applicable):

4.8+

How reproducible:

Easily

Steps to Reproduce:

# Simulate a change in an ingressclass that will be reconciled
oc apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: openshift-default
spec:
  controller: openshift.io/ingress-to-route
  parameters:
    apiGroup: operator.openshift.io
    kind: IngressController
    name: default
    scope: Namespace 
    namespace: "test"
EOF

# Look at logs
oc logs -n openshift-ingress-operator $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}') -c ingress-operator | grep "updated IngressClass"

#No output

Actual results:

<none>

Expected results:

2023-01-26T20:37:19.210Z    INFO    operator.ingressclass_controller    ingressclass/ingressclass.go:63    updated IngressClass ...
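
The intended control flow can be sketched as follows. The operator is written in Go, so this Python version only illustrates where the success message belongs. The `client` object and its `update` method are hypothetical stand-ins and are not the operator's API.

```python
import logging

log = logging.getLogger("ingressclass_controller")


def update_ingress_class(client, current, desired):
    """Update the object when it differs; log success only after the update succeeds.

    `client` is a hypothetical object whose `update(obj)` method raises on failure.
    Returns True when an update was performed, False when nothing changed.
    """
    if current == desired:
        return False
    try:
        client.update(desired)
    except Exception:
        # Failure path: report the failure, never claim the object was updated.
        log.exception("failed to update IngressClass")
        raise
    # Success path: this is the only place the success message is emitted.
    log.info("updated IngressClass %s", desired.get("name"))
    return True
```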

Description of problem:

When using "Manage columns" dialog, the tooltip remains displayed after the dialog is closed

Version-Release number of selected component (if applicable):

KCP v4.12.0-160

How reproducible:

always

Steps to Reproduce:

1. Go to Virtualization > VirtualMachines or Templates,
   a.k.a. open VM or Templates list
2. Click on "Manage columns" icon
3. Change something and click Save, or just click Cancel.

Actual results:

"Manage columns" tooltip is displayed

Expected results:

"Manage columns" tooltip is hidden

Additional info:

Hovering with mouse hides the tooltip.

Original bugzilla ticket:
https://bugzilla.redhat.com/show_bug.cgi?id=2141596

This problem is related to the ListPageFilter component:
https://github.com/openshift/console/blob/master/frontend/public/components/factory/ListPage/ListPageFilter.tsx

Description of problem:

See: https://issues.redhat.com/browse/CPSYN-143

tldr:  Based on the previous direction that 4.12 was going to enforce PSA restricted by default, OLM had to make a few changes because the way we run catalog pods (and we have to run them that way because of how the opm binary worked) was incompatible w/ running restricted.

1) We set openshift-marketplace to enforce restricted (this was our choice, we didn't have to do it, but we did)
2) we updated the opm binary so catalog images using a newer opm binary don't have to run privileged
3) we added a field to catalogsource that allows you to choose whether to run the pod privileged(legacy mode) or restricted.  The default is restricted.  We made that the default so that users running their own catalogs in their own NSes (which would be default PSA enforcing) would be able to be successful w/o needing their NS upgraded to privileged.

Unfortunately this means:
1) legacy catalog images (i.e. using older opm binaries) won't run on 4.12 by default (the catalogsource needs to be modified to specify legacy mode).
2) legacy catalog images cannot be run in the openshift-marketplace NS since that NS does not allow privileged pods.  This means legacy catalogs can't contribute to the global catalog (since catalogs must be in that NS to be in the global catalog).

Before 4.12 ships we need to:
1) remove the PSA restricted label on the openshift-marketplace NS
2) change the catalogsource securitycontextconfig mode default to use "legacy" as the default, not restricted.

This gives catalog authors another release to update to using a newer opm binary that can run restricted, or get their NSes explicitly labeled as privileged (4.12 will not enforce restricted, so in 4.12 using the legacy mode will continue to work)

In 4.13 we will need to revisit what we want the default to be, since at that point catalogs will start breaking if they try to run in legacy mode in most NSes.



Description of problem:

OLM is setting the "openshift.io/scc" label to "anyuid" on several namespaces:

https://github.com/openshift/operator-framework-olm/blob/d817e09c2565b825afd8bfc9bb546eeff28e47e7/manifests/0000_50_olm_00-namespace.yaml#L23
https://github.com/openshift/operator-framework-olm/blob/d817e09c2565b825afd8bfc9bb546eeff28e47e7/manifests/0000_50_olm_00-namespace.yaml#L8

This label has no effect and will lead to confusion. It should be set to an empty string for now (removing it from the manifest entirely will have no effect on upgraded clusters because the CVO does not remove deleted labels, so the next best thing is to clear the value).

For bonus points, OLM should remove the label entirely from the manifest and add migration logic to remove the existing label from these namespaces to handle upgraded clusters that already have it.

Version-Release number of selected component (if applicable):

Not sure how long this has been an issue, but fixing it in 4.12+ should be sufficient.

How reproducible:

always

Steps to Reproduce:

1. install cluster
2. examine namespace labels

Actual results:

label is present

Expected results:


Ideally the label should not be present, but in the short term setting it to an empty string is the quick fix and is better than nothing.
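
The reasoning behind the quick fix can be sketched as follows. This is a simplified model in Python of the behavior described in this report (labels dropped from a manifest stay on the live object), not the CVO's actual code. The function name is an assumption for the example.

```python
def apply_manifest_labels(live_labels, manifest_labels):
    """Merge manifest labels into live labels without deleting extra keys.

    Models an apply step that adds or overwrites labels listed in the
    manifest but never removes labels that are only on the live object.
    """
    merged = dict(live_labels)
    merged.update(manifest_labels)
    return merged


live = {"openshift.io/scc": "anyuid"}

# Dropping the label from the manifest leaves the old value in place on upgrade.
print(apply_manifest_labels(live, {}))
# Setting it to the empty string overwrites the old value.
print(apply_manifest_labels(live, {"openshift.io/scc": ""}))
```

Under this model, removing the label for good on upgraded clusters needs explicit migration logic, which is the "bonus points" item in the description.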

Because of the way the InstallConfig asset is designed, we currently have no way to override the validation to allow the different platform validations for the agent installer. Currently we work around this by checking the command line arguments in the platform validation code to see what the install method is, but without access to the command line parser this cannot be made completely robust. (It also has no effect in unit tests.)

Some refactoring is required to permit us to call the validation code with different flags, without causing a large amount of code duplication.
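
The direction of that refactoring can be sketched as follows. The installer is written in Go, so this Python version only illustrates passing the install method to the validation code explicitly instead of inspecting command-line arguments. All names and the platform sets are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class ValidationOptions:
    # Hypothetical option: True when validating for the agent-based installer.
    agent_based: bool = False


def validate_platform(platform, opts):
    """Return a list of error strings for a platform name.

    The caller states the install method through `opts`, so the same code
    path can be exercised from unit tests without a command-line parser.
    The platform sets below are placeholders chosen for the example.
    """
    if opts.agent_based:
        allowed = {"baremetal", "vsphere", "none"}
    else:
        allowed = {"aws", "baremetal", "vsphere"}
    if platform in allowed:
        return []
    return [f"unsupported platform: {platform}"]


print(validate_platform("none", ValidationOptions(agent_based=True)))  # []
```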

This is a clone of issue OCPBUGS-10649. The following is the description of the original issue:

Description of problem:

After a replace upgrade from one OCP 4.14 image to another 4.14 image, the first node is NotReady.

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                     STATUS   ROLES  AGE   VERSION
ip-10-0-128-175.us-east-2.compute.internal  Ready   worker  72m   v1.26.2+06e8c46
ip-10-0-134-164.us-east-2.compute.internal  Ready   worker  68m   v1.26.2+06e8c46
ip-10-0-137-194.us-east-2.compute.internal  Ready   worker  77m   v1.26.2+06e8c46
ip-10-0-141-231.us-east-2.compute.internal  NotReady  worker  9m54s  v1.26.2+06e8c46

- lastHeartbeatTime: "2023-03-21T19:48:46Z"
  lastTransitionTime: "2023-03-21T19:42:37Z"
  message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
   message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/.
   Has your network provider started?'
  reason: KubeletNotReady
  status: "False"
  type: Ready

Events:
 Type   Reason          Age         From          Message
 ----   ------          ----        ----          -------
 Normal  Starting         11m         kubelet        Starting kubelet.
 Normal  NodeHasSufficientMemory 11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientMemory
 Normal  NodeHasNoDiskPressure  11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
 Normal  NodeHasSufficientPID   11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientPID
 Normal  NodeAllocatableEnforced 11m         kubelet        Updated Node Allocatable limit across pods
 Normal  Synced          11m         cloud-node-controller Node synced successfully
 Normal  RegisteredNode      11m         node-controller    Node ip-10-0-141-231.us-east-2.compute.internal event: Registered Node ip-10-0-141-231.us-east-2.compute.internal in Controller
 Warning ErrorReconcilingNode   17s (x30 over 11m) controlplane      nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

ovnkube-master log:

I0321 20:55:16.270197       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270273       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:17.851497       1 master.go:719] Adding or Updating Node "ip-10-0-137-194.us-east-2.compute.internal"
I0321 20:55:25.965132       1 master.go:719] Adding or Updating Node "ip-10-0-128-175.us-east-2.compute.internal"
I0321 20:55:45.928694       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432145 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:55:46.270129       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270154       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270164       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:55:46.270201       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270284       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:52.916512       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 5 items received
I0321 20:56:06.910669       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 12 items received
I0321 20:56:15.928505       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432175 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:56:16.269611       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269637       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269646       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:56:16.269688       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269697       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269724       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
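
A quick way to find nodes in this state is to check for the annotation named in the log. The following is a minimal sketch in Python. It assumes the input is the JSON text of a Kubernetes NodeList (for example, the output of `oc get nodes -o json`). The function name is an assumption for the example.

```python
import json

SUBNET_ANNOTATION = "k8s.ovn.org/node-subnets"


def nodes_missing_subnets(node_list_json):
    """Return the names of nodes that lack the node-subnets annotation.

    Expects the JSON text of a Kubernetes NodeList.
    """
    items = json.loads(node_list_json).get("items", [])
    missing = []
    for node in items:
        meta = node.get("metadata", {})
        # 'annotations' may be absent or null on a freshly registered node.
        annotations = meta.get("annotations") or {}
        if SUBNET_ANNOTATION not in annotations:
            missing.append(meta.get("name", "<unknown>"))
    return missing
```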

cluster-network-operator log:

I0321 21:03:38.487602       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.488312       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:38.499825       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.571013       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available
I0321 21:03:39.450912       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.450953       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.493206       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.494050       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:39.508538       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.684429       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available
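
Condition dumps like the ones above are long, so it can help to filter them down to the degraded entries. The following is a minimal sketch in Python. It assumes the conditions have already been parsed into a list of dictionaries. The function name is an assumption for the example.

```python
def degraded_conditions(conditions):
    """Return (type, reason) pairs for conditions that report degradation.

    A condition is selected when its type ends in 'Degraded' and its
    status is the string "True".
    """
    return [
        (c["type"], c.get("reason", ""))
        for c in conditions
        if c.get("type", "").endswith("Degraded") and c.get("status") == "True"
    ]


# A few entries taken from the operator conditions in the log.
conditions = [
    {"type": "ManagementStateDegraded", "status": "False"},
    {"type": "Degraded", "status": "True", "reason": "RolloutHung"},
    {"type": "Available", "status": "True"},
]

print(degraded_conditions(conditions))  # [('Degraded', 'RolloutHung')]
```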

Steps to Reproduce:

1. management cluster 4.13
2. bring up the hostedcluster and nodepool in 4.14.0-0.nightly-2023-03-19-234132
3. upgrade the hostedcluster to 4.14.0-0.nightly-2023-03-20-201450 
4. replace upgrade the nodepool to 4.14.0-0.nightly-2023-03-20-201450 

Actual results:

The first node is NotReady.

Expected results:

All nodes should be Ready

Additional info:

The issue does not occur with a replace upgrade from 4.13 to 4.14.

Description of problem:

One multus case always fails in QE e2e testing. With the same net-attach-def and pod configuration files, the test passed in 4.11 but fails in 4.12 and 4.13.

Version-Release number of selected component (if applicable):

4.12 and 4.13

How reproducible:

Always

Steps to Reproduce:

[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/NetworkAttachmentDefinitions/runtimeconfig-def-ipandmac.yaml
networkattachmentdefinition.k8s.cni.cncf.io/runtimeconfig-def created
[weliang@weliang networking]$ oc get net-attach-def -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-01-03T16:33:03Z"
    generation: 1
    name: runtimeconfig-def
    namespace: test
    resourceVersion: "64139"
    uid: bb26c08f-adbf-477e-97ab-2aa7461e50c4
  spec:
    config: '{ "cniVersion": "0.3.1", "name": "runtimeconfig-def", "plugins": [{ "type":
      "macvlan", "capabilities": { "ips": true }, "mode": "bridge", "ipam": { "type":
      "static" } }, { "type": "tuning", "capabilities": { "mac": true } }] }'
kind: List
metadata:
  resourceVersion: ""
[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/Pods/runtimeconfig-pod-ipandmac.yaml
pod/runtimeconfig-pod created
[weliang@weliang networking]$ oc get pod
NAME                READY   STATUS              RESTARTS   AGE
runtimeconfig-pod   0/1     ContainerCreating   0          6s
[weliang@weliang networking]$ oc describe pod runtimeconfig-pod
Name:         runtimeconfig-pod
Namespace:    test
Priority:     0
Node:         weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal/10.0.128.4
Start Time:   Tue, 03 Jan 2023 11:33:45 -0500
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/networks: [ { "name": "runtimeconfig-def", "ips": [ "192.168.22.2/24" ], "mac": "CA:FE:C0:FF:EE:00" } ]
              openshift.io/scc: anyuid
Status:       Pending
IP:           
IPs:          <none>
Containers:
  runtimeconfig-pod:
    Container ID:   
    Image:          quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k5zqd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-k5zqd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               26s   default-scheduler  Successfully assigned test/runtimeconfig-pod to weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal
  Normal   AddedInterface          24s   multus             Add eth0 [10.128.2.115/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  23s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(cff792dbd07e8936d04aad31964bd7b626c19a90eb9d92a67736323a1a2303c4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
  Normal   AddedInterface          7s    multus             Add eth0 [10.128.2.116/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  7s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(d2456338fa65847d5dc744dea64972912c10b2a32d3450910b0b81cdc9159ca4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
[weliang@weliang networking]$ 
 

Actual results:

The pod is not running.

Expected results:

The pod should be in the Running state.
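
The events above end with "Interface name contains an invalid character /". As a rough illustration of the kind of check that produces such a rejection, here is a minimal sketch of an interface-name validator. The exact rules are an assumption modelled on common Linux interface-name constraints (length limit, no "/", ":" or whitespace); the function name is hypothetical and this is not the multus or CNI plugin code.

```python
import re

# Assumption: Linux IFNAMSIZ is 16 including the trailing NUL, so 15 usable characters.
MAX_IFNAME_LEN = 15


def invalid_ifname_reason(name):
    """Return why `name` is rejected, or None if it passes these assumed checks."""
    if not name:
        return "empty"
    if len(name) > MAX_IFNAME_LEN:
        return "too long"
    if name in (".", ".."):
        return "reserved name"
    if re.search(r"[/:\s]", name):
        return "contains '/', ':' or whitespace"
    return None


if __name__ == "__main__":
    for candidate in ("net1", "net/1"):
        print(candidate, "->", invalid_ifname_reason(candidate))
```

A check like this only explains the error text; it does not say where the slash in the name came from in the failing 4.12 and 4.13 runs.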

Additional info:

 

Sprig is a dependency of CNO, which is in turn a dependency of multiple projects, and the old Sprig version has a vulnerability.

The old message implied that the host failed to boot from the installation disk. In practice, this timeout is mainly the result of a misconfigured network that prevents the host from pulling Ignition; the new message reflects that.

Description of the problem:

Sometimes, when a host is removed from a cluster, the "sufficient-masters-count" validation still returns "success" when it shouldn't.

Steps to reproduce:

1. Create an MNO cluster and add three hosts.

2. Optionally, rename the hosts (from the UI POV, this is when we enable the Next button on the Host Discovery page).

3. Delete one of the hosts.

Actual results:

The next GET hosts and GET cluster requests (which the UI does right after the DELETE request finishes) return the correct number of hosts. The "sufficient-masters-count" validation still succeeds even though it no longer should.

Expected results:

The "sufficient-masters-count" validation should return failure when the number of masters is fewer than 3 for MNO.

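The expected behaviour of the validation can be sketched as a pure function of the current host list, so that a deleted host can never leave a stale "success" behind. This is a minimal sketch, not the assisted-service implementation: the host dictionaries, the "role" key and the function name are hypothetical, and only the threshold of 3 comes from the report.

```python
# From the expected results: fewer than 3 masters must fail for MNO.
MIN_MASTERS_MNO = 3


def sufficient_masters_count(hosts):
    """Return 'success' or 'failure' based only on the hosts passed in.

    `hosts` is assumed to be a list of dicts with a hypothetical "role" key.
    """
    masters = [h for h in hosts if h.get("role") == "master"]
    return "success" if len(masters) >= MIN_MASTERS_MNO else "failure"


if __name__ == "__main__":
    hosts = [{"id": i, "role": "master"} for i in range(3)]
    print(sufficient_masters_count(hosts))
    hosts.pop()  # simulate the DELETE request
    print(sufficient_masters_count(hosts))
```

Recomputing the result from the host list returned by the same GET request avoids relying on a previously cached validation status.
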
Description of problem:

After creating a Network LoadBalancer service, accessing the LB always ends in a connection timeout.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-27-135134

How reproducible:

100%

Steps to Reproduce:

1. Create a custom ingresscontroller that uses a Network LB service

$ Domain="nlb.$(oc get dns.config cluster -o=jsonpath='{.spec.baseDomain}')"
$ oc create -f - << EOF
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: ${Domain}
  replicas: 3
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService
EOF


2. Wait until the ingress NLB service is ready.

$ oc -n openshift-ingress get svc/router-nlb
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP                                                                     PORT(S)                      AGE
router-nlb   LoadBalancer   172.30.75.134   a765a5eb408aa4a68988e35b72672379-78a76c339ded64fa.elb.us-east-2.amazonaws.com   80:31833/TCP,443:32499/TCP   117s


3. curl the network LB

$ curl a765a5eb408aa4a68988e35b72672379-78a76c339ded64fa.elb.us-east-2.amazonaws.com -I
<hang>

Actual results:

The connection times out.

Expected results:

curl should return 503

Additional info:

the NLB service has the annotation:
  service.beta.kubernetes.io/aws-load-balancer-type: nlb

 

Description of problem:

OVN to OVN migration went in via https://github.com/openshift/cluster-network-operator/pull/1584. It does not seem to work as expected on 4.12 and 4.13.

The following annotations are not added to the nodes after the oc patch:

k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip:
k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac:

Also, the network pods need to be restarted, which does not seem to happen.

The CNO config part looks OK, though:

defaultNetwork:
      ovnKubernetesConfig:
        egressIPConfig: {}
        gatewayConfig:
          routingViaHost: false
        genevePort: 6081
        hybridOverlayConfig:
          hybridClusterNetwork:
          - cidr: 10.132.0.0/14
            hostPrefix: 23

Version-Release number of selected component (if applicable):

4.12 and 4.13

How reproducible:

Always

Steps to Reproduce:

1. Bring up an OVN cluster.
2. Migrate to OVN-H using the following oc patch:

oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork": [{"cidr": "10.132.0.0/14", "hostPrefix": 23}]}}}}}' 

Actual results:

The cluster does not seem to migrate successfully to OVN-H.

Expected results:

Cluster should migrate to OVN-H runtime successfully
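
One way to confirm whether the migration took effect is to check every node for the two hybrid-overlay annotations named in the description. The sketch below assumes the input is the parsed JSON of `oc get nodes -o json`; the function name is hypothetical and the annotation keys are the ones listed in this report.

```python
import json

REQUIRED_ANNOTATIONS = (
    "k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip",
    "k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac",
)


def nodes_missing_hybrid_annotations(node_list):
    """Map node name -> list of required annotation keys that are absent or empty."""
    missing = {}
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        annotations = meta.get("annotations") or {}
        absent = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
        if absent:
            missing[meta.get("name", "<unknown>")] = absent
    return missing


if __name__ == "__main__":
    import sys
    # Usage (assumed): oc get nodes -o json | python3 check_annotations.py
    print(json.dumps(nodes_missing_hybrid_annotations(json.load(sys.stdin)), indent=2))
```

An empty result would indicate that all nodes carry both annotations; any entry points at a node the migration did not reach.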

Additional info:

Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.2792199018664948416/

Description of problem:

Clicking the logo in the masthead always goes to `/dashboards`, even if metrics are disabled. If metrics are disabled, `Home > Overview` is removed from the left nav, but clicking the logo in the masthead still allows you to navigate there.

Description of problem:

4.2 AWS boot images such as ami-01e7fdcb66157b224 include the old ignition.platform.id=ec2 kernel command line parameter. When launched against 4.12.0-rc.3, new machines fail with:

  1. The old user-data and old AMI successfully get to the machine-config-server request stage.
  2. The new instance will then request the full Ignition from /config/worker , and the machine-config server translates that to the old Ignition v2 spec format.
  3. The instance will lay down that Ignition-formatted content, and then try and reboot into the new state.
  4. Coming back up in the new state, the modern Afterburn comes up to try and figure out a node name for the kubelet, and this fails with unknown provider 'ec2'.

Version-Release number of selected component (if applicable):

coreos-assemblers used ignition.platform.id=ec2, but pivoted to =aws here. It's not clear when that made its way into new AWS boot images. Some time after 4.2 and before 4.6.

Afterburn dropped support for legacy command-line options like the ec2 slug in 5.0.0. But it's not clear when that shipped into RHCOS. The release controller points at this RHCOS diff, but that has afterburn-0-5.3.0-1 builds on both sides.

How reproducible:

100%, given a sufficiently old AMI and a sufficiently new OpenShift release target.

Steps to Reproduce:

  1. Install 4.12.0-rc.3 or similar new OpenShift on AWS in us-east-1.
  2. Create Ignition v2 user-data in a Secret in openshift-machine-api. I'm fuzzy on how to do that portion easily, since it's basically RFE-3001 backwards.
  3. Edit a compute MachineSet to set spec.template.spec.providerSpec.value.ami to id: ami-01e7fdcb66157b224 and also point it at your v2 user-data Secret.
  4. Possibly delete an existing Machine in that MachineSet, or raise replicas, or otherwise talk the MachineSet controller into provisioning a new Machine to pick up the reconfigured AMI.

Actual results:

The new Machine will get to Provisioned but fail to progress to Running. systemd journal logs will include unknown provider 'ec2' for Afterburn units.

Expected results:

Old boot-image AMIs can successfully update to 4.12.

Alternatively, we pin down the set of exposed boot images sufficiently that users with older clusters can audit for exposure and avoid the issue by updating to more modern boot images (although updating boot images is not trivial, see RFE-3001 and the Ignition spec 2 to 3 transition discussed in kcs#5514051).
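
For the audit option, exposure comes down to whether a node's kernel command line still carries the legacy platform ID. A minimal sketch of that check follows, assuming the command line is available as a string (for example from /proc/cmdline on the node); the function names are hypothetical, and the ec2-to-aws mapping is the only pair taken from this report.

```python
# From the report: old boot images use ignition.platform.id=ec2, newer ones use aws.
LEGACY_PLATFORM_IDS = {"ec2": "aws"}


def platform_id(cmdline):
    """Return the value of ignition.platform.id from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "ignition.platform.id" and sep:
            return value
    return None


def is_exposed(cmdline):
    """True if the command line carries a legacy platform ID."""
    return platform_id(cmdline) in LEGACY_PLATFORM_IDS


if __name__ == "__main__":
    old = "BOOT_IMAGE=/vmlinuz rw ignition.platform.id=ec2 console=ttyS0"
    print(platform_id(old), is_exposed(old))
```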

 https://redhat-internal.slack.com/archives/C014N2VLTQE/p1673948811821979

 

Job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/branch-ci-openshift-assisted-service-master-edge-publish-python-client/1615200110763839488, fails with:

b"vcversioner: ['git', 'describe', '--tags', '--long'] failed and '/home/assisted-service/build/ci-op-pb304424/assisted-service-client/version.txt' isn't present.\nvcversioner: are you installing from a github tarball?\nvcversioner: -- VCS output follows --\nvcversioner: fatal: detected dubious ownership in repository at '/home/assisted-service'\nvcversioner: To add an exception for this directory, call:\nvcversioner: \nvcversioner: \tgit config --global --add safe.directory /home/assisted-service\n" 

We need to mark the repo "safe": 

git config --system --add safe.directory '*' 

Description of problem:

The Start maintenance action moved from the Nodes tab to the Bare Metal Hosts tab.

Version-Release number of selected component (if applicable):

Cluster version is 4.12.0-0.nightly-2022-11-15-024309

How reproducible:

100%

Steps to Reproduce:

1. Install Node Maintenance operator
2. Go to Compute -> Nodes
3. Start maintenance from the three-dot menu of worker-0-0
see https://docs.openshift.com/container-platform/4.11/nodes/nodes/eco-node-maintenance-operator.html#eco-setting-node-maintenance-actions-web-console_node-maintenance-operator

Actual results:

No 'Start maintenance' option

Expected results:

Maintenance started successfully

Additional info:

This worked in 4.11.

 

 

Description of problem:

Backport of https://issues.redhat.com/browse/SDN-3597

Version-Release number of selected component (if applicable):

4.13

This is a clone of issue OCPBUGS-12153. The following is the description of the original issue:

Description of problem:

When HyperShift HostedClusters are created with "OLMCatalogPlacement" set to "guest" and if the desired release is pre-GA, the CatalogSource pods cannot pull their images due to using unreleased images.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Common

Steps to Reproduce:

1. Create a HyperShift 4.13 HostedCluster with spec.OLMCatalogPlacement = "guest"
2. See the openshift-marketplace/community-operator-* pods in the guest cluster in ImagePullBackOff

Actual results:

The openshift-marketplace/community-operator-* pods in the guest cluster are in ImagePullBackOff.

Expected results:

All CatalogSource pods should be running and should use n-1 images if the release is pre-GA.
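
The "n-1 if pre-GA" rule in the expected results can be sketched as a small version calculation. This is an illustrative assumption about how the rule could be expressed, not HyperShift code: the function name and the `is_ga` flag are hypothetical, and how GA status is determined is left out.

```python
def catalog_version(release_version, is_ga):
    """Return the 'major.minor' catalog version to use for a release.

    For a pre-GA release, fall back to the previous minor version (n-1).
    """
    major, minor = (int(part) for part in release_version.split(".")[:2])
    if not is_ga and minor > 0:
        minor -= 1
    return f"{major}.{minor}"


if __name__ == "__main__":
    print(catalog_version("4.13.0-rc.1", is_ga=False))
    print(catalog_version("4.13.2", is_ga=True))
```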

Additional info:

 

Description of problem:

The openshift-marketplace namespace was recently updated to add the "audit" and "warn" PSA labels with the value of "restricted", but the "audit-version" and "warn-version" labels were not specified.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Always

Steps to Reproduce:

1. Launch any OpenShift cluster
2. Note the missing labels on the openshift-marketplace namespace

Actual results:

The labels do not exist.

Expected results:

The "warn-version" and "audit-version" labels should be set to the current k8s version.

Additional info:

Because these labels are set to the k8s PSA version, when that version is updated, the version labels here (including "enforce-version" label) will need to be updated to match.
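
A quick way to spot this class of problem is to look for Pod Security Admission mode labels that lack a matching "-version" label. The sketch below works on a plain dictionary of namespace labels; the function name is hypothetical, while the `pod-security.kubernetes.io/` label keys are the standard PSA label names.

```python
PSA_PREFIX = "pod-security.kubernetes.io/"
PSA_MODES = ("enforce", "audit", "warn")


def modes_missing_version(labels):
    """Return the PSA modes that are set on a namespace without a '<mode>-version' label."""
    return [
        mode
        for mode in PSA_MODES
        if PSA_PREFIX + mode in labels and PSA_PREFIX + mode + "-version" not in labels
    ]


if __name__ == "__main__":
    labels = {
        PSA_PREFIX + "audit": "restricted",
        PSA_PREFIX + "warn": "restricted",
    }
    print(modes_missing_version(labels))
```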

Description of problem:

Revert "force cert rotation every couple days for development" in 4.13.

Below are the steps to verify this bug:

# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator
  cluster-kube-apiserver-operator                https://github.com/openshift/cluster-kube-apiserver-operator                7764681777edfa3126981a0a1d390a6060a840a3

# git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307"
08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation

# oc get clusterversions.config.openshift.io 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         64m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133

$ cat scripts/check_secret_expiry.sh
FILE="$1"
# Fail with a non-zero status when the list file is missing.
if [ ! -f "$FILE" ]; then
  echo "must provide \$1" >&2
  exit 1
fi
export IFS=$'\n'
# IFS is a newline, so the loop iterates over whole lines of the file.
for i in $(cat "$FILE")
do
  # Skip comment lines.
  if echo "$i" | grep -q "^#"; then
    continue
  fi
  NS=`echo $i | cut -d ' ' -f 1`
  SECRET=`echo $i | cut -d ' ' -f 2`
  rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null
  echo "Check cert dates of $SECRET in project $NS:"
  openssl x509 -noout --dates -in tls.crt; echo
done

$ cat certs.txt
openshift-kube-controller-manager-operator csr-signer-signer
openshift-kube-controller-manager-operator csr-signer
openshift-kube-controller-manager kube-controller-manager-client-cert-key
openshift-kube-apiserver-operator aggregator-client-signer
openshift-kube-apiserver aggregator-client
openshift-kube-apiserver external-loadbalancer-serving-certkey
openshift-kube-apiserver internal-loadbalancer-serving-certkey
openshift-kube-apiserver service-network-serving-certkey
openshift-config-managed kube-controller-manager-client-cert-key
openshift-config-managed kube-scheduler-client-cert-key
openshift-kube-scheduler kube-scheduler-client-cert-key

Checking the certs: they have one-day expiry times, which is as expected.
# ./check_secret_expiry.sh certs.txt
Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:41:38 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of csr-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:52:21 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator:
notBefore=Jun 27 04:41:37 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of aggregator-client in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:49 2022 GMT
notAfter=Jul 27 04:52:50 2022 GMT

Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:28 2022 GMT
notAfter=Jul 27 04:52:29 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT
# 

# cat check_secret_expiry_within.sh
#!/usr/bin/env bash
# usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year
WITHIN=${1:-24hours}
echo "Checking validity within $WITHIN ..."
oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before")  \(.metadata.annotations."auth.openshift.io/certificate-not-after")  \(.metadata.namespace)\t\(.metadata.name)"'

# ./check_secret_expiry_within.sh 1day
Checking validity within 1day ...
2022-06-27T04:41:37Z  2022-06-28T04:41:37Z  openshift-kube-apiserver-operator	aggregator-client-signer
2022-06-27T04:52:26Z  2022-06-28T04:41:37Z  openshift-kube-apiserver	aggregator-client
2022-06-27T04:52:21Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer
2022-06-27T04:41:38Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer-signer
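
The jq filter in check_secret_expiry_within.sh can also be sketched in Python, which makes the selection logic easier to read: keep every secret whose "auth.openshift.io/certificate-not-after" annotation falls at or before now plus the window. The function name and the `now` parameter are hypothetical; the input is assumed to be the parsed JSON of `oc get secret -A -o json`.

```python
from datetime import datetime, timedelta, timezone

NOT_AFTER = "auth.openshift.io/certificate-not-after"


def expiring_within(secret_list, window, now=None):
    """Return (namespace, name, not_after) for secrets expiring within `window`."""
    now = now or datetime.now(timezone.utc)
    deadline = now + window
    hits = []
    for item in secret_list.get("items", []):
        meta = item.get("metadata", {})
        value = (meta.get("annotations") or {}).get(NOT_AFTER)
        if value is None:
            continue  # secret carries no certificate annotation
        not_after = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if not_after <= deadline:
            hits.append((meta.get("namespace"), meta.get("name"), value))
    return hits


if __name__ == "__main__":
    import json
    import sys
    # Usage (assumed): oc get secret -A -o json | python3 expiring.py
    for row in expiring_within(json.load(sys.stdin), timedelta(days=1)):
        print(*row)
```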

Description of problem:

A new version of CoreDNS, 1.10.1, is out with some desirable fixes.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

oc exec $(oc get pods -n openshift-dns --no-headers | awk '{print $1}' | head -1) -n openshift-dns -c dns -- bash -c "/usr/bin/coredns --version"

Actual results:

CoreDNS-1.10.0
linux/amd64, go1.19.4, 

Expected results:

CoreDNS-1.10.1
linux/amd64, go1.19.4, 

Additional info:

 

Description of problem:

Upgrade OCP 4.11 --> 4.12 fails with one 'NotReady,SchedulingDisabled' node and MachineConfigDaemonFailed.

Version-Release number of selected component (if applicable):

Upgrade from OCP 4.11.0-0.nightly-2022-09-19-214532 on top of OSP RHOS-16.2-RHEL-8-20220804.n.1 to 4.12.0-0.nightly-2022-09-20-040107.

Network Type: OVNKubernetes

How reproducible:

Twice out of two attempts.

Steps to Reproduce:

1. Install OCP 4.11.0-0.nightly-2022-09-19-214532 (IPI) on top of OSP RHOS-16.2-RHEL-8-20220804.n.1.
   The cluster is up and running with three workers:
   $ oc get clusterversion
   NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
   version   4.11.0-0.nightly-2022-09-19-214532   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-09-19-214532

2. Run the OC command to upgrade to 4.12.0-0.nightly-2022-09-20-040107:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 --allow-explicit-upgrade --force=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 

3. The upgrade does not succeed: [0]
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        True          17h     Unable to apply 4.12.0-0.nightly-2022-09-20-040107: wait has exceeded 40 minutes for these operators: network

One node degraded to 'NotReady,SchedulingDisabled' status:
$ oc get nodes
NAME                          STATUS                        ROLES    AGE   VERSION
ostest-9vllk-master-0         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-1         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-2         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-worker-0-4x4pt   NotReady,SchedulingDisabled   worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-h6kcs   Ready                         worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-xhz9b   Ready                         worker   18h   v1.24.0+3882f8f

$ oc get pods -A | grep -v -e Completed -e Running
NAMESPACE                                          NAME                                                         READY   STATUS      RESTARTS       AGE
openshift-openstack-infra                          coredns-ostest-9vllk-worker-0-4x4pt                          0/2     Init:0/1    0              18h
 
$ oc get events
LAST SEEN   TYPE      REASON                                        OBJECT            MESSAGE
7m15s       Warning   OperatorDegraded: MachineConfigDaemonFailed   /machine-config   Unable to apply 4.12.0-0.nightly-2022-09-20-040107: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
7m15s       Warning   MachineConfigDaemonFailed                     /machine-config   Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
baremetal                                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-controller-manager                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-credential                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cluster-autoscaler                         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
config-operator                            4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
console                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
control-plane-machine-set                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
dns                                        4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd                                       4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
image-registry                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     Progressing: The registry is ready...
ingress                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
insights                                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-apiserver                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 13
kube-controller-manager                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-scheduler                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-api                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-approver                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-config                             4.11.0-0.nightly-2022-09-19-214532   False       True          True       16h     Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
monitoring                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
network                                    4.12.0-0.nightly-2022-09-20-040107   True        True          True       19h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-09-20T14:16:13Z...
node-tuning                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-apiserver                        4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
openshift-controller-manager               4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-samples                          4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
service-ca                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
storage                                    4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...

[0] http://pastebin.test.redhat.com/1074531

Actual results:

OCP 4.11 --> 4.12 upgrade fails.

Expected results:

The OCP 4.11 --> 4.12 upgrade succeeds.
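
When triaging a failure like this, the daemonset counters embedded in the operator message are the quickest signal. The sketch below pulls them out of a message such as the MachineConfigDaemonFailed text above; the function name is hypothetical and the pattern is an assumption based only on the message format shown in this report.

```python
import re


def parse_rollout_status(message):
    """Extract the 'status: (key: N, ...)' counters from an operator message as a dict."""
    match = re.search(r"status: \(([^)]*)\)", message)
    if not match:
        return {}
    counters = {}
    for pair in match.group(1).split(","):
        key, _, value = pair.partition(":")
        counters[key.strip()] = int(value)
    return counters


if __name__ == "__main__":
    msg = ("daemonset machine-config-daemon is not ready. "
           "status: (desired: 6, updated: 6, ready: 5, unavailable: 1)")
    print(parse_rollout_status(msg))
```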

Additional info:

Attached logs of the NotReady node - journalctl_ostest-9vllk-worker-0-4x4pt.log.tar.gz

Description of problem:

When providing the openshift-install agent create command with installconfig + agentconfig manifests that contain the InstallConfig Proxy section, the Proxy configuration does not get configured cluster-wide.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Define InstallConfig with Proxy section
2. openshift-install agent create image
3. Boot ISO
4. Check /etc/assisted/manifests for agent-cluster-install.yaml to contain the Proxy section

Actual results:

The Proxy section is missing.

Expected results:

Proxy should be present and match with the InstallConfig
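
The expected result can be checked mechanically by comparing the proxy settings in the two manifests once they are parsed into dictionaries. This is a minimal sketch under assumptions: the function name is hypothetical, and it assumes the proxy lives under `proxy` in the InstallConfig and under `spec.proxy` in agent-cluster-install.yaml, with `httpProxy`, `httpsProxy` and `noProxy` keys in both.

```python
# Assumed proxy keys, shared by both manifests.
PROXY_KEYS = ("httpProxy", "httpsProxy", "noProxy")


def proxy_mismatches(install_config, agent_cluster_install):
    """Return {key: (expected, actual)} for proxy settings that differ."""
    want = install_config.get("proxy") or {}
    have = (agent_cluster_install.get("spec") or {}).get("proxy") or {}
    return {
        key: (want.get(key), have.get(key))
        for key in PROXY_KEYS
        if want.get(key) != have.get(key)
    }


if __name__ == "__main__":
    ic = {"proxy": {"httpProxy": "http://proxy.example.com:3128", "noProxy": ".cluster.local"}}
    aci = {"spec": {}}  # the situation described in this report: no Proxy section
    print(proxy_mismatches(ic, aci))
```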

Additional info:

 

I am using OCP 4.11.0:

$ oc version
Client Version: 4.10.25
Server Version: 4.11.0
Kubernetes Version: v1.24.0+9546431

I added my private CA certificate to the CA bundle as per documentation: Updating the CA bundle

After that I can see an intermittent error:

$ oc get kubeapiservers.operator.openshift.io cluster -o yaml
…
            lastTransitionTime: "2022-08-27T20:32:39Z"
            message: "alertmanagerconfigs.monitoring.coreos.com: x509: certificate signed by unknown authority"
            reason: WebhookServiceConnectionError
            status: "True"
            type: CRDConversionWebhookConfigurationError
…

In the Kubernetes audit logs, I can see that two controllers (cluster-version-operator and service-ca) are overwriting each other's changes to the alertmanagerconfigs.monitoring.coreos.com crd:

$ oc get crd alertmanagerconfigs.monitoring.coreos.com
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  …
  name: alertmanagerconfigs.monitoring.coreos.com
  …
spec:
  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        caBundle: LS0tLS1CRUdJTi …
        service:
…

The service-ca controller adds the caBundle field to the crd resource. The cluster-version-operator removes it. This continues periodically.

I reviewed the crd definition from inside of the cluster-version-operator container:

$ oc rsh -n openshift-cluster-version cluster-version-operator-796d5bc86b-52qjw
$ cat /release-manifests/0000_50_cluster-monitoring-operator_00_0alertmanager-config-custom-resource-definition.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.8.0
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    service.beta.openshift.io/inject-cabundle: "true"
  creationTimestamp: null
  name: alertmanagerconfigs.monitoring.coreos.com
spec:
  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        service:
          name: prometheus-operator-admission-webhook
          namespace: openshift-monitoring
          path: /convert
          port: 8443
      conversionReviewVersions:
      - v1beta1
      - v1alpha1
  group: monitoring.coreos.com
  names:
…

The definition above includes the webhook configuration fields. This is probably the reason why the cluster-version-operator overwrites the changes made by the service-ca controller.
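
The suspected tug-of-war can be illustrated with a small sketch: one apply strategy replaces the whole conversion stanza with the manifest's copy and so drops the injected caBundle, while the other keeps it when the manifest asks for CA injection. Both functions are hypothetical illustrations of the two behaviours, not the cluster-version-operator's actual reconcile code.

```python
import copy

INJECT_ANNOTATION = "service.beta.openshift.io/inject-cabundle"


def apply_overwrite(live, manifest):
    """Naive apply: replace the live conversion stanza with the manifest's copy."""
    out = copy.deepcopy(live)
    out["spec"]["conversion"] = copy.deepcopy(manifest["spec"]["conversion"])
    return out


def apply_preserving_cabundle(live, manifest):
    """Apply the manifest but keep an injected caBundle when injection is requested."""
    out = apply_overwrite(live, manifest)
    annotations = manifest.get("metadata", {}).get("annotations", {})
    if annotations.get(INJECT_ANNOTATION) == "true":
        live_client = live["spec"]["conversion"].get("webhook", {}).get("clientConfig", {})
        ca = live_client.get("caBundle")
        if ca is not None:
            client = out["spec"]["conversion"].setdefault("webhook", {}).setdefault("clientConfig", {})
            client["caBundle"] = ca
    return out
```

With the first strategy every reconcile removes the field and the service-ca controller has to add it back, which matches the periodic flapping seen in the audit logs.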

Note that I filed a similar bug report here: https://issues.redhat.com/browse/PSAP-889

Description of problem:

In Agent TUI, setting

IPV6 Configuration to Automatic

and enabling

Require IPV6 addressing for this connection

generates a message saying that the feature is not supported. The user is still allowed to quit the TUI, and the boot process then proceeds with an unsupported, non-working network configuration. (Quitting is formally correct, given that 'Quit' was selected from the menu, but I wonder whether the 'Quit' option should remain greyed out until a valid config is applied.)

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-07-131556 

How reproducible:

 

Steps to Reproduce:

1. Feed the agent ISO with an agent-config.yaml file that defines an ipv6 only, static network configuration

2. Boot from the generated agent ISO, wait for the agent TUI to appear, select 'Edit a connection', then change the IPv6 configuration from Manual to Automatic and, at the same time, enable the 'Require IPV6 addressing for this connection' option. Accept the changes.

3. (Not sure if this step is necessary) Once back in the main agent TUI screen, select 'Activate a connection'.
Select the currently active connection, de-activate and re-activate it.

4. Go back to main agent TUI screen, select Quit

Actual results:

The agent TUI displays the following message, then quits:

Failed to generate network state view: support for multiple default routes not yet implemented in agent-tui

Once the TUI quits, the boot process proceeds

Expected results:

The TUI blocks the possibility to enable unsupported configurations

The agent TUI informs the user about the unsupported configuration the moment it is applied (instead of when the user selects 'Quit') and stays open until a valid network configuration is applied

The TUI should put the boot process on hold until a valid network config is applied

Additional info:

OCP Version: 4.13.0-0.nightly-2023-03-07-131556 

agent-config.yaml snippet

  networkConfig:
    interfaces:
      - name: eno1
        type: ethernet
        state: up
        mac-address: 34:73:5A:9E:59:10
        ipv6:
          enabled: true
          address:
            - ip: 2620:52:0:1eb:3673:5aff:fe9e:5910
              prefix-length: 64
          dhcp: false

Description of problem:

We need to include the `openshift_apps_deploymentconfigs_strategy_total` metric in the IO archive file.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a cluster
2. Download the IO archive
3. Check the file `config/metrics`
4. You must find `openshift_apps_deploymentconfigs_strategy_total` inside of it

Actual results:

 

Expected results:

You should see `openshift_apps_deploymentconfigs_strategy_total` in the `config/metrics` file.
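The check in the steps above can be scripted. The following is a minimal sketch that assumes `config/metrics` is in the Prometheus text exposition format (one sample per line, `#` lines being comments); the helper name is hypothetical and not part of the Insights Operator.

```python
def metric_present(metrics_text: str, name: str) -> bool:
    """Return True if any sample line in `metrics_text` belongs to metric `name`."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample looks like `name{labels} value` or `name value`.
        parts = line.split("{", 1)[0].split()
        if parts and parts[0] == name:
            return True
    return False
```

For example, `metric_present(open("config/metrics").read(), "openshift_apps_deploymentconfigs_strategy_total")` would be run against the extracted archive.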

Additional info:

 

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
Metal³ is planning to allow these paths in the `name` hint (see OCPBUGS-13080), and assisted's implementation of root device hints (which is used in ZTP and the agent-based installer) should be changed to match.
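As a rough illustration of the requested behaviour, the sketch below accepts a `name` hint that equals either the kernel device name or one of the disk's /dev/disk/by-path symlinks. The `Disk` type and `matches_name_hint` function are hypothetical and do not reflect the actual Metal³ or assisted-service data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Disk:
    name: str                                          # kernel name, e.g. /dev/sda
    by_path: List[str] = field(default_factory=list)   # stable by-path symlinks


def matches_name_hint(disk: Disk, hint: str) -> bool:
    """Accept the hint if it names the device directly or via a by-path link."""
    return hint == disk.name or hint in disk.by_path
```

Matching on the symlink means the hint keeps working even if the kernel enumerates the disks in a different order on the next boot.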

The following are specific test cases that sporadically have the same timeout error below.

  • CRD extensions.ConsoleNotification CRD: displays YAML editor for creating a new ConsoleNotification instance and creates it
  • CRD extensions.ConsoleExternalLogLink CRD: deletes the ConsoleExternalLogLink instance
  • CRD extensions.ConsoleLink CRD: deletes the ConsoleLink help menu instance
  • CRD extensions.ConsoleCLIDownload CRD: deletes the ConsoleCLIDownload instance
  • CRD extensions.ConsoleLink CRD: displays YAML editor for creating a new ConsoleLink help menu instance and creates it
  • CRD extensions.ConsoleExternalLogLink CRD: displays YAML editor for adding namespaceFilter to the ConsoleExternalLogLink instance
  • CRD extensions.ConsoleLink CRD: displays the ConsoleLink instance in the user menu
  • CRD extensions.ConsoleExternalLogLink CRD: displays YAML editor for creating a new ConsoleExternalLogLink instance and creates it
  • CRD extensions.ConsoleNotification CRD: deletes the ConsoleNotification instance
  • CRD extensions.ConsoleLink CRD: deletes the ConsoleLink user menu instance

https://search.ci.openshift.org/?search=Async+callback+was+not+invoked+within+timeout+specified+by+jasmine.DEFAULT_TIMEOUT_INTERVAL&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

{Error: Timeout - Async callback was not invoked within timeout specified by jasmine.DEFAULT_TIMEOUT_INTERVAL. exception Error: Timeout - Async callback was not invoked within timeout specified by jasmine.DEFAULT_TIMEOUT_INTERVAL.
    at Timeout._onTimeout (/go/src/github.com/openshift/console/frontend/node_modules/jasmine/node_modules/jasmine-core/lib/jasmine-core/jasmine.js:4281:23)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7)
   }

 

Run multi-stage test e2e-gcp-console - e2e-gcp-console-test container test    27m28s
{  l/timers.js:497:7
   
   Error: Timeout - Async callback was not invoked within timeout specified by jasmine.DEFAULT_TIMEOUT_INTERVAL.
       at listOnTimeout internal/timers.js:554:17
       at processTimers internal/timers.js:497:7
   
   Error: Timeout - Async callback was not invoked within timeout specified by jasmine.DEFAULT_TIMEOUT_INTERVAL.
       at listOnTimeout internal/timers.js:554:17
       at processTimers internal/timers.js:497:7
   Pending Specs:
   1) Deploy Image Deploy Image page : should deploy the image and display it in the topology
   Reason: Temporarily disabled with xit
Summary:
Suites:  44 of 44
Specs:   69 of 108 (1 pending, 38 disabled)
Expects: 206 (3 failures)
Finished in 745.856 seconds
Last browser URL:  https://console-openshift-console.apps.ci-op-p2dc39lb-75d12.XXXXXXXXXXXXXXXXXXXXXX/k8s/cluster/customresourcedefinitions?custom-resource-definition-name=ConsoleLink
[09:53:21] I/launcher - 0 instance(s) of WebDriver still running
[09:53:21] I/launcher - chrome #01 failed 1 test(s)
[09:53:21] I/launcher - overall: 1 failed spec(s)
Closing report
[09:53:21] E/launcher - Process exited with error code 1
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
+ copyArtifacts
+ '[' -d /logs/artifacts ']'
+ '[' -d frontend/gui_test_screenshots ']'
++ pwd
+ echo 'Copying artifacts from /go/src/github.com/openshift/console...'
Copying artifacts from /go/src/github.com/openshift/console...
+ cp -r frontend/gui_test_screenshots /logs/artifacts/gui_test_screenshots
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:79","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-10-18T09:53:22Z"}
error: failed to execute wrapped command: exit status 1
}

Run multi-stage test test phase    27m51s
{  "e2e-gcp-console" pod "e2e-gcp-console-test" failed: the pod ci-op-p2dc39lb/e2e-gcp-console-test failed after 27m49s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
Link to step on registry info site: https://steps.ci.openshift.org/reference/test
Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift&repo=console&branch=master&test=e2e-gcp-console}

Description of problem:

The metal3-ironic container image in OKD fails during steps in configure-ironic.sh that look for additional Oslo configuration entries as environment variables to configure the Ironic instance. The mechanism by which it fails in OKD but not OpenShift is that the image for OpenShift happens to have unrelated variables set which match the regex, because it is based on the builder image, but the OKD image is based only on a stream8 image without these unrelated OS_ prefixed variables set.

The metal3 pod created in response to even a provisioningNetwork: Disabled Provisioning object will therefore crashloop indefinitely.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Deploy OKD to a bare metal cluster using the assisted-service, with the OKD ConfigMap applied to podman play kube, as in: https://github.com/openshift/assisted-service/tree/master/deploy/podman#okd-configuration
2. Observe the state of the metal3 pod in the openshift-machine-api namespace.

Actual results:

The metal3-ironic container repeatedly exits with a nonzero status, with the logs ending here:

++ export IRONIC_URL_HOST=10.1.1.21
++ IRONIC_URL_HOST=10.1.1.21
++ export IRONIC_BASE_URL=https://10.1.1.21:6385
++ IRONIC_BASE_URL=https://10.1.1.21:6385
++ export IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ '[' '!' -z '' ']'
++ '[' -f /etc/ironic/ironic.conf ']'
++ cp /etc/ironic/ironic.conf /etc/ironic/ironic.conf_orig
++ tee /etc/ironic/ironic.extra
# Options set from Environment variables
++ echo '# Options set from Environment variables'
++ env
++ grep '^OS_'
++ tee -a /etc/ironic/ironic.extra

Expected results:

The metal3-ironic container starts and the metal3 pod is reported as ready.

Additional info:

This is the PR that introduced pipefail to the downstream ironic-image, which is not yet accepted in the upstream:
https://github.com/openshift/ironic-image/pull/267/files#diff-ab2b20df06f98d48f232d90f0b7aa464704257224862780635ec45b0ce8a26d4R3

This is the line that's failing:
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/scripts/configure-ironic.sh#L57

This is the image base that OpenShift uses for ironic-image (before rewriting in ci-operator):
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.ocp#L9

Here is where the relevant environment variables are set in the builder images for OCP:
https://github.com/openshift/builder/blob/973602e0e576d7eccef4fc5810ba511405cd3064/hack/lib/build/version.sh#L87

Here is the final FROM line in the OKD image build (just stream8):
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.okd#L9

This results in the following differences between the two images:
$ podman run --rm -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:519ac06836d972047f311de5e57914cf842716e22a1d916a771f02499e0f235c -c 'env | grep ^OS_'
OS_GIT_MINOR=11
OS_GIT_TREE_STATE=clean
OS_GIT_COMMIT=97530a7
OS_GIT_VERSION=4.11.0-202210061001.p0.g97530a7.assembly.stream-97530a7
OS_GIT_MAJOR=4
OS_GIT_PATCH=0
$ podman run --rm -it --entrypoint bash quay.io/openshift/okd-content@sha256:6b8401f8d84c4838cf0e7c598b126fdd920b6391c07c9409b1f2f17be6d6d5cb -c 'env | grep ^OS_'

Here is what the OS_ prefixed variables should be used for:
https://github.com/metal3-io/ironic-image/blob/807a120b4ce5e1675a79ebf3ee0bb817cfb1f010/README.md?plain=1#L36
https://opendev.org/openstack/oslo.config/src/commit/84478d83f87e9993625044de5cd8b4a18dfcaf5d/oslo_config/sources/_environment.py

It's worth noting that ironic.extra is not consumed anywhere, and is simply being used here to save off the variables that Oslo _might_ be consuming (it won't consume the variables that are present in the OCP builder image, though they do get caught by this regex).

With pipefail set, grep returns non-zero when it fails to find an environment variable that matches the regex, as in the case of the OKD ironic-image builds.
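The effect can be reproduced outside the image. The sketch below drives bash from Python, assuming bash, env, grep and tee are available on PATH, and writes to /dev/null rather than /etc/ironic/ironic.extra. The last variant shows one possible guard (`grep ... || true`); it is an illustration only and not necessarily the fix that was applied to ironic-image.

```python
import os
import subprocess

# The pipeline from configure-ironic.sh, redirected to /dev/null so the
# sketch has no side effects.
PIPELINE = "env | grep '^OS_' | tee -a /dev/null"


def exit_status(script: str, extra_env: dict) -> int:
    """Run `script` in bash with a minimal environment and return its status."""
    env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}
    env.update(extra_env)
    done = subprocess.run(["bash", "-c", script], env=env,
                          stdout=subprocess.DEVNULL)
    return done.returncode


no_vars = {}                       # like the OKD image: nothing matches ^OS_
ocp_vars = {"OS_GIT_MAJOR": "4"}   # like the OCP image

# Without pipefail the status is that of tee, so the missing match is hidden.
without_pipefail = exit_status(PIPELINE, no_vars)
# With pipefail the non-zero status of grep becomes the pipeline status.
with_pipefail = exit_status("set -o pipefail; " + PIPELINE, no_vars)
with_pipefail_ocp = exit_status("set -o pipefail; " + PIPELINE, ocp_vars)
# One possible guard: treat "no match" as success.
guarded = exit_status(
    "set -o pipefail; env | { grep '^OS_' || true; } | tee -a /dev/null",
    no_vars,
)
```

Only the `with_pipefail` case with no `OS_` variables returns a non-zero status, which matches the difference observed between the OKD and OCP images.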

 

Description of problem:

Form footer buttons are misaligned in web terminal form, see attached screenshots

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install the web terminal operator
2. Log in as a self-provisioner user
3. Click on the terminal icon in the page header

Actual results:

Form footer is not aligned properly

Expected results:

Form footer should be properly aligned

Additional info:

 

Description of problem:

The hint text for the data-test="name-filter-input" field is not displayed in full on the Observe -> Targets page, even if the user resizes the window or uses another device

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-27-165107

How reproducible:

Always 

Steps to Reproduce:

1. Navigate to Observe -> Metrics targets page
2. Check if the 'aria-label'/'placeholder' information can be shown fully
3.

Actual results:

Text hint is not complete

Expected results:

As with other input components, the message on the Metrics targets page should be shown in full

Additional info:

 

This is a clone of issue OCPBUGS-8691. The following is the description of the original issue:

Description of problem:

In hypershift context:
Operands managed by Operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265

These operands running on the management side should honour the same affinity, tolerations, node selector and priority rules as the operator.
This could be done by looking at the operator deployment itself or at the HCP resource.

aws-ebs-csi-driver-controller
aws-ebs-csi-driver-operator
csi-snapshot-controller
csi-snapshot-webhook
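The first option mentioned above (looking at the operator deployment itself) could look roughly like the following sketch. Deployments are modelled as plain dicts, and `inherit_scheduling` is a hypothetical helper, not a function from hypershift or from the storage operators; the field names are the standard pod spec fields.

```python
import copy

# Pod spec fields that carry the scheduling opinions listed above.
SCHEDULING_FIELDS = ("affinity", "tolerations", "nodeSelector", "priorityClassName")


def inherit_scheduling(operator_deployment: dict, operand_deployment: dict) -> dict:
    """Return a copy of the operand whose pod template uses the operator's
    affinity, tolerations, node selector and priority class."""
    result = copy.deepcopy(operand_deployment)
    source = operator_deployment["spec"]["template"]["spec"]
    target = result["spec"]["template"]["spec"]
    for key in SCHEDULING_FIELDS:
        if key in source:
            target[key] = copy.deepcopy(source[key])
    return result
```

Reading the same values from the HCP resource instead would follow the same pattern with a different source object.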


Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
3.

Actual results:

Operands are missing affinity rules and node selector

Expected results:

Operands have the same affinity rules and node selector as the operator

Additional info:

 

This is a clone of issue OCPBUGS-8268. The following is the description of the original issue:

Description of problem:

The PipelineRun list has a Duration column, but the TaskRun list inside a PipelineRun does not

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Have OpenShift Pipeline with 2+ tasks configured and invoked

Steps to Reproduce:

1. Once a PipelineRun is invoked, navigate to the invoked TaskRuns
2. You will see columns such as Status and Started, but no Duration

Actual results:

 

Expected results:

 

Additional info:

I'll add screenshots for PipelineRuns and TaskRuns

Description of problem:

Customer using a screen reader reports that Console nav toggle button reports expanded in both expanded and not expanded states

Version-Release number of selected component (if applicable):

4.13

How reproducible:

every time

Steps to Reproduce:

1. toggle main nav on and off 
2. note that the reported expanded state doesn't change
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When CMO restarts after all monitoring components are deployed, the event recorder would use the alertmanager-main statefulset as the object's reference in events.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Deploy a cluster
2. Delete the openshift-monitoring/metrics-client-certs secret 
3. oc get events -n openshift-monitoring

Actual results:

Events about certificate generation are related to statefulset/alertmanager-main

Expected results:

Events should be related to deployment/cluster-monitoring-operator

Additional info:

 

 

This is a clone of issue OCPBUGS-9982. The following is the description of the original issue:

Description of problem:

In the assisted-installer flow, the bootkube service is started on the Live ISO, so the root FS is read-only. The OKD installer attempts to pivot the booted OS to machine-os-content via `rpm-ostree rebase`. This is not necessary, since we're already using SCOS in the Live ISO.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:
Two issues occur when setting a user-defined folder in failureDomain.
1. The installer gets an error when folder is set to the path of a user-defined folder in failureDomain.

failureDomains setting in install-config.yaml:

    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-1
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-1
        folder: /IBMCloud/vm/qe-jima
    - name: us-east-2
      region: us-east
      zone: us-east-2a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-2
        folder: /IBMCloud/vm/qe-jima
    - name: us-east-3
      region: us-east
      zone: us-east-3a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-3
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR
        folder: /IBMCloud/vm/qe-jima
    - name: us-west-1
      region: us-west
      zone: us-west-1a
      server: ibmvcenter.vmc-ci.devcluster.openshift.com
      topology:
        datacenter: datacenter-2
        computeCluster: /datacenter-2/host/vcs-mdcnc-workload-4
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR

Error message in terraform after completing ova image import:

DEBUG vsphereprivate_import_ova.import[0]: Still creating... [1m40s elapsed] 
DEBUG vsphereprivate_import_ova.import[3]: Creation complete after 1m40s [id=vm-367860] 
DEBUG vsphereprivate_import_ova.import[1]: Creation complete after 1m49s [id=vm-367863] 
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [1m50s elapsed] 
DEBUG vsphereprivate_import_ova.import[2]: Still creating... [1m50s elapsed] 
DEBUG vsphereprivate_import_ova.import[2]: Still creating... [2m0s elapsed] 
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [2m0s elapsed] 
DEBUG vsphereprivate_import_ova.import[2]: Creation complete after 2m2s [id=vm-367862] 
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [2m10s elapsed] 
DEBUG vsphereprivate_import_ova.import[0]: Creation complete after 2m20s [id=vm-367861] 
DEBUG data.vsphere_virtual_machine.template[0]: Reading... 
DEBUG data.vsphere_virtual_machine.template[3]: Reading... 
DEBUG data.vsphere_virtual_machine.template[1]: Reading... 
DEBUG data.vsphere_virtual_machine.template[2]: Reading... 
DEBUG data.vsphere_virtual_machine.template[3]: Read complete after 1s [id=42054e33-85d6-e310-7f4f-4c52a73f8338] 
DEBUG data.vsphere_virtual_machine.template[1]: Read complete after 2s [id=42053e17-cc74-7c89-f5d1-059c9030ecc7] 
DEBUG data.vsphere_virtual_machine.template[2]: Read complete after 2s [id=4205019f-26d8-f9b4-ac0c-2c073fd70b35] 
DEBUG data.vsphere_virtual_machine.template[0]: Read complete after 2s [id=4205eaf2-c727-c647-ad44-bd9ad7023c56] 
ERROR                                              
ERROR Error: error trying to determine parent targetFolder: folder '/IBMCloud/vm//IBMCloud/vm' not found 
ERROR                                              
ERROR   with vsphere_folder.folder["IBMCloud-/IBMCloud/vm/qe-jima"], 
ERROR   on main.tf line 61, in resource "vsphere_folder" "folder": 
ERROR   61: resource "vsphere_folder" "folder" {   
ERROR                                              
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "pre-bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1 
ERROR                                              
ERROR Error: error trying to determine parent targetFolder: folder '/IBMCloud/vm//IBMCloud/vm' not found 
ERROR                                              
ERROR   with vsphere_folder.folder["IBMCloud-/IBMCloud/vm/qe-jima"], 
ERROR   on main.tf line 61, in resource "vsphere_folder" "folder": 
ERROR   61: resource "vsphere_folder" "folder" {   
ERROR                                              
ERROR   

2. The installer panics when folder is set to a user-defined folder name in failure domains.

failure domain in install-config.yaml

    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-1
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-1
        folder: qe-jima
    - name: us-east-2
      region: us-east
      zone: us-east-2a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-2
        folder: qe-jima
    - name: us-east-3
      region: us-east
      zone: us-east-3a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-3
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR
        folder: qe-jima
    - name: us-west-1
      region: us-west
      zone: us-west-1a
      server: xxx
      topology:
        datacenter: datacenter-2
        computeCluster: /datacenter-2/host/vcs-mdcnc-workload-4
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR                                  

panic error message in installer:

INFO Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/releases/rhcos-4.12/412.86.202208101039-0/x86_64/rhcos-412.86.202208101039-0-vmware.x86_64.ova?sha256=' 
INFO The file was found in cache: /home/user/.cache/openshift-installer/image_cache/rhcos-412.86.202208101039-0-vmware.x86_64.ova. Reusing... 
panic: runtime error: index out of range [1] with length 1

goroutine 1 [running]:
github.com/openshift/installer/pkg/tfvars/vsphere.TFVars({{0xc0013bd068, 0x3, 0x3}, {0xc000b11dd0, 0x12}, {0xc000b11db8, 0x14}, {0xc000b11d28, 0x14}, {0xc000fe8fc0, ...}, ...})
    /go/src/github.com/openshift/installer/pkg/tfvars/vsphere/vsphere.go:79 +0x61b
github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1d1ed360, 0x5?)
    /go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:847 +0x4798
 

Based on the explanation of the folder field, a folder name should be accepted. If using a folder name is not allowed, the installer needs to validate the folder and the explain text needs to be updated.

 

sh-4.4$ ./openshift-install explain installconfig.platform.vsphere.failureDomains.topology.folder
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <string>
  folder is the name or inventory path of the folder in which the virtual machine is created/located.
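If both forms are meant to be accepted, as the explain text suggests, the value could be normalized once before it is used anywhere else. The sketch below assumes the usual vSphere inventory layout /<datacenter>/vm/<folder> for VM folders; the function name is hypothetical and this is not the installer's actual code.

```python
def folder_inventory_path(datacenter: str, folder: str) -> str:
    """Turn a folder name or inventory path into an absolute inventory path."""
    if not folder:
        raise ValueError("folder must not be empty")
    if folder.startswith("/"):
        # Already an inventory path: use it unchanged, do not prefix it again.
        return folder
    # A bare name is taken relative to the datacenter's VM folder root.
    return "/{}/vm/{}".format(datacenter, folder.strip("/"))
```

With this normalization, `qe-jima` and `/IBMCloud/vm/qe-jima` both resolve to `/IBMCloud/vm/qe-jima`, which avoids both the doubled `/IBMCloud/vm//IBMCloud/vm` prefix and the out-of-range index on a value without path separators.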
 

 

 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-20-095559

How reproducible:

always

Steps to Reproduce:

see description

Actual results:

Installation fails when a user-defined folder is set

Expected results:

Installation succeeds when a user-defined folder is set

Additional info:

 

Description of problem:

On mobile screens, the info alert on the Metrics tab of the pipeline details page is not displayed correctly.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. Create a simple pipeline and run it
2. Go to pipeline details page and move to metrics tab
3. Resize UI as per video attached
4. Info alert is not showing correctly

Actual results:

Info alert is not showing correctly

Expected results:

Info alert should show correctly

Additional info:

 

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

The controllers sometimes get stuck on listing members in failure scenarios. This is a known issue and can be mitigated by simply restarting the CEO.

A similar issue with stuck controllers, BZ 2093819, was fixed slightly differently in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103

 

This fix was contributed by lance5890, thanks a lot!

 

Poking around in 4.12.0-ec.3 CI:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1570267348072402944/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods.json | jq -r '.items[] | .metadata.name as $n | .spec.tolerations[] | select(.key == "node.kubernetes.io/not-reachable") | $n'
console-7fffd859d6-j784q
console-7fffd859d6-m8fgj
downloads-8449c756f8-47ppj
downloads-8449c756f8-b7w26
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1570267348072402944/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods.json | jq -r '.items[] | .metadata.name as $n | select($n | startswith("console-") or startswith("downloads-")).spec.tolerations[] | $n + " " + tostring' | grep -v console-operator
console-7fffd859d6-j784q {"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
console-7fffd859d6-j784q {"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
console-7fffd859d6-j784q {"effect":"NoExecute","key":"node.kubernetes.io/not-reachable","operator":"Exists","tolerationSeconds":120}
console-7fffd859d6-j784q {"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":300}
console-7fffd859d6-j784q {"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}
console-7fffd859d6-m8fgj {"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
console-7fffd859d6-m8fgj {"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
console-7fffd859d6-m8fgj {"effect":"NoExecute","key":"node.kubernetes.io/not-reachable","operator":"Exists","tolerationSeconds":120}
console-7fffd859d6-m8fgj {"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":300}
console-7fffd859d6-m8fgj {"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}
downloads-8449c756f8-47ppj {"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
downloads-8449c756f8-47ppj {"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
downloads-8449c756f8-47ppj {"effect":"NoExecute","key":"node.kubernetes.io/not-reachable","operator":"Exists","tolerationSeconds":120}
downloads-8449c756f8-47ppj {"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":300}
downloads-8449c756f8-47ppj {"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}
downloads-8449c756f8-b7w26 {"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
downloads-8449c756f8-b7w26 {"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
downloads-8449c756f8-b7w26 {"effect":"NoExecute","key":"node.kubernetes.io/not-reachable","operator":"Exists","tolerationSeconds":120}
downloads-8449c756f8-b7w26 {"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":300}
downloads-8449c756f8-b7w26 {"effect":"NoSchedule","key":"node.kubernetes.io/memory-pressure","operator":"Exists"}

node.kubernetes.io/unreachable is a well-known taint. But I haven't noticed node.kubernetes.io/not-reachable before. It seems like these console operands are the only pods to mention it. And it seems to have entered the console in co#224 without much motivational context (but I may just have missed finding a thread somewhere where the motivation was discussed).

I don't think the toleration will cause any problems, but to avoid user confusion (as I experienced before working up this ticket), it is probably worth removing node.kubernetes.io/not-reachable, both from new clusters created after the fix lands, and in old clusters born before the fix and updated into the fixed release. Both of those use-cases should be available in presubmit CI for console-operator changes.
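For reference, the first jq query above and the proposed cleanup can be expressed as follows. This is a sketch over a pods.json document loaded into a dict; the helper names are hypothetical and unrelated to the console-operator code.

```python
NOT_REACHABLE = "node.kubernetes.io/not-reachable"


def pods_tolerating(pod_list: dict, key: str = NOT_REACHABLE) -> list:
    """Names of pods whose spec carries a toleration for `key`."""
    names = []
    for pod in pod_list.get("items", []):
        tolerations = pod.get("spec", {}).get("tolerations", [])
        if any(t.get("key") == key for t in tolerations):
            names.append(pod["metadata"]["name"])
    return names


def without_toleration(tolerations: list, key: str = NOT_REACHABLE) -> list:
    """The same tolerations with every entry for `key` removed."""
    return [t for t in tolerations if t.get("key") != key]
```

Applying `without_toleration` to the console and downloads pod templates leaves the well-known unreachable, not-ready and memory-pressure tolerations untouched.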

Description of problem:

On clusters serving Route via CRD (i.e. MicroShift), Route validation does not perform the same validation as on OCP.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
EOF

route.route.openshift.io/hello-microshift serverside-applied

$ oc get route hello-microshift -o yaml

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  creationTimestamp: "2022-11-11T23:53:33Z"
  generation: 1
  name: hello-microshift
  namespace: default
  resourceVersion: "2659"
  uid: cd35cd20-b3fd-4d50-9912-f34b3935acfd
spec:
  host: hello-microshift-default.cluster.local
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: None

$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: ""
EOF

Actual results:

route.route.openshift.io/hello-microshift serverside-applied

Expected results:

The Route "hello-microshift" is invalid: spec.wildcardPolicy: Invalid value: "": field is immutable 

Additional info:

** This change will be inert on OCP, which already has the correct behavior. **
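
The expected behavior can be illustrated with a small immutability check. This is only a sketch of the rule implied by the expected error message, not the actual apiserver validation code. It assumes both arguments are complete Route objects (the stored object and the merged update), and validate_wildcard_policy_update is a hypothetical name.

```python
def validate_wildcard_policy_update(old_route, new_route):
    """Return validation errors for a Route update (an empty list means valid)."""
    old_policy = old_route.get("spec", {}).get("wildcardPolicy", "")
    new_policy = new_route.get("spec", {}).get("wildcardPolicy", "")
    if new_policy != old_policy:
        # Mirrors the error text from the expected results above.
        return ['spec.wildcardPolicy: Invalid value: "%s": field is immutable' % new_policy]
    return []
```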

Description of problem:

While configuring a 4.12.0 dual-stack bare-metal cluster, ovs-configuration.service fails:
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: Attempt 10 to bring up connection ovs-if-phys1
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + nmcli conn up ovs-if-phys1
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[26588]: Error: Connection activation failed: No suitable device found for this connection (device eno1np0 not available because profile is not compatible with device (mismatching interface name)).
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + s=4
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + sleep 5
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + '[' 4 -eq 0 ']'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + false
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + echo 'ERROR: Cannot bring up connection ovs-if-phys1 after 10 attempts'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: ERROR: Cannot bring up connection ovs-if-phys1 after 10 attempts
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + return 4
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + handle_exit
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + e=4
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + '[' 4 -eq 0 ']'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + echo 'ERROR: configure-ovs exited with error: 4'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: ERROR: configure-ovs exited with error: 4

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

So far 100%

Steps to Reproduce:

1. Deploy a dual-stack bare-metal cluster with bonded interfaces (configured with MC and not NMState within install-config.yaml)
2. Run migration to second interface, part of machine config
      - contents:
          source: data:text/plain;charset=utf-8,bond0.117
        filesystem: root
        mode: 420
        path: /etc/ovnk/extra_bridge
3. Install operators:
* kubevirt-hyperconverged
* sriov-network-operator
* cluster-logging
* elasticsearch-operator
4. Start applying node-tuning profiles
5. During node reboots ovs-configuration service fails

Actual results:

ovs-configuration service fails on some nodes resulting in ovnkube-node-* pods failure
oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS          AGE
ovnkube-master-dvgx7   6/6     Running            8                 16h
ovnkube-master-vs7mp   6/6     Running            6                 16h
ovnkube-master-zrm4c   6/6     Running            6                 16h
ovnkube-node-2g8mb     4/5     CrashLoopBackOff   175 (3m48s ago)   16h
ovnkube-node-bfbcc     4/5     CrashLoopBackOff   176 (64s ago)     16h
ovnkube-node-cj6vf     5/5     Running            5                 16h
ovnkube-node-f92rm     5/5     Running            5                 16h
ovnkube-node-nmjpn     5/5     Running            5                 16h
ovnkube-node-pfv5z     4/5     CrashLoopBackOff   163 (4m53s ago)   15h
ovnkube-node-z5vf9     5/5     Running            10                15h

Expected results:

ovs-configuration service succeeds on all nodes

Additional info:


Description of problem:

When setting allowedRegistries as in the example below, the openshift-samples operator is degraded:

oc get image.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-12-16T15:48:20Z"
  generation: 2
  name: cluster
  resourceVersion: "422284920"
  uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5
spec:
  additionalTrustedCA:
    name: registry-ca
  allowedRegistriesForImport:
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry.redhat.io/redhat/redhat-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/redhat-marketplace-index
    insecure: true
  - domainName: registry.redhat.io/redhat/certified-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/community-operator-index
    insecure: true
  registrySources:
    allowedRegistries:
    - quay.io
    - registry.redhat.io
    - registry.rijksapps.nl
    - registry.access.redhat.com
    - registry.redhat.io/redhat/redhat-operator-index
    - registry.redhat.io/redhat/redhat-marketplace-index
    - registry.redhat.io/redhat/certified-operator-index
    - registry.redhat.io/redhat/community-operator-index


oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.21   True        False         False      5d13h   
baremetal                                  4.10.21   True        False         False      450d    
cloud-controller-manager                   4.10.21   True        False         False      94d     
cloud-credential                           4.10.21   True        False         False      624d    
cluster-autoscaler                         4.10.21   True        False         False      624d    
config-operator                            4.10.21   True        False         False      624d    
console                                    4.10.21   True        False         False      42d     
csi-snapshot-controller                    4.10.21   True        False         False      31d     
dns                                        4.10.21   True        False         False      217d    
etcd                                       4.10.21   True        False         False      624d    
image-registry                             4.10.21   True        False         False      94d     
ingress                                    4.10.21   True        False         False      94d     
insights                                   4.10.21   True        False         False      104s    
kube-apiserver                             4.10.21   True        False         False      624d    
kube-controller-manager                    4.10.21   True        False         False      624d    
kube-scheduler                             4.10.21   True        False         False      624d    
kube-storage-version-migrator              4.10.21   True        False         False      31d     
machine-api                                4.10.21   True        False         False      624d    
machine-approver                           4.10.21   True        False         False      624d    
machine-config                             4.10.21   True        False         False      17d     
marketplace                                4.10.21   True        False         False      258d    
monitoring                                 4.10.21   True        False         False      161d    
network                                    4.10.21   True        False         False      624d    
node-tuning                                4.10.21   True        False         False      31d     
openshift-apiserver                        4.10.21   True        False         False      42d     
openshift-controller-manager               4.10.21   True        False         False      22d     
openshift-samples                          4.10.21   True        True          True       31d     Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
operator-lifecycle-manager                 4.10.21   True        False         False      624d    
operator-lifecycle-manager-catalog         4.10.21   True        False         False      624d    
operator-lifecycle-manager-packageserver   4.10.21   True        False         False      31d     
service-ca                                 4.10.21   True        False         False      624d    
storage                                    4.10.21   True        False         False      113d  


After applying the fix described in https://access.redhat.com/solutions/6547281, the issue is resolved:
oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}'

But according to the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=2027745), this should be fixed in 4.10.3, yet the issue still occurs in our 4.10.21 cluster:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         31d     Error while reconciling 4.10.21: the cluster operator openshift-samples is degraded

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
The OpenShift installer hits an error when a failureDomain in install-config.yaml is missing its topology section, like this:

    - name: us-east-1
      region: us-east
      zone: us-east-1a
    - name: us-east-2
      region: us-east
      zone: us-east-2a
      topology:
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        networks:
        - ci-segment-154
        datastore: workload_share_vcsmdcncworkload2_vyC6a

Version-Release number of selected component (if applicable):

Build from latest master (4.12)

How reproducible:

Each time

Steps to Reproduce:

1. Create install-config.yaml for vsphere multi-zone
2. Leave out a topology section (under failureDomains)
3. Attempt to create cluster

Actual results:

FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.vsphere.failureDomains.topology.resourcePool: Invalid value: "//Resources": resource pool '//Resources' not found 

Expected results:

Validation of topology before attempting to create any resources
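
A minimal sketch of such an up-front check follows. It assumes the failureDomains list has already been parsed from install-config.yaml into Python dictionaries. The required topology fields are taken from the example above and may not match the installer's real rules, and validate_failure_domains is a hypothetical helper, not installer code.

```python
# Topology fields treated as required; an assumption based on the example above.
REQUIRED_TOPOLOGY_FIELDS = ("computeCluster", "datastore", "networks")


def validate_failure_domains(failure_domains):
    """Return a list of error strings; an empty list means the input passed."""
    errors = []
    for index, domain in enumerate(failure_domains):
        path = "platform.vsphere.failureDomains[%d].topology" % index
        topology = domain.get("topology")
        if not topology:
            errors.append("%s: Required value: topology section is missing" % path)
            continue
        for field in REQUIRED_TOPOLOGY_FIELDS:
            if not topology.get(field):
                errors.append("%s.%s: Required value" % (path, field))
    return errors
```

Running a check of this kind before any provisioning step would report the missing section by name instead of failing later on a derived resource pool path.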

Description of problem:

The Operator recommended Namespace is incorrect after changing the installation mode to "A specific namespace on the cluster".
When both annotations operatorframework.io/suggested-namespace-template and operatorframework.io/suggested-namespace are defined in the CSV, the Operator recommended Namespace should use the value defined in suggested-namespace-template.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-12-05-155739

How reproducible:

Always

Steps to Reproduce:

1. Create Catalog source
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test
  namespace: openshift-marketplace
spec:
  displayName: test
  image: 'quay.io/xiyuzhao/flux-operator-index:latest'   
  sourceType: grpc
2. Go to page /operatorhub/subscribe?pkg=flux&catalog=test55684&catalogNamespace=openshift-marketplace&targetNamespace=openshift-marketplace 
3. Change Installation mode to "A specific namespace on the cluster"
4. Check if the Operator recommended Namespace is the same value (testxi3210) defined in operatorframework.io/suggested-namespace-template

Actual results:

The Operator recommended Namespace uses the value defined in operatorframework.io/suggested-namespace

Expected results:

The value should come from operatorframework.io/suggested-namespace-template, not from operatorframework.io/suggested-namespace

Additional info:

CSV definition on annotation section:
    operatorframework.io/suggested-namespace-template: >-
      {"kind":"Namespace","apiVersion":"v1","metadata":{"name":"testxi3210","labels":{"foo":"testxi3120"},"annotations":{"openshift.io/node-selector":"","baz":"testxi3120"}},"spec":{"finalizers":["kubernetes"]}}
    operatorframework.io/suggested-namespace: flux-system

Based on user story: https://issues.redhat.com/browse/CONSOLE-3120
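
The expected precedence can be sketched as follows. This is an illustration only, not the console's implementation. It assumes the CSV annotations are available as a plain dictionary, and recommended_namespace is a hypothetical helper name.

```python
import json

TEMPLATE_KEY = "operatorframework.io/suggested-namespace-template"
NAME_KEY = "operatorframework.io/suggested-namespace"


def recommended_namespace(annotations):
    """Prefer the name from the namespace template, else the plain annotation."""
    template = annotations.get(TEMPLATE_KEY)
    if template:
        try:
            parsed = json.loads(template)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            name = (parsed.get("metadata") or {}).get("name")
            if name:
                return name
    # Fall back when the template is absent, malformed, or has no name.
    return annotations.get(NAME_KEY)
```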

This is a clone of issue OCPBUGS-9949. The following is the description of the original issue:

Description of problem:

When creating an image for arm, i.e. using:
  architecture: arm64

and running
$ ./bin/openshift-install agent create image --dir ./cluster-manifests/ --log-level debug

the output indicates that the correct base ISO was extracted from the release:
INFO Extracting base ISO from release payload     
DEBUG Using mirror configuration                   
DEBUG Fetching image from OCP release (oc adm release info --image-for=machine-os-images --insecure=true --icsp-file=/tmp/icsp-file347546417 registry.ci.openshift.org/origin/release:4.13) 
DEBUG extracting /coreos/coreos-aarch64.iso to /home/bfournie/.cache/agent/image_cache, oc image extract --path /coreos/coreos-aarch64.iso:/home/bfournie/.cache/agent/image_cache --confirm --icsp-file=/tmp/icsp-file3609464443 registry.ci.openshift.org/origin/4.13-2023-03-09-142410@sha256:e3c4445cabe16ca08c5b874b7a7c9d378151eb825bacc90e240cfba9339a828c 
INFO Base ISO obtained from release and cached at /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso 
DEBUG Extracted base ISO image /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso from release payload 

In fact, the ISO was not extracted from the release image and the command failed:
ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors 
FATAL failed to fetch Agent Installer ISO: failed to generate asset "Agent Installer ISO": provided device /home/bfournie/.cache/agent/image_cache/coreos-aarch64.iso does not exist
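
The misleading messages could be avoided by confirming that the extracted file exists before reporting success. The sketch below shows that idea only. It is not the installer's code (the installer is written in Go), and confirm_base_iso and its logger name are hypothetical.

```python
import logging
import os


def confirm_base_iso(iso_path, logger=None):
    """Log success only if the extracted ISO is a non-empty file on disk."""
    logger = logger or logging.getLogger("agent-image")
    if not os.path.isfile(iso_path) or os.path.getsize(iso_path) == 0:
        logger.debug("Base ISO %s was not extracted from the release payload", iso_path)
        return False
    logger.info("Base ISO obtained from release and cached at %s", iso_path)
    return True
```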

Version-Release number of selected component (if applicable):

4.13

How reproducible:

every time

Steps to Reproduce:

1. Set architecture: arm64  for all hosts in install-config.yaml 
2. Run the openshift-install command as above
3. Observe the log messages and that the command fails

Actual results:

Invalid messages are logged and command fails

Expected results:

Command succeeds

Additional info:

 

This is a clone of issue OCPBUGS-10148. The following is the description of the original issue:

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/61

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-8310. The following is the description of the original issue:

Bump to pick up fixes.

Description of problem:

The Machine API provider for Azure sets the MachineConfig.ObjectMeta.Name to the cluster name. The value of this field was never actually used anywhere, but was mistakenly brought across in a refactor of the machine scope.

It causes a diff between the defaulted machines and the machines once the actuator has seen them, which in turn is causing issues with CPMS.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create a machine with providerSpec.metadata.name unset
2. 
3.

Actual results:

Name gets populated to cluster name

Expected results:

Name should not be populated

Additional info:

 

Description of the problem:

In integration [BE version - master code], hosts were deregistered in the middle of cluster installation, which caused a huge number of events to be sent.

How reproducible:

100%

Steps to reproduce:

1. Start cluster installation

2. After 2 master nodes joined, send delete cluster using API (curl -X DELETE https://api.integration.openshift.com/api/assisted-install/v2/clusters/be462ad2-7b7e-4549-be2d-1c591da0fa6d --header "Authorization: Bearer $(ocm token)" -H "Content-Type: application/json"

{"code":"404","href":"","id":404,"kind":"Error","reason":"cluster be462ad2-7b7e-4549-be2d-1c591da0fa6d can not be removed while being installed"}

)

3. Hosts are deregistered and cluster in error

Actual results:

 

Expected results:
Hosts should not be able to deregister during install; the number of events is too high.

This is a clone of issue OCPBUGS-12435. The following is the description of the original issue:

Description of problem:

If the user specifies a DNS name in an EgressNetworkPolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected but treats this as a failure.

Version-Release number of selected component (if applicable):

4.11 (originally reproduced on 4.9)

How reproducible:

Always

Steps to Reproduce:

1. Set up an EgressNetworkPolicy that points to a domain where a truncated response is returned while querying via UDP.
2.
3.

Actual results:

Error, DNS resolution not completed.

Expected results:

Request retried via TCP and succeeded.

Additional info:

In comments.
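
The expected fallback can be sketched as follows. A DNS response is truncated when the TC bit (0x0200) is set in the header flags field, per RFC 1035. The send_udp and send_tcp callables are hypothetical stand-ins for real transports, and this is an illustration of the technique, not the openshift-sdn code.

```python
import struct

TC_FLAG = 0x0200  # truncation bit in the 16-bit DNS header flags field


def is_truncated(response):
    """Return True if the raw DNS message has the TC bit set."""
    if len(response) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    _ident, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def resolve(query, send_udp, send_tcp):
    """Query over UDP first and retry over TCP if the answer was truncated."""
    response = send_udp(query)
    if is_truncated(response):
        response = send_tcp(query)
    return response
```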

Description of the problem:

 disk-encryption-requirements-satisfied pending message is uninformative due to missing backend logic

How reproducible:

 Unknown

Steps to reproduce:

1. See MGMT-13461

Actual results:

 Validation pending message says "Unexpected status pending"

Expected results:

 Validation pending message should be more verbose and not this weird internal error thing. Something along the lines of "Waiting for Tang connectivity check to complete"

This is a clone of issue OCPBUGS-11450. The following is the description of the original issue:

Description of problem:

When CNO is managed by Hypershift, its deployment has the "hypershift.openshift.io/release-image" template metadata annotation. The annotation's value is used to track progress of cluster control plane version upgrades. But the multus-admission-controller created and managed by CNO does not have that annotation, so service providers are not able to track its version upgrades.

The proposed solution is for CNO to propagate its "hypershift.openshift.io/release-image" annotation down to the multus-admission-controller deployment. For that, CNO needs to have "get" access to its own deployment manifest to be able to read the deployment template metadata annotations.

Hypershift needs a code change to assign CNO "get" permission on the CNO deployment object.
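
The propagation step can be sketched on plain dictionaries shaped like Deployment manifests. This is an illustration under that assumption, not CNO code (CNO is written in Go), and propagate_release_image is a hypothetical helper.

```python
RELEASE_IMAGE_ANNOTATION = "hypershift.openshift.io/release-image"


def propagate_release_image(source_deployment, target_deployment):
    """Copy the release-image pod template annotation; return True if target changed."""
    source_meta = source_deployment.get("spec", {}).get("template", {}).get("metadata", {})
    value = (source_meta.get("annotations") or {}).get(RELEASE_IMAGE_ANNOTATION)
    if value is None:
        # Nothing to propagate, for example when not managed by Hypershift.
        return False
    target_meta = (
        target_deployment.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
    )
    annotations = target_meta.get("annotations")
    if annotations is None:
        annotations = {}
        target_meta["annotations"] = annotations
    changed = annotations.get(RELEASE_IMAGE_ANNOTATION) != value
    annotations[RELEASE_IMAGE_ANNOTATION] = value
    return changed
```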

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster using Hypershift
2. Check the deployment template metadata annotations on multus-admission-controller

Actual results:

No "hypershift.openshift.io/release-image" deployment template metadata annotation exists 

Expected results:

"hypershift.openshift.io/release-image" annotation must be present

Additional info:

 

Description of problem:

When creating services in an OVN-HybridOverlay cluster with Windows workers, we are experiencing intermittent reachability issues for the external IP when the number of pods from the exposed deployment is greater than 1:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
win-webserver   LoadBalancer   172.30.38.192   34.136.170.199   80:30246/TCP   41m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get deploy -n winc-38186 
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
win-webserver   6/6     6            6           42m

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get pods -n winc-38186 
NAME                             READY   STATUS    RESTARTS   AGE
win-webserver-597fb4c9cc-8ccwg   1/1     Running   0          6s
win-webserver-597fb4c9cc-f54x5   1/1     Running   0          6s
win-webserver-597fb4c9cc-jppxb   1/1     Running   0          97s
win-webserver-597fb4c9cc-twn9b   1/1     Running   0          6s
win-webserver-597fb4c9cc-x5rfr   1/1     Running   0          6s
win-webserver-597fb4c9cc-z8sfv   1/1     Running   0          6s

[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out
[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out

When having a look at the Load Balancer service, we can see that the externalTrafficPolicy is of type "Cluster":

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 win-webserver -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-11-25T13:29:00Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app: win-webserver
  name: win-webserver
  namespace: winc-38186
  resourceVersion: "169364"
  uid: 4a229123-ee88-47b6-99ce-814522803ad8
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.38.192
  clusterIPs:
  - 172.30.38.192
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - nodePort: 30246
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: win-webserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 34.136.170.199


Recreating the Service with externalTrafficPolicy set to Local seems to solve the issue:

$ oc describe svc win-webserver -n winc-38186
Name:                     win-webserver
Namespace:                winc-38186
Labels:                   app=win-webserver
Annotations:              <none>
Selector:                 app=win-webserver
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.38.192
IPs:                      172.30.38.192
LoadBalancer Ingress:     34.136.170.199
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30246/TCP
Endpoints:                10.132.0.18:80,10.132.0.19:80,10.132.0.20:80 + 3 more...
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                 Age                 From                Message
  ----    ------                 ----                ----                -------
  Normal  ExternalTrafficPolicy  66m                 service-controller  Cluster -> Local
  Normal  EnsuringLoadBalancer   63m (x3 over 113m)  service-controller  Ensuring load balancer
  Normal  ExternalTrafficPolicy  63m                 service-controller  Local -> Cluster
  Normal  EnsuredLoadBalancer    62m (x3 over 113m)  service-controller  Ensured load balancer 

$ oc get svc -n winc-test
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)          AGE
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87   8080:30715/TCP   152m
win-check         LoadBalancer   172.30.50.151   35.194.12.34   80:31725/TCP     4m33s
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1   80:30409/TCP     152m
[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
<html><body><H1>Windows Container Web Server</H1></body></html>[cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34

While the other service which has externalTrafficPolicy set to "Cluster" is still failing:

[cloud-user@preserve-jfrancoa tmp]$ curl 35.226.129.1
curl: (7) Failed to connect to 35.226.129.1 port 80: Connection timed out

 

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-24-203151   True        False         7h2m    Cluster version is 4.12.0-0.nightly-2022-11-24-203151


$ oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2022-11-25T06:56:50Z"
  generation: 2
  name: cluster
  resourceVersion: "2952"
  uid: e9ad729c-36a4-4e71-9a24-740352b11234
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1360
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16

How reproducible:

Always. Sometimes it takes more curl calls to the external IP, but it always ends up timing out.
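
Because the failure is intermittent, repeating the request and counting failures makes it easier to reproduce than single curl calls. The helper below is a sketch using only the Python standard library. fetch_once and count_failures are hypothetical names, and the URL in the usage note is just the external IP from the example above.

```python
import urllib.request


def fetch_once(url, timeout=5.0):
    """Return True if one HTTP GET to url succeeds within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, refused connections and timeouts all derive from OSError.
        return False


def count_failures(url, attempts=20, fetch=fetch_once):
    """Call fetch(url) `attempts` times and return how many calls failed."""
    return sum(1 for _ in range(attempts) if not fetch(url))
```

For example, count_failures("http://34.136.170.199/") would return a non-zero count on an affected cluster.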

Steps to Reproduce:

1. Deploy a Windows cluster with OVN-Hybrid overlay on GCP, the following Jenkins job can be used for it: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/158926/
2. Create a deployment and a service, for example:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  #externalTrafficPolicy: Local
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: win-check
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-check
  name: win-check
  namespace: winc-test
spec:
  replicas: 6
  selector:
    matchLabels:
      app: win-check
  template:
    metadata:
      labels:
        app: win-check
      name: win-check
    spec:
      containers:
      - command:
        - pwsh.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/');
          $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening)
          { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows
          Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content);
          $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer,
          0, $buffer.Length); $response.Close(); };
        image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
        name: win-check
        securityContext:
          runAsNonRoot: false
          windowsOptions:
            runAsUserName: ContainerAdministrator
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: os
        value: Windows
3. Get the external IP for the service:
$ oc get svc -n winc-test                                                   
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE                                            
linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87     8080:30715/TCP   94m                                            
win-check         LoadBalancer   172.30.82.251   35.239.175.209   80:30530/TCP     29s                                            
win-webserver     LoadBalancer   172.30.15.95    35.226.129.1     80:30409/TCP     94m

4. Try to curl the external IP:
$ curl 35.239.175.209
curl: (7) Failed to connect to 35.239.175.209 port 80: Connection timed out

 

Actual results:

The Load Balancer IP is not reachable, which impacts service availability

Expected results:

The Load Balancer IP is available at all times

Additional info:

 

This is a clone of issue OCPBUGS-8530. The following is the description of the original issue:

Description of problem:

The e2e-nutanix test run failed at the bootstrap stage when testing the PR https://github.com/openshift/cloud-provider-nutanix/pull/7. The bootstrap failure could be reproduced manually by creating a Nutanix OCP cluster with the latest nutanix-ccm image.

time="2023-03-06T12:25:56-05:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2023-03-06T12:25:56-05:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
time="2023-03-06T12:25:56-05:00" level=warning msg="The bootstrap machine is unable to resolve API and/or API-Int Server URLs" 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

From the PR https://github.com/openshift/cloud-provider-nutanix/pull/7, trigger the e2e-nutanix test. The test will fail at bootstrap stage with the described errors.

Actual results:

The e2e-nutanix test run failed at bootstrapping with the errors: 

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

Expected results:

The e2e-nutanix test will pass

Additional info:

Investigation showed that the root cause was that the Nutanix cloud-controller-manager pod did not have permission to get/list ConfigMap resources. The error logs from the Nutanix cloud-controller-manager pod:

E0307 16:08:31.753165       1 reflector.go:140] pkg/provider/client.go:124: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope
I0307 16:09:30.050507       1 reflector.go:257] Listing and watching *v1.ConfigMap from pkg/provider/client.go:124
W0307 16:09:30.052278       1 reflector.go:424] pkg/provider/client.go:124: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope
E0307 16:09:30.052308       1 reflector.go:140] pkg/provider/client.go:124: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" cannot list resource "configmaps" in API group "" at the cluster scope 
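
The denied permission can be read straight out of log lines like the ones above. The following is a minimal sketch, assuming the message keeps the wording shown in this report; the helper name `parse_forbidden` is hypothetical and not part of any OpenShift tooling.

```python
import re

# Matches the RBAC "forbidden" wording seen in the log excerpt above.
# Assumption: the message format is stable across the lines being parsed.
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)


def parse_forbidden(line):
    """Return user, verb, resource and API group from a forbidden log line, or None."""
    match = FORBIDDEN.search(line)
    return match.groupdict() if match else None


# Sample line copied from the report.
sample = (
    'W0307 16:09:30.052278       1 reflector.go:424] pkg/provider/client.go:124: '
    'failed to list *v1.ConfigMap: configmaps is forbidden: User '
    '"system:serviceaccount:openshift-cloud-controller-manager:cloud-controller-manager" '
    'cannot list resource "configmaps" in API group "" at the cluster scope'
)

denied = parse_forbidden(sample)
print(denied)
```

An empty API group in the result corresponds to the core group, which is where ConfigMaps live.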

Description of problem:

The openshift-apiserver, openshift-oauth-apiserver and kube-apiserver pods cannot validate the certificate when trying to reach etcd, and report certificate validation errors:

}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
W1018 11:36:43.523673      15 logging.go:59] [core] [Channel #186 SubChannel #187] grpc: addrConn.createTransport failed to connect to {
  "Addr": "[2620:52:0:198::10]:2379",
  "ServerName": "2620:52:0:198::10",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
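
The x509 error says that the address being dialed is not among the IP addresses the certificate is valid for. A minimal sketch of that comparison, using only the addresses quoted in the error message (the helper name `ip_in_sans` is hypothetical):

```python
import ipaddress


def ip_in_sans(target, sans):
    """Return True if target equals one of the certificate's IP SAN entries."""
    want = ipaddress.ip_address(target)
    return any(ipaddress.ip_address(entry) == want for entry in sans)


# Values taken from the error message in this report.
sans = ["::1", "127.0.0.1", "::1", "fd69::2"]
node_ip = "2620:52:0:198::10"

covered = ip_in_sans(node_ip, sans)
print(covered)  # False: the node address is missing from the certificate
```

Parsing the strings with `ipaddress` rather than comparing them as text avoids false mismatches between different spellings of the same IPv6 address.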

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-18-041406

How reproducible:

100%

Steps to Reproduce:

1. Deploy SNO with single stack IPv6 via ZTP procedure

Actual results:

Deployment times out and some of the operators aren't deployed successfully.

NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-10-18-041406   False       False         True       124m    APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
baremetal                                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      112m    
cloud-controller-manager                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
cloud-credential                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
cluster-autoscaler                         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
config-operator                            4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
console                                                                                                                      
control-plane-machine-set                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
dns                                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
etcd                                       4.12.0-0.nightly-2022-10-18-041406   True        False         True       121m    ClusterMemberControllerDegraded: could not get list of unhealthy members: giving up getting a cached client after 3 tries
image-registry                             4.12.0-0.nightly-2022-10-18-041406   False       True          True       104m    Available: The registry is removed...
ingress                                    4.12.0-0.nightly-2022-10-18-041406   True        True          True       111m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
insights                                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      118s    
kube-apiserver                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      102m    
kube-controller-manager                    4.12.0-0.nightly-2022-10-18-041406   True        False         True       107m    GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp [fd02::3c5f]:9091: connect: connection refused
kube-scheduler                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
kube-storage-version-migrator              4.12.0-0.nightly-2022-10-18-041406   True        False         False      117m    
machine-api                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-approver                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-config                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
marketplace                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      116m    
monitoring                                                                      False       True          True       98m     deleting Thanos Ruler Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, deleting UserWorkload federate Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, reconciling Alertmanager Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io alertmanager-main), reconciling Thanos Querier Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus API Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io prometheus-k8s), prometheuses.monitoring.coreos.com "k8s" not found
network                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
node-tuning                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
openshift-apiserver                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      104m    
openshift-controller-manager               4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
openshift-samples                                                               False       True          False      103m    The error the server was unable to return a response in the time allotted, but may still be processing the request (get imagestreams.image.openshift.io) during openshift namespace cleanup has left the samples in an unknown state
operator-lifecycle-manager                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-10-18-041406   True        False         False      106m    
service-ca                                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
storage                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m  

Expected results:

Deployment succeeds without issues.

Additional info:

I was unable to run must-gather, so I am attaching the pod logs copied from the host file system.

Description of problem:
During the cluster destroy process for IBM Cloud IPI, failures can occur when COS Instances are deleted: Reclamations are created for the COS deletions, and these prevent cleanup of the ResourceGroup.

Version-Release number of selected component (if applicable):
4.13.0 (and 4.12.0)

How reproducible:
Sporadic, it depends on IBM Cloud COS

Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud
2. Delete the IPI cluster on IBM Cloud
3. COS Reclamation may be created, and can cause the destroy cluster to fail

Actual results:

time="2022-12-12T16:50:06Z" level=debug msg="Listing resource groups"
time="2022-12-12T16:50:06Z" level=debug msg="Deleting resource group \"eu-gb-reclaim-1-zc6xg\""
time="2022-12-12T16:50:07Z" level=debug msg="Failed to delete resource group eu-gb-reclaim-1-zc6xg: Resource groups with active or pending reclamation instances can't be deleted. Use the CLI commands \"ibmcloud resource service-instances --type all\" and \"ibmcloud resource reclamations\" to check for remaining instances, then delete the instances and try again."

Expected results:
Successful destroy cluster (including deletion of ResourceGroup)

Additional info:
IBM Cloud is testing a potential fix currently.

It was also found that the destroy stages are not in the proper order.
https://github.com/openshift/installer/blob/9377cb3974986a08b531a5e807fd90a3a4e85ebf/pkg/destroy/ibmcloud/ibmcloud.go#L128-L155

Changes are being made to correct the stage order together with the fix for this bug.

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/187

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:
RWN (day2 worker node) fails to deploy and join its spoke cluster (Spoke cluster does complete deployment).

RWN agent reports:
" state: insufficient
stateInfo: 'Host cannot be installed due to following failing validation(s): Ignition
is not downloadable. Please ensure host connectivity to the cluster''s API'"
Release version:
Issue occurs with OCP 4.11, 4.10 and 4.9 spoke clusters, but only on 4.11 + 2.6 hub clusters. (4.10 / 2.5 and 4.9 / 2.4 do not have this issue; not sure about other combinations, as this is showing up in CI.)

Operator snapshot version:
MCE - 2.1.0-DOWNANDBACK-2022-08-29-21-31-35

OCP version:
Hub - registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-09-02-184920
How reproducible:

100%

Steps to reproduce:
1. Deploy 4.11 hub cluster, 2.6 ACM/2.1 MCE, in ipv6 disconnected proxy env
2. Deploy ipv6 disconnected 4.11 (or 4.10 or 4.9) multinode proxy spoke cluster
3. Try to deploy RWN to join spoke
Actual results:
RWN agent fails with stateInfo: 'Host cannot be installed due to following failing validation(s): Ignition
is not downloadable. Please ensure host connectivity to the cluster''s API'"
Expected results:
RWN agent joins cluster

Originally reported in BZ2124720 but will be tracking this bug going forward.

We want to add the dual-stack tests to the CNI plugin conformance test suite, for the currently supported releases.

(This has no impact on OpenShift itself. We're just modifying a test suite that OCP does not use.)

I have a script that does continuous installs using AGENT_E2E_TEST_SCENARIO=COMPACT_IPV4, just starting a new install after the previous one completes. What I'm seeing is that eventually I end up getting installation failures due to the container-images-available validation failure. What gets logged in wait-for bootstrap-complete is:

level=debug msg=Host master-0: New image status quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6. result: failure. 

level=debug msg=Host master-0: validation 'container-images-available' that used to succeed is now failing
level=debug msg=Host master-0: updated status from preparing-for-installation to preparing-failed (Host failed to prepare for installation due to following failing validation(s): Failed to fetch container images needed for installation from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6. This may be due to a network hiccup. Retry to install again. If this problem persists, check your network settings to make sure you’re not blocked. ; Host couldn't synchronize with any NTP server)

Sometimes the image gets loaded onto the other masters OK and sometimes there are failures with more than one host. In either case the install stalls at this point.

When using a disconnected environment (MIRROR_IMAGES=true) I don't see this occurring.

Containers on host0
[core@master-0 ~]$ sudo podman ps
CONTAINER ID  IMAGE                                                                                                                   COMMAND               CREATED       STATUS           PORTS       NAMES
00a0eebb989c  localhost/podman-pause:4.2.0-1661537366                                                                                                       11 hours ago  Up 11 hours ago              cef65dd7f170-infra
5d0eced94979  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:caa73897dcb9ff6bc00a4165f4170701f4bd41e36bfaf695c00461ec65a8d589  /bin/bash start_d...  11 hours ago  Up 11 hours ago              assisted-db
813bef526094  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:caa73897dcb9ff6bc00a4165f4170701f4bd41e36bfaf695c00461ec65a8d589  /assisted-service     11 hours ago  Up 11 hours ago              service
edde1028a542  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e43558e28be8fbf6fe4529cf9f9beadbacbbba8c570ecf6cb81ae732ec01807f  next_step_runner ...  11 hours ago  Up 11 hours ago              next-step-runner

Some relevant logs from assisted-service for this container image:
time="2022-11-03T01:48:44Z" level=info msg="Submitting step <container-image-availability> id <container-image-availability-b72665b1> to infra_env <17c8b837-0130-4b8c-ad06-19bcd2a61dbf> host <df170326-772b-43b5-87ef-3dfff91ba1a9>  Arguments: <[{\"images\":[\"registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca122ab3a82dfa15d72a05f448c48a7758a2c7b0ecbb39011235bcf0666fbc15\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e52a45b47cd9d70e7378811f4ba763fd43ec2580378822286c7115fbee6ef3a\"],\"timeout\":960}]>" func=github.com/openshift/assisted-service/internal/host/hostcommands.logSteps file="/src/internal/host/hostcommands/instruction_manager.go:285" go-id=841 host_id=df170326-772b-43b5-87ef-3dfff91ba1a9 infra_env_id=17c8b837-0130-4b8c-ad06-19bcd2a61dbf pkg=instructions request_id=47cc221f-4f47-4d0d-8278-c0f5af933567

time="2022-11-03T01:49:35Z" level=error msg="Received step reply <container-image-availability-9788cfa7> from infra-env <17c8b837-0130-4b8c-ad06-19bcd2a61dbf> host <845f1e3c-c286-4d2f-ba92-4c5cab953641> exit-code <2> stderr <> stdout <{\"images\":[

{\"name\":\"registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451\",\"result\":\"success\"}

,{\"download_rate\":159.65409925994226,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca122ab3a82dfa15d72a05f448c48a7758a2c7b0ecbb39011235bcf0666fbc15\",\"result\":\"success\",\"size_bytes\":523130669,\"time\":3.276650405},{\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6\",\"result\":\"failure\"},{\"download_rate\":278.8962416008878,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e52a45b47cd9d70e7378811f4ba763fd43ec2580378822286c7115fbee6ef3a\",\"result\":\"success\",\"size_bytes\":402688178,\"time\":1.443863767}]}>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/src/internal/bminventory/inventory.go:3287" go-id=845 host_id=845f1e3c-c286-4d2f-ba92-4c5cab953641 infra_env_id=17c8b837-0130-4b8c-ad06-19bcd2a61dbf pkg=Inventory request_id=3a571ba6-5175-4bbe-b89a-20cdde30b884                         

time="2022-11-03T01:49:35Z" level=info msg="Adding new image status for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6 with status failure to host 845f1e3c-c286-4d2f-ba92-4c5cab953641" func="github.com/openshift/assisted-service/internal/host.(*Manager).UpdateImageStatus" file="/src/internal/host/host.go:805" pkg=host-state
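
The step reply logged above carries one result per image, so the failing image can be picked out mechanically. A minimal sketch, assuming the stdout payload has already been unescaped into plain JSON; the helper name `failed_images` is hypothetical, and the sample payload is trimmed to two of the four entries from the log:

```python
import json


def failed_images(stdout):
    """Return the names of images whose result is not "success"."""
    reply = json.loads(stdout)
    return [
        image["name"]
        for image in reply.get("images", [])
        if image.get("result") != "success"
    ]


# Two entries copied from the step reply in this report.
stdout = json.dumps({"images": [
    {"name": "registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451",
     "result": "success"},
    {"name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:"
             "0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6",
     "result": "failure"},
]})

failures = failed_images(stdout)
print(failures)
```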

 

Description of problem:
When running node-density on a 120 node cluster, we see some spikes in pod ready latency times. These spikes correspond to a southbound DB compaction. During this compaction time the ovn-controller is not able to connect to a leader.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-03-20-160505

How reproducible:
Always

Steps to Reproduce:
1. Run node-density-light on a 120 node cluster

Actual results:
Pod ready latency spikes which cause the P99 to go up

Expected results:
A steady pod ready latency during the test

Description of the problem:

While installing LVMS + CNV on 4.12, once the finalizing stage is reached, the LVMS operator starts progressing. In the infra logs I see the following:

 Asked operators to be in one of the statuses from ['available', 'failed'] and currently operators statuses are [('cnv', None, None), ('lvm', 'available', 'install strategy completed with no errors')] 

The status of cnv remains None for 40 minutes.

After 40 minutes it switches to "progressing", and a few minutes later cnv is installed.

I also see this in the cluster events:

1/23/2023, 4:51:38 PM	Operator cnv status: available message: install strategy completed with no errors
1/23/2023, 4:50:38 PM	Operator cnv status: progressing message: installing: waiting for deployment virt-operator to become ready: deployment "virt-operator" not available: Deployment does not have minimum availability.
1/23/2023, 4:10:08 PM	Operator lvm status: available message: install strategy completed with no errors

 

How reproducible:

90%

Steps to reproduce:

1. Create an SNO cluster with OCP 4.12

2. Choose the cnv operator

3. Start the installation

Actual results:
Once the cluster reaches the finalizing state, lvms starts progressing and installs within a few minutes. After that, cnv remains in None status for 40 minutes before turning to progressing and then available.
 

Expected results:

cnv should not stay in None status for this long

These commented out tests https://github.com/openshift/origin/blob/master/test/extended/testdata/cmd/test/cmd/templates.sh#L130-L149 are problematic, because they are testing rather important functionality of cross-namespace template processing.

This problem recently escalated after landing k8s 1.25, when there was a suspicion that the new version of kube-apiserver had removed that functionality. We need to bring back this test, as well as similar tests that exercise login functionality. https://github.com/openshift/origin/blob/master/test/extended/testdata/cmd/test/cmd/authentication.sh is another test being skipped for similar reasons.

Based on my search (https://github.com/openshift/origin/blob/master/test/extended/oauth/helpers.go#L18), we could deploy a Basic Auth Provider, i.e. a password-based one, and group all tests relying on this functionality under a single umbrella.

The biggest question to answer is how we can properly deal with multiple IdentityProviders, so I'd suggest reaching out to Auth team for help.

The second problem that was identified is various cloud providers, so we've agreed to run this test initially only on AWS and GCP.

This is a clone of issue OCPBUGS-8044. The following is the description of the original issue:

Description of problem:

rhbz#2089199, backported to 4.11.5, shifted the etcd Grafana dashboard from the monitoring operator to the etcd operator. During the shift, the ConfigMap was renamed from grafana-dashboard-etcd to etcd-dashboard. However, we did not include logic for garbage-collecting the obsolete dashboard, so clusters that update from 4.11.1 and similar into 4.11.>=5 or 4.12+ currently end up with both the obsolete and new ConfigMaps. We should grow code to remove the obsolete ConfigMap.

Version-Release number of selected component (if applicable):

4.11.>=5 and 4.12+ are currently exposed.

How reproducible:

100%

Steps to Reproduce:

1. Install 4.11.1.
2. Update to a release that defines the etcd-dashboard ConfigMap.
3. Check for etcd dashboards with oc -n openshift-config-managed get configmaps | grep etcd.

Actual results:

Both etcd-dashboard and grafana-dashboard-etcd exist:

$ oc -n openshift-config-managed get configmaps | grep etcd
etcd-dashboard                                        1      196d
grafana-dashboard-etcd                                1      2y282d

Another example is 4.11.1 to 4.11.5 CI:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1570415394001260544/artifacts/e2e-aws-upgrade/configmaps.json | jq -r '.items[].metadata | select(.namespace == "openshift-config-managed" and (.name | contains("etcd"))) | .name'
etcd-dashboard
grafana-dashboard-etcd
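
The jq filter above can also be expressed in a few lines of Python when jq is not available. This is a sketch under the assumption that the input is a ConfigMap list document with an `items` array, as the jq expression implies; the helper name `etcd_configmaps` and the small sample document are hypothetical stand-ins for the real `configmaps.json` artifact.

```python
def etcd_configmaps(doc, namespace="openshift-config-managed"):
    """Return names of ConfigMaps in the namespace whose name contains "etcd"."""
    names = []
    for item in doc.get("items", []):
        meta = item.get("metadata", {})
        if meta.get("namespace") == namespace and "etcd" in meta.get("name", ""):
            names.append(meta["name"])
    return names


# Hypothetical, heavily trimmed stand-in for the CI artifact.
doc = {"items": [
    {"metadata": {"namespace": "openshift-config-managed", "name": "etcd-dashboard"}},
    {"metadata": {"namespace": "openshift-config-managed", "name": "grafana-dashboard-etcd"}},
    {"metadata": {"namespace": "openshift-config-managed", "name": "console-public"}},
    {"metadata": {"namespace": "openshift-etcd", "name": "etcd-pod"}},
]}

found = etcd_configmaps(doc)
print(found)
```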

Expected results:

Only etcd-dashboard still exists.

Additional info:

A new manifest for the outgoing ConfigMap that sets the release.openshift.io/delete: "true" annotation would ask the cluster-version operator to reap the obsolete ConfigMap.
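
As a rough illustration of the manifest described above, the sketch below builds a ConfigMap object carrying the delete annotation and prints it as JSON. It assumes only the name, namespace and annotation given in this report; the helper name `delete_manifest` is hypothetical, and a real release manifest would normally be written as YAML and may need further annotations that are not shown here.

```python
import json


def delete_manifest(name, namespace):
    """Build a ConfigMap manifest that asks the cluster-version operator to delete it."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Annotation value is the string "true", as quoted in the report.
            "annotations": {"release.openshift.io/delete": "true"},
        },
    }


manifest = delete_manifest("grafana-dashboard-etcd", "openshift-config-managed")
print(json.dumps(manifest, indent=2))
```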

Description of problem:

When a user tries `Download kubeconfig file` for a ServiceAccount, the console shows `Error: Unable to get ServiceAccount token`.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-07-064924

How reproducible:

Always

Steps to Reproduce:

1. Log in to the OCP console as a normal user
2. Create a new project, in which the 'builder', 'deployer' and 'default' serviceaccounts are created by default
3. Click the kebab action `Download kubeconfig file` for any serviceaccount
4. Go to the serviceaccount details page and click 'Actions -> Download kubeconfig file'

Actual results:

`Error: Unable to get ServiceAccount token` is shown for both step 3 and step 4

Expected results:

The user should be able to download the serviceaccount kubeconfig file successfully

Additional info:

 

 

This is a clone of issue OCPBUGS-14082. The following is the description of the original issue:

Description of problem:

Since `registry.centos.org` was shut down, all the unit tests in oc that rely on this registry have started failing.

Version-Release number of selected component (if applicable):

all versions

How reproducible:

trigger CI jobs and see unit tests are failing

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

 

The console is currently buildable only by the ART team tooling (Dockerfile.product) and by using a multi-stage Dockerfile that currently builds only for amd64.

The Dockerfile.product should be able to build (1) without the ART tooling (i.e., on Prow) and (2) on architectures other than x86.

In particular, this causes any arm64 build of OKD to fail.

Description of problem:

Descriptions of parameters are not shown on the pipelinerun description page

Version-Release number of selected component (if applicable):

Openshift Pipelines 1.9.0
OCP 4.12

How reproducible:

Always

Steps to Reproduce:

1. Create pipeline with parameters and add description to the params
2. Start the pipeline and navigate to created pipelinerun
3. Select the Parameters tab and check the description of the params

Actual results:

The description fields of the params are empty

Expected results:

Description of the params should be present

Additional info:

 

Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/493

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We want to start the controller as soon as possible after reboot, to minimize the time during which we don't know what is going on. This will allow the controller to be started by the time the kubelet joins, so that we can send the joined stage to the service and wait for the node to become ready while the controller is already running.

In order to start running while the control plane kube-api does not exist yet, the controller is going to use the bootstrap kubeconfig, as it is part of the filesystem.

Description of problem:
Unnecessary react warning:

Warning: Each child in a list should have a unique "key" prop.

Check the render method of `NavSection`. See https://reactjs.org/link/warning-keys for more information.
NavItemHref@http://localhost:9012/static/main-785e94355aeacc12c321.js:5141:88
NavSection@http://localhost:9012/static/main-785e94355aeacc12c321.js:5294:20
PluginNavItem@http://localhost:9012/static/main-785e94355aeacc12c321.js:5582:23
div
PerspectiveNav@http://localhost:9012/static/main-785e94355aeacc12c321.js:5398:134

Version-Release number of selected component (if applicable):
4.11 was fine
4.12 and 4.13 (master) show this warning

How reproducible:
Always

Steps to Reproduce:
1. Open browser log
2. Open web console

Actual results:
React warning

Expected results:
No React warning

Description of problem: When visiting the Terminal tab of a Node details page, an error is displayed instead of the terminal

Steps to Reproduce:
1. Go to the Terminal tab of a Node details page (e.g., /k8s/cluster/nodes/ip-10-0-129-13.ec2.internal/terminal)
2. Note the error alert that appears on the page instead of the terminal.

Description of problem:

While viewing resource consumption for a specific pod, several graphs are stacked that should not be. For example, cpu/memory limits are a static value and thus should be a static line across a graph. However, when viewing the Kubernetes / Compute Resources / Pod Dashboard, I see limits are stacked above the usage. This applies to both the CPU and Memory Usage graphs on this dashboard. When viewing the graph via inspect, the visualization seems "fixed".

Version-Release number of selected component (if applicable):

OCP 4.11.19

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

In the current version, 4.12, the OpenShift console cannot mix stacked metrics with unstacked metrics on the same chart.
The fix is to unstack metrics on charts having some limit markers such as request, limit, etc.
 

This is a clone of issue OCPBUGS-10152. The following is the description of the original issue:

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/62

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Dynamic plugin extensions disappear from the UI when an error is encountered loading a code reference.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

 

Steps to Reproduce:

1. A dynamic plugin is installed, enabled, and provides an extension where a codeRef fails to load. 
2. An interaction occurs with the extension, and the console attempts to reference the extension codeRef. 
3. The extension codeRef fails to load  

Actual results:

The extension component disappears from the console. The component will not re-appear unless the plugin is manually re-enabled, which starts this cycle over.

Expected results:

This can probably be easily reproduced by adding an intentionally "broken" extension to the demo dynamic plugin, then following the steps above 

Additional info:

There are at least 2 related bugs: 
https://bugzilla.redhat.com/show_bug.cgi?id=2071690 https://bugzilla.redhat.com/show_bug.cgi?id=2072965

Migrated from https://bugzilla.redhat.com/show_bug.cgi?id=2111629

Description of problem:

When scaling down the machineSet for worker nodes, a PV (vmdk) file was deleted.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

N/A

Steps to Reproduce:

1. Scale down worker nodes
2. Check the VMware logs: the VM is deleted with the vmdk still attached

Actual results:

After scaling down nodes, volumes still attached to the VM get deleted alongside the VM

Expected results:

Worker nodes scaled down without any accidental deletion

Additional info:

 

Description of problem:

The cloud controller manager fails to remove the uninitialized taint if it can't reconcile tag categories

Version-Release number of selected component (if applicable):

4.12.0, multi-zone installation

How reproducible:

consistently

Steps to Reproduce:

1. Perform a multi-zone installation in an environment without openshift-region/openshift-zone tag categories
2. Installation will fail with nodes NotReady
3.

Actual results:

The installer attempts the installation, which fails

Expected results:

The installer should report the missing tag categories

Additional info:

 

Description of problem: Installing OCP 4.12 on top of OpenStack 16.1 following the multi-availabilityZone installation creates a cluster where the egressIP annotations ("cloud.network.openshift.io/egress-ipconfig") are created with an empty value for the workers:

$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ostest-kncvv-master-0         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-master-1         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-master-2         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-worker-0-qxr5g   Ready    worker                 8h    v1.25.4+86bd4ff
ostest-kncvv-worker-1-bmvvv   Ready    worker                 8h    v1.25.4+86bd4ff
ostest-kncvv-worker-2-pbgww   Ready    worker                 8h    v1.25.4+86bd4ff
$ oc get node ostest-kncvv-worker-0-qxr5g -o json | jq -r '.metadata.annotations' 
{
  "alpha.kubernetes.io/provided-node-ip": "10.196.2.156",
  "cloud.network.openshift.io/egress-ipconfig": "null",
  "csi.volume.kubernetes.io/nodeid": "{\"cinder.csi.openstack.org\":\"8327aef0-c6a7-4bf6-8f8f-d25c9abd9bce\",\"manila.csi.openstack.org\":\"ostest-kncvv-worker-0-qxr5g\"}",
  "k8s.ovn.org/host-addresses": "[\"10.196.2.156\",\"172.17.5.154\"]",
  "k8s.ovn.org/l3-gateway-config": "{\"default\":{\"mode\":\"shared\",\"interface-id\":\"br-ex_ostest-kncvv-worker-0-qxr5g\",\"mac-address\":\"fa:16:3e:7e:b5:70\",\"ip-addresses\":[\"10.196.2.156/16\"],\"ip-address\":\"10.196.2.156/16\",\"next-hops\":[\"10.196.0.1\"],\"next-hop\":\"10.196.0.1\",\"node-port-enable\":\"true\",\"vlan-id\":\"0\"}}",
  "k8s.ovn.org/node-chassis-id": "fd777b73-aa64-4fa5-b0b1-70c3bebc2ac6",
  "k8s.ovn.org/node-gateway-router-lrp-ifaddr": "{\"ipv4\":\"100.64.0.6/16\"}",
  "k8s.ovn.org/node-mgmt-port-mac-address": "42:e8:4f:42:9f:7d",
  "k8s.ovn.org/node-primary-ifaddr": "{\"ipv4\":\"10.196.2.156/16\"}",
  "k8s.ovn.org/node-subnets": "{\"default\":\"10.128.2.0/23\"}",
  "machine.openshift.io/machine": "openshift-machine-api/ostest-kncvv-worker-0-qxr5g",
  "machineconfiguration.openshift.io/controlPlaneTopology": "HighlyAvailable",
  "machineconfiguration.openshift.io/currentConfig": "rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/desiredConfig": "rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/desiredDrain": "uncordon-rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/lastAppliedDrain": "uncordon-rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/reason": "",
  "machineconfiguration.openshift.io/state": "Done",
  "volumes.kubernetes.io/controller-managed-attach-detach": "true"
}

Furthermore, the following is observed on openshift-cloud-network-config-controller:

$ oc logs -n openshift-cloud-network-config-controller          cloud-network-config-controller-5fcdb6fcff-6sddj | grep egress
I1212 00:34:14.498298       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-2-pbgww
I1212 00:34:15.777129       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-0-qxr5g
I1212 00:38:13.115115       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-1-bmvvv
I1212 01:58:54.414916       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-0-drd5l
I1212 02:01:03.312655       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-1-h976w
I1212 02:04:11.656408       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-2-zxwrv
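
A quick way to list the affected nodes is to scan the parsed output of `oc get nodes -o json` for the annotation. The sketch below assumes that input shape. The helper name `nodes_without_egress_config` is hypothetical, and treating a missing, empty, or "null" value as equally bad is an assumption.

```python
import json

EGRESS_ANNOTATION = "cloud.network.openshift.io/egress-ipconfig"


def nodes_without_egress_config(node_list):
    """Return names of nodes whose egress-ipconfig annotation is absent, empty, or "null"."""
    bad = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        value = meta.get("annotations", {}).get(EGRESS_ANNOTATION)
        if value in (None, "", "null"):
            bad.append(meta.get("name"))
    return bad


if __name__ == "__main__":
    # Trimmed-down sample in the shape of `oc get nodes -o json`.
    sample = json.loads(
        '{"items": [{"metadata": {"name": "ostest-kncvv-worker-0-qxr5g", '
        '"annotations": {"cloud.network.openshift.io/egress-ipconfig": "null"}}}]}'
    )
    print(nodes_without_egress_config(sample))
```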

Version-Release number of selected component (if applicable):

RHOS-16.1-RHEL-8-20221206.n.1
4.12.0-0.nightly-2022-12-09-063749

How reproducible:

Always

Steps to Reproduce:

1. Run AZ job on D/S CI (Openshift on Openstack QE CI)
2. Run conformance/serial tests

Actual results:

conformance/serial test cases are failing because they do not find the egressIP annotation on the workers

Expected results:

Tests passing

Additional info:

Links provided on private comment.

Description of problem:

In tree test failures on vSphere IPI (non-zonal)

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/6770/pull-ci-openshift-installer-master-e2e-vsphere-ovn/1626234182315282432

https://github.com/openshift/installer/pull/6770

Feb 16 17:22:19.563: INFO: 
Feb 16 17:22:19.853: INFO: skipping dumping cluster info - cluster too large
[DeferCleanup (Each)] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy
  tear down framework | framework.go:193
STEP: Destroying namespace "e2e-fsgroupchangepolicy-1487" for this suite. 02/16/23 17:22:19.853

k8s.io/kubernetes/test/e2e/storage/vsphere.GetReadySchedulableNodeInfos()
	k8s.io/kubernetes@v1.26.1/test/e2e/storage/vsphere/vsphere_utils.go:756 +0x30
k8s.io/kubernetes/test/e2e/storage/drivers.(*vSphereDriver).PrepareTest.func1()
	k8s.io/kubernetes@v1.26.1/test/e2e/storage/drivers/in_tree.go:1292 +0x19
reflect.Value.call({0x77630c0?, 0x8f6f4f0?, 0x13?}, {0x89991da, 0x4}, {0xd2080c8, 0x0, 0x0?})
	reflect/value.go:584 +0x8c5
reflect.Value.Call({0x77630c0?, 0x8f6f4f0?, 0xc005c90000?}, {0xd2080c8?, 0x0?, 0xc000640840?})
	reflect/value.go:368 +0xbc
fail [runtime/panic.go:260]: Test Panicked: runtime error: invalid memory address or nil pointer dereference
Ginkgo exit error 1: exit with code 1

Description of problem:

This is a wrapper bug for the library sync of 4.12

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

After running several scale tests on a large cluster (252 workers), etcd ran out of space and became unavailable.

 

These tests consisted of running our node-density workload (Creates more than 50k pause pods) and cluster-density 4k several times (creates 4k namespaces with https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#cluster-density-variables).

 

The actions above led the etcd peers to run out of free space in their 4GiB PVCs, producing the following error trace:

{"level":"warn","ts":"2023-03-31T09:50:57.532Z","caller":"rafthttp/http.go:271","msg":"failed to save incoming database snapshot","local-member-id":"b14198cd7f0eebf1","remote-snapshot-sender-id":"a4e894c3f4af1379","incoming-snapshot-index ":19490191,"error":"write /var/lib/data/member/snap/tmp774311312: no space left on device"} 

 

Etcd uses 4GiB PVCs to store its data, which seems to be insufficient for this scenario. In addition, unlike non-HyperShift clusters, we are not applying any periodic database defragmentation (on those clusters this is done by cluster-etcd-operator), which could lead to a larger database size.

 

The graph below represents the metrics etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes
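
The gap between the two metrics gives a rough idea of how much space a defragmentation could reclaim. A minimal sketch, assuming the two sizes have already been read from the metrics; the function name and the sample values are hypothetical, not measurements from this cluster:

```python
def fragmentation_ratio(total_size_bytes, in_use_bytes):
    """Fraction of the database file that is allocated but not logically in use."""
    if total_size_bytes <= 0:
        raise ValueError("total size must be positive")
    if not 0 <= in_use_bytes <= total_size_bytes:
        raise ValueError("in-use size must be between 0 and total size")
    return (total_size_bytes - in_use_bytes) / total_size_bytes


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical values: a 4 GiB database file with 1 GiB logically in use.
    print(fragmentation_ratio(4 * gib, 1 * gib))
```

A ratio that stays high over time would suggest that periodic defragmentation is worth enabling.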

 

 

This is a clone of issue OCPBUGS-11922. The following is the description of the original issue:

Description of problem:

A customer was able to limit the nested repository path with "oc adm catalog mirror" by using the "--max-components" argument, but there is no equivalent option in the "oc-mirror" binary, even though we recommend "oc-mirror" for mirroring. For example:
Mirroring works if we mirror as below:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy
Mirroring fails with 401 unauthorized if we add one more nested path, as below:
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

Version-Release number of selected component (if applicable):

 

How reproducible:

We can reproduce the issue by using a registry that does not support deeply nested repository paths

Steps to Reproduce:

1. Create an imageset to mirror any operator

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: ./oc-mirror-metadata
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
    packages:
    - name: local-storage-operator
      channels:
      - name: stable

2. Mirror to a registry that does not support deeply nested repository paths. Here it is GitLab, which does not support nesting beyond 3 levels deep.

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

This mirroring fails with a 401 unauthorized error.

3. Mirroring the same imageset with one path level removed works without any issues, as below:

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy 

Actual results:

 

Expected results:

"oc-mirror" needs an equivalent of the "--max-components" option to limit the nested path
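
To illustrate what limiting path components means, the sketch below collapses a repository path to a maximum depth. It is an illustration only: the function name is hypothetical, joining the surplus components with "-" is an assumed scheme rather than a statement of how either tool behaves, and the image path in the example is made up.

```python
def limit_path_components(repository, max_components):
    """Collapse `repository` to at most `max_components` path components.

    Surplus trailing components are joined with "-" (an assumed scheme).
    """
    if max_components < 1:
        raise ValueError("max_components must be at least 1")
    parts = repository.split("/")
    if len(parts) <= max_components:
        return repository
    head = parts[: max_components - 1]
    tail = "-".join(parts[max_components - 1:])
    return "/".join(head + [tail])


if __name__ == "__main__":
    # Hypothetical destination path that is five levels deep.
    print(limit_path_components("xxx/yyy/zzz/openshift4/some-operator", 3))
```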

Additional info:

 

Deprovisioning can fail with the error:

level=warning msg=unrecognized elastic load balancing resource type listener arn=arn:aws:elasticloadbalancing:us-west-2:460538899914:listener/net/a9ac9f1b3019c4d1299e7ededc92b42b/a6f0655da877ddd4/45e05ee69d99bab0

 

Further background is available in this write up:

https://docs.google.com/document/d/1TsTqIVwHDmjuDjG7v06w_5AAbXSisaDX-UfUI9-GVJo/edit#

 

Incident channel:

incident-aws-leaking-tags-for-deleted-resources

 

Description of problem:

In at least 4.12.0-rc.0, a user with read-only access to ClusterVersion can see an "Update blocked" pop-up talking about "...alert above the visualization...".  It is referencing a banner about "This cluster should not be updated to the next minor version...", but that banner is not displayed because hasPermissionsToUpdate is false, so canPerformUpgrade is false.

Version-Release number of selected component (if applicable):

4.12.0-rc.0. Likely more. I haven't traced it out.

How reproducible:

Always.

Steps to Reproduce:

1. Install 4.12.0-rc.0
2. Create a user with cluster-wide read-only permissions. For me, it's via binding to a sudoer ClusterRole. I'm not sure where that ClusterRole comes from, but it's:

$ oc get -o yaml clusterrole sudoer
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2020-05-21T19:39:09Z"
  name: sudoer
  resourceVersion: "7715"
  uid: 28eb2ffa-dccd-47e8-a2d5-6a95e0e8b1e9
rules:
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:admin
  resources:
  - systemusers
  - users
  verbs:
  - impersonate
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:masters
  resources:
  - groups
  - systemgroups
  verbs:
  - impersonate

3. View /settings/cluster

Actual results:

See the "Update blocked" pop-up talking about "...alert above the visualization...".

Expected results:

Something more internally consistent. E.g. having the referenced banner "...alert above the visualization..." show up, or not having the "Update blocked" pop-up reference the non-existent banner.

Description of problem:

To fetch an individual Helm release, the UI seems to use the list releases endpoint, /api/helm/releases. This has a performance impact, because we could fetch the particular release instead of sending the entire list. The `/api/helm/release` endpoint, which takes the name and ns parameters, serves this purpose.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a helm release
2. Open the network tab and go to the list releases page. Click on the created release.
3. You should see the /api/helm/releases endpoint being used for a particular release too

Actual results:

 

Expected results:

The frontend should use the /api/helm/release endpoint (query params ns and name) instead of calling the list helm releases endpoint and filtering the content.
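
As a rough sketch of the request the frontend would issue, the snippet below builds the single-release URL. The endpoint path and the `ns` and `name` parameter names come from the report; the helper name and the example values are hypothetical, and the real console code is not written in Python.

```python
from urllib.parse import urlencode


def helm_release_url(name, namespace, base="/api/helm/release"):
    """Build the single-release URL with `ns` and `name` query parameters."""
    return base + "?" + urlencode({"ns": namespace, "name": name})


if __name__ == "__main__":
    # Hypothetical release name and namespace.
    print(helm_release_url("my-release", "demo"))
```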

Additional info:

 

reported in https://coreos.slack.com/archives/C027U68LP/p1673010878672479

Description of problem:

Hey guys, I have an OpenShift cluster that was upgraded to version 4.9.58 from version 4.8. After the upgrade was done, the etcd pod on master1 isn't coming up and is crashlooping. It gives the following error: {"level":"fatal","ts":"2023-01-06T12:12:58.709Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 13279, fileSize(313430016) - offset(313418480) - padBytes(1) = entryLimit(11535)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/main.go:40\nmain.main\n\t/remote-source/cachito-gomod-with-deps/app/server/main.go:32\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}
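
The arithmetic in the error message can be replayed directly. This sketch only reproduces the formula printed in the log (fileSize - offset - padBytes = entryLimit) with the reported numbers; the function name is hypothetical and this is not etcd's code.

```python
def wal_entry_limit(file_size, offset, pad_bytes):
    """Bytes left in the WAL file after the current offset, per the logged formula."""
    return file_size - offset - pad_bytes


# Values copied from the error message above.
limit = wal_entry_limit(313430016, 313418480, 1)
rec_bytes = 13279

print(limit)              # 11535
print(rec_bytes > limit)  # True: the record claims more bytes than remain in the file
```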

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

As coded, the dynamic-demo-plugin.spec.ts includes a mandatory 5 minute wait before the test continues.  This wait should be associated with an intercept so the test can continue as soon as the intercept occurs rather than having to wait the entire 5 minutes.
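
The difference between a fixed wait and a bounded wait can be shown with a small polling helper. The actual spec is a Cypress test, so this Python sketch only illustrates the idea; `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(predicate, timeout_s, interval_s=0.5):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Returns immediately when the condition already holds, even with a 5 minute bound.
    print(wait_until(lambda: True, timeout_s=300))
```

With this shape the 5 minutes becomes an upper bound rather than a fixed cost.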

Description of problem:

When the oc-mirror command runs, the generated ImageContentSourcePolicy.yaml should not include mirrors for the mirrored operator catalogs, but currently it does.

This should hold for both registry-located catalogs and OCI FBC catalogs (located on disk).
Jennifer Power, Alex Flom can you help us confirm this is the expected behavior?

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Run the oc mirror command mirroring the catalog
/bin/oc-mirror --config imageSetConfig.yaml  docker://localhost:5000  --use-oci-feature  --dest-use-http  --dest-skip-tls
with imagesetconfig:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /tmp/storageBackend
mirror:
  operators:
  - catalog: oci:///home/user/catalogs/rhop4.12
    # copied from registry.redhat.io/redhat/redhat-operator-index:v4.12
    targetCatalog: "mno/redhat-operator-index"
    targetVersion: "v4.12"
    packages:
    - name: aws-load-balancer-operator

Actual results:

Catalog is included in the imageContentSourcePolicy.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: localhost:5000/mno/redhat-operator-index:v4.12
  sourceType: grpc

---
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  labels:
    operators.openshift.org/catalog: "true"
  name: operator-0
spec:
  repositoryDigestMirrors:
  - mirrors:
    - localhost:5000/albo
    source: registry.redhat.io/albo
  - mirrors:
    - localhost:5000/mno
    source: mno
  - mirrors:
    - localhost:5000/openshift4
    source: registry.redhat.io/openshift4

Expected results:

No catalog should be included in the imageContentSourcePolicy.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: localhost:5000/mno/redhat-operator-index:v4.12
  sourceType: grpc

---
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  labels:
    operators.openshift.org/catalog: "true"
  name: operator-0
spec:
  repositoryDigestMirrors:
  - mirrors:
    - localhost:5000/albo
    source: registry.redhat.io/albo
  - mirrors:
    - localhost:5000/openshift4
    source: registry.redhat.io/openshift4
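
The expected output differs from the actual output only by the entry whose source is the catalog itself. A minimal sketch of that filtering step, assuming the mirror entries are available as plain dictionaries; the function name and the way catalog sources are identified are assumptions, not oc-mirror's implementation:

```python
def drop_catalog_mirrors(digest_mirrors, catalog_sources):
    """Return the mirror entries whose `source` is not a mirrored catalog."""
    excluded = set(catalog_sources)
    return [entry for entry in digest_mirrors if entry["source"] not in excluded]


if __name__ == "__main__":
    # Entries copied from the actual results above.
    mirrors = [
        {"mirrors": ["localhost:5000/albo"], "source": "registry.redhat.io/albo"},
        {"mirrors": ["localhost:5000/mno"], "source": "mno"},
        {"mirrors": ["localhost:5000/openshift4"], "source": "registry.redhat.io/openshift4"},
    ]
    for entry in drop_catalog_mirrors(mirrors, ["mno"]):
        print(entry["source"])
```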

Additional info:

 

Description of problem:

According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-25-071630

How reproducible:

Always, if the Service Usage API is disabled in the GCP project.

Steps to Reproduce:

1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project.
2. Try IPI installation in the GCP project. 

Actual results:

The installation eventually fails, without any worker machines launched.

Expected results:

Installation should succeed, or the OCP doc should be updated.

Additional info:

Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. 
FYI, after enabling the API and without changing anything else, the installation succeeds.

Description of the problem:

When a host is rebooted with the installation ISO after the cluster has been installed it will try to register repeatedly, but the service will reject the request and generate both an error message in the log and an event. This floods both the log and the events.

How reproducible:

Always.

Steps to reproduce:

1. Install a cluster.

2. Reboot one of the nodes using the installation ISO.

Actual results:

The service rejects the registration, writes a message to the log and generates an event.
 
Expected results:

The service should ask the host to stop trying to register.

This is a clone of issue OCPBUGS-10690. The following is the description of the original issue:

Description of problem:

According to PR https://github.com/openshift/cluster-monitoring-operator/pull/1824, the startupProbe for both UWM Prometheus and platform Prometheus should allow 1 hour. After enabling UWM, however, the startupProbe for UWM Prometheus still allows only 15m. Platform Prometheus does not have the issue: its startupProbe is increased to 1 hour.

$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
...

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 240
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-19-052243

How reproducible:

always

Steps to Reproduce:

1. enable UWM, check startupProbe for UWM prometheus/platform prometheus
2.
3.

Actual results:

startupProbe for UWM prometheus is still 15m

Expected results:

startupProbe for UWM prometheus should be 1 hour
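
The 15m and 1 hour figures follow from the probe settings shown above: a startup probe allows roughly failureThreshold × periodSeconds before the container is restarted. A small sketch of that arithmetic, with a hypothetical function name:

```python
def startup_window_seconds(failure_threshold, period_seconds):
    """Approximate time a startup probe allows before giving up."""
    return failure_threshold * period_seconds


# Values from the two pod specs above.
uwm = startup_window_seconds(60, 15)
platform = startup_window_seconds(240, 15)

print(uwm // 60)       # 15 (minutes)
print(platform // 60)  # 60 (minutes)
```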

Additional info:

Since the startupProbe for platform Prometheus is increased to 1 hour and there is no similar bug reported for UWM Prometheus, closing this as won't-fix is OK.

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
This is fixed by the first commit in the upstream Metal³ PR https://github.com/metal3-io/baremetal-operator/pull/1264

Description of problem: Title says Operand required, description says custom resource required.

Steps to Reproduce:


1. Install MCE operator
2. Wait for prompt to install operand

We should have the title say:
Installed operator: custom resource required

This will align with the description text.

FYI Laura Hinson Peter Kreuser

Description of problem:

See https://issues.redhat.com/browse/THREESCALE-9015.  A problem with the Red Hat Integration - 3scale - Managed Application Services operator prevents it from installing correctly, which results in the failure of operator-install-single-namespace.spec.ts integration test.

Description of problem:

In GCP, once an external IP address is assigned to a master/infra node through the GCP console, the number of pending CSRs from kubernetes.io/kubelet-serving keeps increasing, and the following errors are reported:

I0902 10:48:29.254427       1 controller.go:121] Reconciling CSR: csr-q7bwd
I0902 10:48:29.365774       1 csr_check.go:157] csr-q7bwd: CSR does not appear to be client csr
I0902 10:48:29.371827       1 csr_check.go:545] retrieving serving cert from build04-c92hb-master-1.c.openshift-ci-build-farm.internal (10.0.0.5:10250)
I0902 10:48:29.375052       1 csr_check.go:188] Found existing serving cert for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
I0902 10:48:29.375152       1 csr_check.go:192] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
I0902 10:48:29.375166       1 csr_check.go:193] Current SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5], CSR SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5 35.211.234.95]
I0902 10:48:29.375175       1 csr_check.go:202] Falling back to machine-api authorization for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
E0902 10:48:29.375184       1 csr_check.go:420] csr-q7bwd: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.375193       1 csr_check.go:205] Could not use Machine for serving cert authorization: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.379457       1 csr_check.go:218] Falling back to serving cert renewal with Egress IP checks
I0902 10:48:29.382668       1 csr_check.go:221] Could not use current serving cert and egress IPs for renewal: CSR Subject Alternate Names includes unknown IP addresses
I0902 10:48:29.382702       1 controller.go:233] csr-q7bwd: CSR not authorized
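
The rejection in the log comes down to a set comparison: every IP in the CSR's SAN list must be a known machine address. The sketch below is a simplified illustration of that check using the values from the log. It is not the approver's code, and the function name is hypothetical.

```python
def unknown_san_ips(csr_san_ips, machine_addresses):
    """Return CSR SAN IPs that are not among the machine's known addresses."""
    known = set(machine_addresses)
    return [ip for ip in csr_san_ips if ip not in known]


if __name__ == "__main__":
    # IPs taken from the log lines above.
    print(unknown_san_ips(["10.0.0.5", "35.211.234.95"], ["10.0.0.5"]))
```

Any non-empty result corresponds to the "not in machine addresses" error, so the CSR stays pending.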

Version-Release number of selected component (if applicable):

4.11.2

Steps to Reproduce:

1. Assign external IPs to master/infra node in GCP
2. oc get csr | grep kubernetes.io/kubelet-serving

Actual results:

CSRs are not approved

Expected results:

CSRs are approved

Additional info:

This issue only happens in GCP. The same OpenShift installations in AWS do not have this issue.

It looks like the CSRs are created using the external IP addresses once they are assigned.

Ref: https://coreos.slack.com/archives/C03KEQZC1L2/p1662122007083059

Description of problem:

When a normal user selects "All namespaces" using the "Show operands in" radio button, the "Error Loading" error is shown.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-18-192348, 4.11

How reproducible:

Always

Steps to Reproduce:

1. Install the operator "Red Hat Integration - Camel K" on All namespaces
2. Log in to the console as a normal user
3. Navigate to the "All instances" tab for the operator
4. Check that the radio button "All namespaces" is selected
5. Check the page 

Actual results:

The "Error Loading" message is shown on the page

Expected results:

The error should not be shown

Additional info:

 

Description of problem:

Using assisted install from https://console.redhat.com/openshift/clusters, I am installing an openshift cluster.  The installation starts and never completes (it eventually times out after 2 of the 3 nodes have rebooted but can't join the cluster).

Looking at the installation, it seems the wrong machine network is selected (using the wrong interface). As a result, there are logs like this on the nodes after the initial reboot:

"Failed to get API Group-Resources	{"error": "Get \"https://240.0.144.1:443/api?timeout=32s\": dial tcp 240.0.144.1:443: i/o timeout"}"

This seems to be due to the iptables natting of the cluster IP 240.0.144.1 on a 192.168.X.X instead of using the expected 10.65.146.X.
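
Whether a node address belongs to the intended machine network can be checked with the standard `ipaddress` module. The report elides the last octets and the prefix length, so the full addresses and the /24 below are hypothetical values chosen only for illustration.

```python
import ipaddress


def in_machine_network(address, machine_cidr):
    """Return True if `address` falls inside `machine_cidr`."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(machine_cidr)


if __name__ == "__main__":
    # Hypothetical addresses and prefix; the report only gives 10.65.146.X and 192.168.X.X.
    machine_cidr = "10.65.146.0/24"
    print(in_machine_network("10.65.146.20", machine_cidr))
    print(in_machine_network("192.168.1.20", machine_cidr))
```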

Version-Release number of selected component (if applicable):

4.10

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

additional logs/data in sf case 03317540

With the RHCOS 9.2 testing being done before merging in PR https://github.com/openshift/machine-config-operator/pull/3485 we seem to be seeing between 1-3 Azure workers dying during the post-upgrade conformance test. This problem is prevalent in seemingly all sub-jobs of https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1625681132605411328 but we'll focus on just one example for this bug:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-machine-config-operator-3485-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1625681121771524096

In this run, if you view gather-extra nodes.json, we see two nodes where the kubelet stopped reporting status late in the run: worker w2776 stopped reporting at 5:38:29, and worker 7rftw stopped reporting at 5:18:48. These are last transition times, so the actual time of death was possibly a timeout period earlier (possibly 30s, more details on this in a minute).

The death of the nodes is preventing a number of artifacts from being gathered, TRT will fix this ASAP in TRT-855, but this is why we don't see spyglass charts and other artifacts for the conformance suite run, only the upgrade suite which runs prior.

However I was able to identify system journal logs for the affected node in loki:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22Grafana%20Cloud%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fopenshift-machine-config-operator-3485-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade%2F1625681121771524096%5C%22%7D%20%7C%20unpack%20%7C%20host%3D%5C%22ci-op-v9k3j1h2-696e9-c68h4-worker-centralus3-7rftw%5C%22%5Cn%22,%22refId%22:%22A%22%7D%5D,%22range%22:%7B%22from%22:%22now-24h%22,%22to%22:%22now%22%7D%7D

This shows the last reported logs we got for the 7rftw node:

I0215 05:18:14.339779       1 server.go:159] gRPCCall: {"Method":"/csi.v1.Identity/GetPluginInfo","Request":{},"Response":{"name":"csi-mock-e2e-csi-mock-volumes-6973","vendor_version":"v1.7.3"},"Error":"","FullError":null}
I0215 05:18:13.925254       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateVolume
I0215 05:18:14.339737       1 server.go:114] GRPC response: {"name":"csi-mock-e2e-csi-mock-volumes-6973","vendor_version":"v1.7.3"}
I0215 05:18:13.925036       1 controller.go:1279] provision "e2e-csi-mock-volumes-6973/pvc-26t2d" class "e2e-csi-mock-volumes-6973-sccszgp": started
I0215 05:18:11.809725       1 controller.go:859] Started provisioner controller csi-mock-e2e-csi-mock-volumes-6973_csi-mockplugin-0_1e0cbb41-bbbc-40c7-8188-23935d0bb52b!
I0215 05:18:14.339730       1 identityserver.go:28] Using default GetPluginInfo
I0215 05:18:11.809649       1 shared_informer.go:270] caches populated
I0215 05:18:14.339684       1 server.go:105] GRPC request: {}
I0215 05:18:11.709270       1 reflector.go:255] Listing and watching *v1.PersistentVolume from sigs.k8s.io/sig-storage-lib-external-provisioner/v7/controller/controller.go:844
I0215 05:18:14.339663       1 server.go:101] GRPC call: /csi.v1.Identity/GetPluginInfo
I0215 05:18:11.709125       1 reflector.go:219] Starting reflector *v1.PersistentVolume (15m0s) from sigs.k8s.io/sig-storage-lib-external-provisioner/v7/controller/controller.go:844
I0215 05:18:14.338746       1 server.go:159] gRPCCall: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{},"Error":"","FullError":null}
I0215 05:18:11.709166       1 reflector.go:255] Listing and watching *v1.StorageClass from sigs.k8s.io/sig-storage-lib-external-provisioner/v7/controller/controller.go:847
I0215 05:18:14.338726       1 server.go:114] GRPC response: {}
I0215 05:18:11.709125       1 reflector.go:219] Starting reflector *v1.StorageClass (15m0s) from sigs.k8s.io/sig-storage-lib-external-provisioner/v7/controller/controller.go:847
I0215 05:18:14.338668       1 server.go:105] GRPC request: {}
I0215 05:18:11.708919       1 volume_store.go:97] Starting save volume queue
I0215 05:18:11.708909       1 clone_controller.go:82] Started CloningProtection controller
I0215 05:18:14.338617       1 server.go:101] GRPC call: /csi.v1.Identity/Probe
I0215 05:18:11.708813       1 clone_controller.go:66] Starting CloningProtection controller

Roughly 30 seconds later, the node's kubelet Ready condition transitions to Unknown. This looks like a 30s timeout before we give up on the node response.
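
The gap can be computed from the two timestamps already quoted: the latest log line for 7rftw is at 05:18:14 and the last transition time is 5:18:48. A small sketch of that calculation, with a hypothetical helper name:

```python
from datetime import datetime


def seconds_between(earlier, later, fmt="%H:%M:%S"):
    """Seconds elapsed between two same-day clock times."""
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds()


# Latest log timestamp (truncated to seconds) and the Ready condition transition time.
print(seconds_between("05:18:14", "05:18:48"))  # 34.0
```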

Worker ci-op-v9k3j1h2-696e9-c68h4-worker-centralus2-w2776 which died at 5:38:29 has different logs, but still in the storage area:

I0215 05:37:42.812659       1 server.go:159] gRPCCall: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{},"Error":"","FullError":null}
I0215 05:37:42.812631       1 server.go:114] GRPC response: {}
I0215 05:37:42.812548       1 server.go:105] GRPC request: {}
I0215 05:37:42.812510       1 server.go:101] GRPC call: /csi.v1.Identity/Probe
I0215 05:37:42.812659       1 server.go:159] gRPCCall: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{},"Error":"","FullError":null}
I0215 05:37:42.812631       1 server.go:114] GRPC response: {}
I0215 05:37:42.812548       1 server.go:105] GRPC request: {}
I0215 05:37:42.812510       1 server.go:101] GRPC call: /csi.v1.Identity/Probe
I0215 05:37:42.812659       1 server.go:159] gRPCCall: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{},"Error":"","FullError":null}
I0215 05:37:42.812631       1 server.go:114] GRPC response: {}
I0215 05:37:42.812548       1 server.go:105] GRPC request: {}
I0215 05:37:42.812510       1 server.go:101] GRPC call: /csi.v1.Identity/Probe
I0215 05:37:42.812659       1 server.go:159] gRPCCall: {"Method":"/csi.v1.Identity/Probe","Request":{},"Response":{},"Error":"","FullError":null}
I0215 05:37:42.812631       1 server.go:114] GRPC response: {}
I0215 05:37:42.812548       1 server.go:105] GRPC request: {}
I0215 05:37:42.812510       1 server.go:101] GRPC call: /csi.v1.Identity/Probe
I0215 05:37:42.117439       1 controller.go:151] Successfully synced 'e2e-volume-8456/hostpath-injector'
I0215 05:37:42.117379       1 controller.go:192] Received pod 'hostpath-injector'

Kernel panic logs found for both dead nodes:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-machine-config-operator-3485-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1625681121771524096/artifacts/e2e-azure-sdn-upgrade/gather-azure-cli/artifacts/ci-op-v9k3j1h2-696e9-c68h4-worker-centralus2-w2776-boot.log

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-machine-config-operator-3485-ci-4.13-upgrade-from-stable-4.12-e2e-azure-sdn-upgrade/1625681121771524096/artifacts/e2e-azure-sdn-upgrade/gather-azure-cli/artifacts/ci-op-v9k3j1h2-696e9-c68h4-worker-centralus3-7rftw-boot.log

Extracted Kernel Panic

[ 1464.958061] general protection fault, probably for non-canonical address 0x13c0080013a0108: 0000 [#1] PREEMPT SMP PTI
[ 1464.962560] CPU: 2 PID: 134527 Comm: runc Not tainted 5.14.0-252.el9.x86_64 #1
[ 1464.965635] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 1464.970125] RIP: 0010:__mod_lruvec_page_state+0x80/0x160
[ 1464.972441] Code: 48 83 e0 fc 83 e2 02 74 04 48 8b 40 10 48 85 c0 0f 84 aa 00 00 00 0f 1f 44 00 00 49 63 96 00 a2 02 00 48 8b bc d0 88 08 00 00 <4c> 3b b7 88 00 00 00 0f 85 b8 00 00 00 44 89 e2 89 ee e8 19 ff ff
[ 1464.980206] RSP: 0018:ffffafa4c5b37ad8 EFLAGS: 00010086
[ 1464.982341] RAX: ffff9ec2eed7b000 RBX: ffffeb2849b038c0 RCX: 0000000000000000
[ 1464.985183] RDX: 0000000000000000 RSI: 0000000000000013 RDI: 013c0080013a0080
[ 1464.988173] RBP: 0000000000000013 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 1464.991184] R10: 0000000000000040 R11: 0000000000000003 R12: 00000000ffffffff
[ 1464.994003] R13: ffffeb2849b038c0 R14: ffff9ec53ffd5000 R15: ffff9ec4126920f8
[ 1464.996953] FS: 00007f8507f8d640(0000) GS:ffff9ec52fd00000(0000) knlGS:0000000000000000
[ 1464.999831] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1465.002083] CR2: 00007f8507751000 CR3: 00000003421ce004 CR4: 00000000003706e0
[ 1465.004977] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1465.007534] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1465.010194] Call Trace:
[ 1465.011257] <TASK>
[ 1465.012046] filemap_unaccount_folio+0x61/0x1c0
[ 1465.013802] __filemap_remove_folio+0x39/0x240
[ 1465.015481] ? xas_load+0x5/0xa0
[ 1465.016728] ? xas_find+0x171/0x1c0
[ 1465.017948] ? find_lock_entries+0x18e/0x250
[ 1465.019549] filemap_remove_folio+0x3f/0xa0
[ 1465.021113] truncate_inode_folio+0x1f/0x40
[ 1465.022904] shmem_undo_range+0x1ab/0x6e0
[ 1465.024392] ? try_to_unlazy+0x4c/0x90
[ 1465.025779] ? terminate_walk+0x61/0xf0
[ 1465.027332] ? avc_has_perm_noaudit+0x94/0x110
[ 1465.029022] ? rcu_nocb_try_bypass+0x4d/0x440
[ 1465.030600] shmem_evict_inode+0x103/0x270
[ 1465.032123] ? __ia32_sys_membarrier+0x20/0x20
[ 1465.033845] evict+0xcf/0x1d0
[ 1465.035042] do_unlinkat+0x1dc/0x2e0
[ 1465.036415] __x64_sys_unlinkat+0x33/0x60
[ 1465.037864] do_syscall_64+0x5c/0x90
[ 1465.039194] ? syscall_exit_work+0x11a/0x150
[ 1465.040875] ? syscall_exit_to_user_mode+0x12/0x30
[ 1465.043436] ? do_syscall_64+0x69/0x90
[ 1465.045527] ? syscall_exit_to_user_mode+0x12/0x30
[ 1465.048047] ? do_syscall_64+0x69/0x90
[ 1465.050176] ? syscall_exit_work+0x11a/0x150
[ 1465.052548] ? syscall_exit_to_user_mode+0x12/0x30
[ 1465.054987] ? do_syscall_64+0x69/0x90
[ 1465.057104] ? do_syscall_64+0x69/0x90
[ 1465.059253] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1465.062259] RIP: 0033:0x5587bac69d7b
[ 1465.064433] Code: e8 4a 2e fb ff eb 88 cc cc cc cc cc cc cc cc e8 bb 75 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 1465.072361] RSP: 002b:000000c0001badb0 EFLAGS: 00000202 ORIG_RAX: 0000000000000107
[ 1465.075961] RAX: ffffffffffffffda RBX: 000000c000032500 RCX: 00005587bac69d7b
[ 1465.079123] RDX: 0000000000000000 RSI: 000000c000092420 RDI: 000000000000000a
[ 1465.082276] RBP: 000000c0001bae10 R08: 00007f852f3c0501 R09: 0000000000000001
[ 1465.085456] R10: 00007f850774d910 R11: 0000000000000202 R12: 0000000000000000
[ 1465.088630] R13: 0000000000000000 R14: 000000c0000041a0 R15: 0000000000000042
[ 1465.091992] </TASK>
[ 1465.093442] Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache netfs nfsd auth_rpcgss nfs_acl lockd grace loop nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver dummy xt_sctp veth nf_conntrack_netlink xt_recent xt_statistic xt_nat xt_addrtype xt_LOG nf_log_syslog ipt_REJECT nf_reject_ipv4 xt_CT xt_MASQUERADE nft_chain_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables rfkill vxlan ip6_udp_tunnel udp_tunnel nfnetlink_cttimeout nfnetlink openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common isst_if_mbox_msr isst_if_common nfit libnvdimm kvm_intel kvm hyperv_drm irqbypass rapl drm_shmem_helper drm_kms_helper syscopyarea pcspkr sysfillrect sysimgblt hv_balloon fb_sys_fops hv_utils joydev ip_tables drm rpcrdma sunrpc rdma_ucm ib_srpt xfs ib_isert iscsi_target_mod target_core_mod ib_iser libcrc32c libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm dm_multipath ib_cm mlx5_ib
[ 1465.093517] ib_uverbs ib_core mlx5_core mlxfw psample tls pci_hyperv pci_hyperv_intf sd_mod sg nvme_tcp nvme_fabrics nvme serio_raw hid_hyperv nvme_core hv_netvsc hv_storvsc hyperv_keyboard scsi_transport_fc nvme_common t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel hv_vmbus ghash_clmulni_intel dm_mirror dm_region_hash dm_log dm_mod fuse
[ 1465.143626] ---[ end trace 155f508cadcc322c ]---
[ 1465.146224] RIP: 0010:__mod_lruvec_page_state+0x80/0x160
[ 1465.148861] Code: 48 83 e0 fc 83 e2 02 74 04 48 8b 40 10 48 85 c0 0f 84 aa 00 00 00 0f 1f 44 00 00 49 63 96 00 a2 02 00 48 8b bc d0 88 08 00 00 <4c> 3b b7 88 00 00 00 0f 85 b8 00 00 00 44 89 e2 89 ee e8 19 ff ff
[ 1465.157470] RSP: 0018:ffffafa4c5b37ad8 EFLAGS: 00010086
[ 1465.160294] RAX: ffff9ec2eed7b000 RBX: ffffeb2849b038c0 RCX: 0000000000000000
[ 1465.163748] RDX: 0000000000000000 RSI: 0000000000000013 RDI: 013c0080013a0080
[ 1465.167162] RBP: 0000000000000013 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 1465.170654] R10: 0000000000000040 R11: 0000000000000003 R12: 00000000ffffffff
[ 1465.174005] R13: ffffeb2849b038c0 R14: ffff9ec53ffd5000 R15: ffff9ec4126920f8
[ 1465.177511] FS: 00007f8507f8d640(0000) GS:ffff9ec52fd00000(0000) knlGS:0000000000000000
[ 1465.181148] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1465.184004] CR2: 00007f8507751000 CR3: 00000003421ce004 CR4: 00000000003706e0
[ 1465.187763] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1465.191128] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1465.194690] Kernel panic - not syncing: Fatal exception
[ 1465.198795] Kernel Offset: 0x20400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1465.223696] Rebooting in 10 seconds..
[ 1475.227502] list_add double add: new=ffffffffa2e331c0, prev=ffffffffa2e2cf28, next=ffffffffa2e331c0.
[ 1475.232317] ------------[ cut here ]------------
[ 1475.234979] kernel BUG at lib/list_debug.c:29!
[ 1475.237600] invalid opcode: 0000 [#2] PREEMPT SMP PTI
[ 1475.240503] CPU: 2 PID: 134527 Comm: runc Tainted: G D --------- --- 5.14.0-252.el9.x86_64 #1
[ 1475.245316] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 1475.250685] RIP: 0010:__list_add_valid.cold+0x26/0x3f
[ 1475.253826] Code: f7 62 a8 ff 4c 89 c1 48 c7 c7 30 f9 84 a2 e8 0b f1 fe ff 0f 0b 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 e0 f9 84 a2 e8 f4 f0 fe ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 88 f9 84 a2 e8 dd f0 fe
[ 1475.263241] RSP: 0018:ffffafa4c5b37860 EFLAGS: 00010046
[ 1475.266275] RAX: 0000000000000058 RBX: ffffffffa2e331c0 RCX: 0000000000000027
[ 1475.270345] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9ec52fd198a0
[ 1475.274039] RBP: ffffffffa2e2cf28 R08: 0000000000000000 R09: 00000000ffff7fff
[ 1475.278029] R10: ffffafa4c5b37708 R11: ffffffffa31e9608 R12: ffffffffa2e331c0
[ 1475.281722] R13: 0000000000000000 R14: 0000000000000092 R15: ffffffffa2e2cf20
[ 1475.285487] FS: 00007f8507f8d640(0000) GS:ffff9ec52fd00000(0000) knlGS:0000000000000000
[ 1475.289419] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1475.292873] CR2: 00007f8507751000 CR3: 00000003421ce004 CR4: 00000000003706e0
[ 1475.296504] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1475.300061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1475.303628] Call Trace:
[ 1475.305472] <TASK>
[ 1475.307126] __register_nmi_handler+0xd3/0x130
[ 1475.309771] nmi_shootdown_cpus+0x38/0x90
[ 1475.312185] native_machine_emergency_restart+0x1b0/0x1d0
[ 1475.314903] panic+0x262/0x2c6
[ 1475.317129] oops_end.cold+0x18/0x18
[ 1475.319163] exc_general_protection+0x1b3/0x400
[ 1475.321660] asm_exc_general_protection+0x22/0x30
[ 1475.324118] RIP: 0010:__mod_lruvec_page_state+0x80/0x160
[ 1475.326807] Code: 48 83 e0 fc 83 e2 02 74 04 48 8b 40 10 48 85 c0 0f 84 aa 00 00 00 0f 1f 44 00 00 49 63 96 00 a2 02 00 48 8b bc d0 88 08 00 00 <4c> 3b b7 88 00 00 00 0f 85 b8 00 00 00 44 89 e2 89 ee e8 19 ff ff
[ 1475.335288] RSP: 0018:ffffafa4c5b37ad8 EFLAGS: 00010086
[ 1475.337989] RAX: ffff9ec2eed7b000 RBX: ffffeb2849b038c0 RCX: 0000000000000000
[ 1475.341320] RDX: 0000000000000000 RSI: 0000000000000013 RDI: 013c0080013a0080
[ 1475.344575] RBP: 0000000000000013 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 1475.347866] R10: 0000000000000040 R11: 0000000000000003 R12: 00000000ffffffff
[ 1475.351195] R13: ffffeb2849b038c0 R14: ffff9ec53ffd5000 R15: ffff9ec4126920f8
[ 1475.354434] ? __mod_lruvec_page_state+0x3e/0x160
[ 1475.356810] filemap_unaccount_folio+0x61/0x1c0
[ 1475.359215] __filemap_remove_folio+0x39/0x240
[ 1475.361610] ? xas_load+0x5/0xa0
[ 1475.363492] ? xas_find+0x171/0x1c0
[ 1475.365531] ? find_lock_entries+0x18e/0x250
[ 1475.367763] filemap_remove_folio+0x3f/0xa0
[ 1475.369999] truncate_inode_folio+0x1f/0x40
[ 1475.372361] shmem_undo_range+0x1ab/0x6e0
[ 1475.374430] ? try_to_unlazy+0x4c/0x90
[ 1475.376528] ? terminate_walk+0x61/0xf0
[ 1475.378603] ? avc_has_perm_noaudit+0x94/0x110
[ 1475.380932] ? rcu_nocb_try_bypass+0x4d/0x440
[ 1475.383093] shmem_evict_inode+0x103/0x270
[ 1475.385135] ? __ia32_sys_membarrier+0x20/0x20
[ 1475.387296] evict+0xcf/0x1d0
[ 1475.388947] do_unlinkat+0x1dc/0x2e0
[ 1475.390893] __x64_sys_unlinkat+0x33/0x60
[ 1475.392955] do_syscall_64+0x5c/0x90
[ 1475.394751] ? syscall_exit_work+0x11a/0x150
[ 1475.396993] ? syscall_exit_to_user_mode+0x12/0x30
[ 1475.399207] ? do_syscall_64+0x69/0x90
[ 1475.401059] ? syscall_exit_to_user_mode+0x12/0x30
[ 1475.403470] ? do_syscall_64+0x69/0x90
[ 1475.405458] ? syscall_exit_work+0x11a/0x150
[ 1475.407707] ? syscall_exit_to_user_mode+0x12/0x30
[ 1475.409869] ? do_syscall_64+0x69/0x90
[ 1475.411808] ? do_syscall_64+0x69/0x90
[ 1475.413633] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1475.415902] RIP: 0033:0x5587bac69d7b
[ 1475.417695] Code: e8 4a 2e fb ff eb 88 cc cc cc cc cc cc cc cc e8 bb 75 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 1475.425184] RSP: 002b:000000c0001badb0 EFLAGS: 00000202 ORIG_RAX: 0000000000000107
[ 1475.428625] RAX: ffffffffffffffda RBX: 000000c000032500 RCX: 00005587bac69d7b
[ 1475.431905] RDX: 0000000000000000 RSI: 000000c000092420 RDI: 000000000000000a
[ 1475.435140] RBP: 000000c0001bae10 R08: 00007f852f3c0501 R09: 0000000000000001
[ 1475.438132] R10: 00007f850774d910 R11: 0000000000000202 R12: 0000000000000000
[ 1475.441230] R13: 0000000000000000 R14: 000000c0000041a0 R15: 0000000000000042
[ 1475.444543] </TASK>
[ 1475.445891] Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache netfs nfsd auth_rpcgss nfs_acl lockd grace loop nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver dummy xt_sctp veth nf_conntrack_netlink xt_recent xt_statistic xt_nat xt_addrtype xt_LOG nf_log_syslog ipt_REJECT nf_reject_ipv4 xt_CT xt_MASQUERADE nft_chain_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables rfkill vxlan ip6_udp_tunnel udp_tunnel nfnetlink_cttimeout nfnetlink openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common isst_if_mbox_msr isst_if_common nfit libnvdimm kvm_intel kvm hyperv_drm irqbypass rapl drm_shmem_helper drm_kms_helper syscopyarea pcspkr sysfillrect sysimgblt hv_balloon fb_sys_fops hv_utils joydev ip_tables drm rpcrdma sunrpc rdma_ucm ib_srpt xfs ib_isert iscsi_target_mod target_core_mod ib_iser libcrc32c libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm dm_multipath ib_cm mlx5_ib
[ 1475.445971] ib_uverbs ib_core mlx5_core mlxfw psample tls pci_hyperv pci_hyperv_intf sd_mod sg nvme_tcp nvme_fabrics nvme serio_raw hid_hyperv nvme_core hv_netvsc hv_storvsc hyperv_keyboard scsi_transport_fc nvme_common t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel hv_vmbus ghash_clmulni_intel dm_mirror dm_region_hash dm_log dm_mod fuse
[ 1475.493461] ---[ end trace 155f508cadcc322d ]---
[ 1475.495786] RIP: 0010:__mod_lruvec_page_state+0x80/0x160
[ 1475.498430] Code: 48 83 e0 fc 83 e2 02 74 04 48 8b 40 10 48 85 c0 0f 84 aa 00 00 00 0f 1f 44 00 00 49 63 96 00 a2 02 00 48 8b bc d0 88 08 00 00 <4c> 3b b7 88 00 00 00 0f 85 b8 00 00 00 44 89 e2 89 ee e8 19 ff ff
[ 1475.506822] RSP: 0018:ffffafa4c5b37ad8 EFLAGS: 00010086
[ 1475.509841] RAX: ffff9ec2eed7b000 RBX: ffffeb2849b038c0 RCX: 0000000000000000
[ 1475.513155] RDX: 0000000000000000 RSI: 0000000000000013 RDI: 013c0080013a0080
[ 1475.516479] RBP: 0000000000000013 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 1475.519839] R10: 0000000000000040 R11: 0000000000000003 R12: 00000000ffffffff
[ 1475.523178] R13: ffffeb2849b038c0 R14: ffff9ec53ffd5000 R15: ffff9ec4126920f8
[ 1475.526502] FS: 00007f8507f8d640(0000) GS:ffff9ec52fd00000(0000) knlGS:0000000000000000
[ 1475.530228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1475.533110] CR2: 00007f8507751000 CR3: 00000003421ce004 CR4: 00000000003706e0
[ 1475.536549] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1475.539901] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1475.543512] Kernel panic - not syncing: Fatal exception
[ 1475.546193] Kernel Offset: 0x20400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) 

This is a clone of issue OCPBUGS-8224. The following is the description of the original issue:

Description of problem:

When installing a cluster on IBM Cloud, the image registry defaults to Removed and no storage is configured, in builds after 4.13.0-ec.3.
The image registry should use ibmcos object storage on an IPI IBM Cloud cluster.
https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/storage.go#L182 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-27-101545

How reproducible:

always

Steps to Reproduce:

1. Install an IPI cluster on IBM Cloud.
2. Check the image registry after the installation succeeds.

Actual results:

oc get config.image/cluster -o yaml 
  spec:
    logLevel: Normal
    managementState: Removed
    observedConfig: null
    operatorLogLevel: Normal
    proxy: {}
    replicas: 1
    requests:
      read:
        maxWaitInQueue: 0s
      write:
        maxWaitInQueue: 0s
    rolloutStrategy: RollingUpdate
    storage: {}
    unsupportedConfigOverrides: null
oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-03-02T02:21:06Z"
  generation: 1
  name: cluster
  resourceVersion: "531"
  uid: 8d61a1e2-3852-40a2-bf5d-b7f9c92cda7b
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.wxjibm32.ibmcloud.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.wxjibm32.ibmcloud.qe.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  etcdDiscoveryDomain: ""
  infrastructureName: wxjibm32-lmqh7
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      cisInstanceCRN: 'crn:v1:bluemix:public:internet-svcs:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:e8ee6ca1-4b31-4307-8190-e67f6925f83b::'
      location: eu-gb
      providerType: VPC
      resourceGroupName: wxjibm32-lmqh7
    type: IBMCloud 

Expected results:

Image registry should use ibmcos object storage on IPI-IBM cluster 
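
The symptom shown in the actual results can be checked mechanically. The following is a minimal sketch, assuming both resources were fetched with `-o json` rather than `-o yaml` and parsed into dictionaries. The function name `registry_removed_on_ibmcloud` is hypothetical and is not part of any OpenShift tooling.

```python
import json


def registry_removed_on_ibmcloud(registry_config: dict, infrastructure: dict) -> bool:
    """Return True when the cluster matches the symptom in this report.

    The symptom: the platform is IBMCloud, yet the image registry is in
    the Removed management state with an empty storage stanza.
    """
    platform = infrastructure.get("status", {}).get("platform")
    spec = registry_config.get("spec", {})
    removed = spec.get("managementState") == "Removed"
    no_storage = not spec.get("storage")  # {} or a missing key both count
    return platform == "IBMCloud" and removed and no_storage


# Trimmed-down copies of the two objects shown in this report.
registry = json.loads('{"spec": {"managementState": "Removed", "storage": {}}}')
infra = json.loads('{"status": {"platform": "IBMCloud"}}')

print(registry_removed_on_ibmcloud(registry, infra))  # prints True
```

A cluster whose registry is Managed, or whose storage stanza is populated, makes the function return False.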

Additional info:

Must-gather log https://drive.google.com/file/d/1N-WUOZLRjlXcZI0t2O6MXsxwnsVPDCGQ/view?usp=share_link 

This is a clone of issue OCPBUGS-11369. The following is the description of the original issue:

Description of problem:

In the control plane machine set operator we perform e2e periodic tests that check the ability to do a rolling update of an entire OCP control plane.

This is quite an involved test, as we need to drain and replace all the master machines/nodes, altering operators, waiting for machines to come up and bootstrap, and waiting for nodes to drain and move their workloads to others while respecting PDBs and etcd quorum.

As such, we need to make sure we are robust to transient issues, occasional slowdowns, and network errors.

We have investigated these timeout issues and identified some common culprits that we want to address, see: https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1678966522151799
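
As an illustration of what tolerance to transient errors can look like in a wait loop, here is a minimal sketch. It is not the operator's actual test code; `poll_until` and `flaky_check` are hypothetical names, and treating every exception as transient is an assumption made for brevity.

```python
import time
from typing import Callable


def poll_until(condition: Callable[[], bool], timeout: float, interval: float) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Exceptions raised by `condition` are assumed to be transient and are
    retried, so a single network error does not fail the whole wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if condition():
                return True
        except Exception:
            pass  # assumed transient; try again on the next tick
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


attempts = {"n": 0}


def flaky_check() -> bool:
    """Fail twice with a simulated network error, then succeed."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated transient API error")
    return True


print(poll_until(flaky_check, timeout=5.0, interval=0.01))  # prints True
```

The check succeeds on the third attempt instead of failing on the first error. A real test would likely narrow the set of exceptions it retries.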

This is a clone of issue OCPBUGS-7516. The following is the description of the original issue:

Description of problem:

CPMS creates two replacement machines when a master machine is deleted on vSphere.

Sorry, I have to revisit https://issues.redhat.com/browse/OCPBUGS-4297. All the related PRs appear to be merged, but today I hit the issue twice on the template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci and once on the template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-13-235211

How reproducible:

Three times

Steps to Reproduce:

1. On the template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci, I first hit this after updating all 3 master machines with the RollingUpdate strategy and then deleting a master machine. The redundant machine appears to have been deleted automatically, because only one replacement machine remained when I checked again.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-djlxv-2   Running                          47m
huliu-vs15b-75tr7-master-h76sp-1   Running                          58m
huliu-vs15b-75tr7-master-wtzb7-0   Running                          70m
huliu-vs15b-75tr7-worker-gzsp9     Running                          4h43m
huliu-vs15b-75tr7-worker-vcqqh     Running                          4h43m
winworker-4cltm                    Running                          4h19m
winworker-qd4c4                    Running                          4h19m
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15b-75tr7-master-djlxv-2
machine.machine.openshift.io "huliu-vs15b-75tr7-master-djlxv-2" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-bzd4h-2   Provisioning                          34s
huliu-vs15b-75tr7-master-djlxv-2   Deleting                              48m
huliu-vs15b-75tr7-master-gzhlk-2   Provisioning                          35s
huliu-vs15b-75tr7-master-h76sp-1   Running                               59m
huliu-vs15b-75tr7-master-wtzb7-0   Running                               70m
huliu-vs15b-75tr7-worker-gzsp9     Running                               4h44m
huliu-vs15b-75tr7-worker-vcqqh     Running                               4h44m
winworker-4cltm                    Running                               4h20m
winworker-qd4c4                    Running                               4h20m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-bzd4h-2   Running                          38m
huliu-vs15b-75tr7-master-h76sp-1   Running                          97m
huliu-vs15b-75tr7-master-wtzb7-0   Running                          108m
huliu-vs15b-75tr7-worker-gzsp9     Running                          5h22m
huliu-vs15b-75tr7-worker-vcqqh     Running                          5h22m
winworker-4cltm                    Running                          4h57m
winworker-qd4c4                    Running                          4h57m 

2. I then changed the strategy to OnDelete, updated all 3 master machines with the OnDelete strategy, and deleted a master machine.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-hzhgq-0   Running                          137m
huliu-vs15b-75tr7-master-kj9zf-2   Running                          89m
huliu-vs15b-75tr7-master-kz6cx-1   Running                          59m
huliu-vs15b-75tr7-worker-gzsp9     Running                          7h46m
huliu-vs15b-75tr7-worker-vcqqh     Running                          7h46m
winworker-4cltm                    Running                          7h21m
winworker-qd4c4                    Running                          7h21m
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15b-75tr7-master-hzhgq-0
machine.machine.openshift.io "huliu-vs15b-75tr7-master-hzhgq-0" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-hzhgq-0   Deleting                              138m
huliu-vs15b-75tr7-master-kb687-0   Provisioning                          26s
huliu-vs15b-75tr7-master-kj9zf-2   Running                               90m
huliu-vs15b-75tr7-master-kz6cx-1   Running                               60m
huliu-vs15b-75tr7-master-qn6kq-0   Provisioning                          26s
huliu-vs15b-75tr7-worker-gzsp9     Running                               7h47m
huliu-vs15b-75tr7-worker-vcqqh     Running                               7h47m
winworker-4cltm                    Running                               7h22m
winworker-qd4c4                    Running                               7h22m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15b-75tr7-master-kb687-0   Running                          154m
huliu-vs15b-75tr7-master-kj9zf-2   Running                          4h5m
huliu-vs15b-75tr7-master-kz6cx-1   Running                          3h34m
huliu-vs15b-75tr7-master-qn6kq-0   Running                          154m
huliu-vs15b-75tr7-worker-gzsp9     Running                          10h
huliu-vs15b-75tr7-worker-vcqqh     Running                          10h
winworker-4cltm                    Running                          9h
winworker-qd4c4                    Running                          9h
liuhuali@Lius-MacBook-Pro huali-test % oc get co     
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      5h13m   
baremetal                                  4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
cloud-controller-manager                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
cloud-credential                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
cluster-autoscaler                         4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
config-operator                            4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
console                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      145m    
control-plane-machine-set                  4.13.0-0.nightly-2023-02-13-235211   True        False         True       10h     Observed 1 updated machine(s) in excess for index 0
csi-snapshot-controller                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
dns                                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
etcd                                       4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
image-registry                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
ingress                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
insights                                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-apiserver                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-controller-manager                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-scheduler                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
kube-storage-version-migrator              4.13.0-0.nightly-2023-02-13-235211   True        False         False      6h18m   
machine-api                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
machine-approver                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
machine-config                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      3h59m   
marketplace                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
monitoring                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
network                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
node-tuning                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
openshift-apiserver                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      145m    
openshift-controller-manager               4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
openshift-samples                          4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
operator-lifecycle-manager                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-02-13-235211   True        False         False      6h7m    
service-ca                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      10h     
storage                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      3h57m   
liuhuali@Lius-MacBook-Pro huali-test %  

3. On the template cluster ipi-on-vsphere/versioned-installer-vmc7-ovn, updating all 3 master machines with the RollingUpdate strategy caused no issue, and neither did deleting a master machine afterwards. I then changed the strategy to OnDelete and replaced the master machines one by one; when I deleted the last one, two replacement machines were created.

liuhuali@Lius-MacBook-Pro huali-test % oc get co 
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      73m     
baremetal                                  4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
cloud-controller-manager                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
cloud-credential                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
cluster-autoscaler                         4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
config-operator                            4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
console                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      129m    
control-plane-machine-set                  4.13.0-0.nightly-2023-02-13-235211   True        True          False      9h      Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
dns                                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
etcd                                       4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
image-registry                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
ingress                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
insights                                   4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
kube-apiserver                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
kube-controller-manager                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
kube-scheduler                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
kube-storage-version-migrator              4.13.0-0.nightly-2023-02-13-235211   True        False         False      3h22m   
machine-api                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
machine-approver                           4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
machine-config                             4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
marketplace                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
monitoring                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      8h      
network                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
node-tuning                                4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
openshift-apiserver                        4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
openshift-controller-manager               4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
openshift-samples                          4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
operator-lifecycle-manager                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-02-13-235211   True        False         False      46m     
service-ca                                 4.13.0-0.nightly-2023-02-13-235211   True        False         False      9h      
storage                                    4.13.0-0.nightly-2023-02-13-235211   True        False         False      77m    
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15a-kjm6h-master-55s4l-1   Running                          84m
huliu-vs15a-kjm6h-master-ppc55-2   Running                          3h4m
huliu-vs15a-kjm6h-master-rqb52-0   Running                          53m
huliu-vs15a-kjm6h-worker-6nbz7     Running                          9h
huliu-vs15a-kjm6h-worker-g84xg     Running                          9h
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs15a-kjm6h-master-ppc55-2
machine.machine.openshift.io "huliu-vs15a-kjm6h-master-ppc55-2" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs15a-kjm6h-master-55s4l-1   Running                               85m
huliu-vs15a-kjm6h-master-cvwzz-2   Provisioning                          27s
huliu-vs15a-kjm6h-master-ppc55-2   Deleting                              3h5m
huliu-vs15a-kjm6h-master-qp9m5-2   Provisioning                          27s
huliu-vs15a-kjm6h-master-rqb52-0   Running                               54m
huliu-vs15a-kjm6h-worker-6nbz7     Running                               9h
huliu-vs15a-kjm6h-worker-g84xg     Running                               9h
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE   REGION   ZONE   AGE
huliu-vs15a-kjm6h-master-55s4l-1   Running                          163m
huliu-vs15a-kjm6h-master-cvwzz-2   Running                          79m
huliu-vs15a-kjm6h-master-qp9m5-2   Running                          79m
huliu-vs15a-kjm6h-master-rqb52-0   Running                          133m
huliu-vs15a-kjm6h-worker-6nbz7     Running                          10h
huliu-vs15a-kjm6h-worker-g84xg     Running                          10h
liuhuali@Lius-MacBook-Pro huali-test % 

Actual results:

CPMS creates two replacement machines when a master machine is deleted, and both replacement machines persist for a long time.

Expected results:

CPMS should create only one replacement machine when a master machine is deleted, or quickly delete the redundant machine.
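
As a rough illustration of how the redundant replacement could be detected from the `oc get machine` output above, the following sketch groups master machines by the index at the end of their name and reports any index that has more than one machine not in the Deleting phase. The function names and the assumption that the index is the last dash-separated token of the machine name are illustrative only, not part of CPMS.

```python
from collections import defaultdict


def masters_per_index(oc_output):
    """Group non-Deleting master machine names by their trailing index."""
    groups = defaultdict(list)
    for line in oc_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, phase = fields[0], fields[1]
        if "-master-" not in name or phase == "Deleting":
            continue
        # Assumption: the control plane index is the last "-"-separated token.
        index = name.rsplit("-", 1)[1]
        groups[index].append(name)
    return dict(groups)


def redundant_replacements(oc_output):
    """Return only the indexes that currently have more than one machine."""
    return {i: names for i, names in masters_per_index(oc_output).items()
            if len(names) > 1}


SAMPLE = """\
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs15a-kjm6h-master-55s4l-1   Running                               85m
huliu-vs15a-kjm6h-master-cvwzz-2   Provisioning                          27s
huliu-vs15a-kjm6h-master-ppc55-2   Deleting                              3h5m
huliu-vs15a-kjm6h-master-qp9m5-2   Provisioning                          27s
huliu-vs15a-kjm6h-master-rqb52-0   Running                               54m
huliu-vs15a-kjm6h-worker-6nbz7     Running                               9h
huliu-vs15a-kjm6h-worker-g84xg     Running                               9h
"""

if __name__ == "__main__":
    print(redundant_replacements(SAMPLE))
```

Run against the sample taken from this report, index 2 is flagged with both the cvwzz and qp9m5 machines.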

Additional info:

Must-gather: https://drive.google.com/file/d/1aCyFn9okNxRz7nE3Yt_8g6Kx7sPSGCg2/view?usp=sharing for ipi-on-vsphere/versioned-installer-vmc7-ovn-winc-thin_pvc-ci template cluster
https://drive.google.com/file/d/1i0fWSP0-HqfdV5E0wcNevognLUQKecvl/view?usp=sharing for ipi-on-vsphere/versioned-installer-vmc7-ovn template cluster

Description of problem:

When running the console in development mode per https://github.com/openshift/console#frontend-development, metrics do not load on the cluster overview, pods list page, pod details page (Metrics tab is missing), etc.

Samuel Padgett suspects the changes in https://github.com/openshift/console/commit/0bd839da219462ea585183de1c856fb60e9f96fb are related.

Description of problem:

The route-controller-manager ingress-to-route controller fails with validation errors when creating routes from ingresses:

> ValidationError(Route.spec.to): missing required field "kind" in io.openshift.route.v1.Route.spec.to

This fails because of the Route defaulting disparity between OCP and MicroShift: https://issues.redhat.com/browse/OCPBUGS-4189
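
To make the validation error concrete, here is a minimal sketch of client-side defaulting that fills in the missing field before a Route is submitted. It assumes the Route is represented as a plain dictionary and that "Service" is the intended default for spec.to.kind. The helper name is hypothetical, and this is not presented as the fix that was actually merged.

```python
import copy


def default_route_target(route):
    """Return a copy of the route with spec.to.kind set if it was missing."""
    out = copy.deepcopy(route)  # leave the caller's object untouched
    target = out.setdefault("spec", {}).setdefault("to", {})
    # Assumption: "Service" is the default kind a Route target should get.
    target.setdefault("kind", "Service")
    return out


if __name__ == "__main__":
    # Hypothetical Route without spec.to.kind, as in the error message.
    route = {
        "apiVersion": "route.openshift.io/v1",
        "kind": "Route",
        "spec": {"to": {"name": "example-service"}},
    }
    print(default_route_target(route)["spec"]["to"])
```

Setting the field explicitly avoids relying on server-side defaulting, which is where the two platforms are reported to differ.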

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

by running this test https://github.com/openshift/origin/blob/f99813c7bd256ecb66f180788902da2f692f6676/test/extended/router/router.go#L71  

Actual results:

 

Expected results:

routes should be created from ingresses

Additional info:

 

Description of problem:

INSIGHTOCP-1048 is a rule that checks whether Monitoring pods are using NFS storage, which is not recommended in OpenShift.

Gathering Persistent Volumes for openshift-monitoring namespace.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Check whether the cluster already contains Persistent Volume Claims in the openshift-monitoring namespace.
2. If there are none, create the following cluster-monitoring-config ConfigMap, which sets up the default Prometheus PVCs.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 1Gi

3. Run the Insights Operator.
4. In the resulting archive, under the folder path /config/persistentvolumes/, there should be one file for each Persistent Volume bound to the PVCs.
5. The name of each file should match the PV name, and the file should contain the resource data.
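
Steps 4 and 5 can be checked with a short script such as the sketch below. It assumes the archive has already been extracted to a local directory and that each file is named after the PV with a .json suffix. The suffix and the function name are assumptions for illustration, since the steps only say that the file name should match the PV name.

```python
from pathlib import Path


def missing_pv_files(archive_root, pv_names, suffix=".json"):
    """Return the PV names with no matching file under config/persistentvolumes."""
    pv_dir = Path(archive_root) / "config" / "persistentvolumes"
    if pv_dir.is_dir():
        # Path.stem drops the final suffix, leaving the PV name.
        present = {p.stem for p in pv_dir.glob("*" + suffix)}
    else:
        present = set()
    return sorted(set(pv_names) - present)


if __name__ == "__main__":
    # Hypothetical extraction directory; the PV name is the one from the example file.
    expected = ["pvc-99ffaeb3-8ff8-4137-a1fc-0bf72e7209a5"]
    print(missing_pv_files("./insights-archive", expected))
```

An empty list means every expected PV has a corresponding file in the archive.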

Actual results:

 

Expected results:

Example of a persistent volume file:
{
    "metadata": {
        "name": "pvc-99ffaeb3-8ff8-4137-a1fc-0bf72e7209a5",
        "uid": "17122aab-411b-4a71-ae35-c13caac23492",
        "resourceVersion": "20098",
        "creationTimestamp": "2023-02-20T14:44:30Z",
        "labels": {
            "topology.kubernetes.io/region": "us-west-2",
            "topology.kubernetes.io/zone": "us-west-2c"
        },
        "annotations": {
            "kubernetes.io/createdby": "aws-ebs-dynamic-provisioner",
            "pv.kubernetes.io/bound-by-controller": "yes",
            "pv.kubernetes.io/provisioned-by": "kubernetes.io/aws-ebs"
        },
        "finalizers": [
            "kubernetes.io/pv-protection"
        ]
    },
    "spec": {
        "capacity": {
            "storage": "20Gi"
        },
        "awsElasticBlockStore": {
            "volumeID": "aws://us-west-2c/vol-07ecf570b7adfedda",
            "fsType": "ext4"
        },
        "accessModes": [
            "ReadWriteOnce"
        ],
        "claimRef": {
            "kind": "PersistentVolumeClaim",
            "namespace": "openshift-monitoring",
            "name": "prometheus-data-prometheus-k8s-1",
            "uid": "99ffaeb3-8ff8-4137-a1fc-0bf72e7209a5",
            "apiVersion": "v1",
            "resourceVersion": "19914"
        },
        "persistentVolumeReclaimPolicy": "Delete",
        "storageClassName": "gp2",
        "volumeMode": "Filesystem",
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "topology.kubernetes.io/region",
                                "operator": "In",
                                "values": [
                                    "us-west-2"
                                ]
                            },
                            {
                                "key": "topology.kubernetes.io/zone",
                                "operator": "In",
                                "values": [
                                    "us-west-2c"
                                ]
                            }
                        ]
                    }
                ]
            }
        }
    },
    "status": {
        "phase": "Bound"
    }
}

Additional info:

 

Description of problem:

Traffic from egress IPs was interrupted after a cluster was patched to OpenShift 4.10.46.

A customer cluster was patched. It is an OpenShift 4.10.46 cluster with SDN.

A fuller description of the issue is available in a private comment below, since it contains customer data.

Description of problem:

When running a Hosted Cluster on HyperShift, the cluster-networking-operator never progressed to Available despite all the components being up and running.

Version-Release number of selected component (if applicable):

quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters
hypershift operator is quay.io/hypershift/hypershift-operator:4.11
4.11.9 management cluster

How reproducible:

Happened once

Steps to Reproduce:

1.
2.
3.

Actual results:

oc get co network reports False availability

Expected results:

oc get co network reports True availability

Additional info:

 

Description of problem:

The IPI installation in some regions fails at bootstrap, with no node available or ready.

Version-Release number of selected component (if applicable):

12-22 16:22:27.970  ./openshift-install 4.12.0-0.nightly-2022-12-21-202045
12-22 16:22:27.970  built from commit 3f9c38a5717c638f952df82349c45c7d6964fcd9
12-22 16:22:27.970  release image registry.ci.openshift.org/ocp/release@sha256:2d910488f25e2638b6d61cda2fb2ca5de06eee5882c0b77e6ed08aa7fe680270
12-22 16:22:27.971  release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. Try the IPI installation in the affected regions (so far tried and failed with ap-southeast-2, ap-south-1, eu-west-1, ap-southeast-6, ap-southeast-3, ap-southeast-5, eu-central-1, cn-shanghai, cn-hangzhou and cn-beijing).

Actual results:

Bootstrap failed to complete

Expected results:

Installation in those regions should succeed.

Additional info:

FYI the QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/166672/

No node is available/ready, and no operator is available.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          30m     Unable to apply 4.12.0-0.nightly-2022-12-21-202045: an unknown error has occurred: MultipleErrors
$ oc get nodes
No resources found
$ oc get machines -n openshift-machine-api -o wide
NAME                         PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
jiwei-1222f-v729x-master-0                                  30m                       
jiwei-1222f-v729x-master-1                                  30m                       
jiwei-1222f-v729x-master-2                                  30m                       
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication
baremetal
cloud-controller-manager                                                                          
cloud-credential                                                                                  
cluster-autoscaler                                                                                
config-operator                                                                                   
console                                                                                           
control-plane-machine-set                                                                         
csi-snapshot-controller                                                                           
dns                                                                                               
etcd                                                                                              
image-registry                                                                                    
ingress                                                                                           
insights                                                                                          
kube-apiserver                                                                                    
kube-controller-manager                                                                           
kube-scheduler                                                                                    
kube-storage-version-migrator                                                                     
machine-api                                                                                       
machine-approver                                                                                  
machine-config                                                                                    
marketplace                                                                                       
monitoring                                                                                        
network                                                                                           
node-tuning                                                                                       
openshift-apiserver                                                                               
openshift-controller-manager                                                                      
openshift-samples                                                                                 
operator-lifecycle-manager                                                                        
operator-lifecycle-manager-catalog                                                                
operator-lifecycle-manager-packageserver
service-ca
storage
$

Master nodes are not running services such as kubelet and crio.
[core@jiwei-1222f-v729x-master-0 ~]$ sudo crictl ps
FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
[core@jiwei-1222f-v729x-master-0 ~]$ 

The machine-config-daemon firstboot log reports "failed to update OS".
[jiwei@jiwei log-bundle-20221222085846]$ grep -Ei 'error|failed' control-plane/10.0.187.123/journals/journal.log 
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:18 localhost ignition[867]: failed to fetch config: resource requires networking
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <info>  [1671726259.0329] hostname: hostname: hostnamed not used as proxy creation failed with: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <warn>  [1671726259.0464] sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain ignition[891]: GET error: Get "https://api-int.jiwei-1222f.alicloud-qe.devcluster.openshift.com:22623/config/master": dial tcp 10.0.187.120:22623: connect: connection refused
...repeated logs omitted...
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-ctl[1888]: 2022-12-22T16:27:46Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-vswitchd[1888]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 dbus-daemon[1669]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1924]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1937]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[2037]: Error: Device '' not found.
Dec 22 08:35:32 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:35:32.477770    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-910221290 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 rpm-ostree[2288]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: W1222 08:56:06.785425    2181 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511: Warning: The unit file, source configuration file or drop-ins of rpm-ostreed.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: error: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:57:31 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:57:31.244684    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-4021566291 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:59:20 jiwei-1222f-v729x-master-0 systemd[2353]: /usr/lib/systemd/user/podman-kube@.service:10: Failed to parse service restart specifier, ignoring: never
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2437]: Error: open default: no such file or directory
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2450]: Error: failed to start API service: accept unixgram @00026: accept4: operation not supported
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman-kube@default.service: Failed with result 'exit-code'.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: Failed to start A template for running K8s workloads via podman-play-kube.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman.service: Failed with result 'exit-code'.
[jiwei@jiwei log-bundle-20221222085846]$ 

 

This is a clone of issue OCPBUGS-8035. The following is the description of the original issue:

Description of problem:

After installing a disconnected private cluster, ssh to the master/bootstrap nodes from the bastion on the VPC fails.

Version-Release number of selected component (if applicable):

Pre-merge build https://github.com/openshift/installer/pull/6836
registry.build05.ci.openshift.org/ci-ln-5g4sj02/release:latest
Tag: 4.13.0-0.ci.test-2023-02-27-033047-ci-ln-5g4sj02-latest

How reproducible:

always

Steps to Reproduce:

1. Create the bastion instance maxu-ibmj-p1-int-svc.
2. Create the VPC on the bastion host.
3. Install a private disconnected cluster on the bastion host with a mirror registry.
4. ssh to the bastion.
5. ssh to the master/bootstrap nodes from the bastion.

Actual results:

[core@maxu-ibmj-p1-int-svc ~]$ ssh -i ~/openshift-qe.pem core@10.241.0.5 -v
OpenSSH_8.8p1, OpenSSL 3.0.5 5 Jul 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.241.0.5 [10.241.0.5] port 22.
debug1: connect to address 10.241.0.5 port 22: Connection timed out
ssh: connect to host 10.241.0.5 port 22: Connection timed out

Expected results:

ssh succeeds.

Additional info:

$ ibmcloud is sg-rules r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 --vpc maxu-ibmj-p1-vpc
Listing rules of security group r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 under account OpenShift-QE as user ServiceId-dff277a9-b608-410a-ad24-c544e59e3778...
ID                                          Direction   IP version   Protocol                      Remote   
r014-6739d68f-6827-41f4-b51a-5da742c353b2   outbound    ipv4         all                           0.0.0.0/0   
r014-06d44c15-d3fd-4a14-96c4-13e96aa6769c   inbound     ipv4         all                           shakiness-perfectly-rundown-take   
r014-25b86956-5370-4925-adaf-89dfca9fb44b   inbound     ipv4         tcp Ports:Min=22,Max=22       0.0.0.0/0   
r014-e18f0f5e-c4e5-44a5-b180-7a84aa59fa97   inbound     ipv4         tcp Ports:Min=3128,Max=3129   0.0.0.0/0   
r014-7e79c4b7-d0bb-4fab-9f5d-d03f6b427d89   inbound     ipv4         icmp Type=8,Code=0            0.0.0.0/0   
r014-03f23b04-c67a-463d-9754-895b8e474e75   inbound     ipv4         tcp Ports:Min=5000,Max=5000   0.0.0.0/0   
r014-8febe8c8-c937-42b6-b352-8ae471749321   inbound     ipv4         tcp Ports:Min=6001,Max=6002   0.0.0.0/0   

This is a clone of issue OCPBUGS-13021. The following is the description of the original issue:

Description of problem:

The APIServer endpoint isn't healthy after a PublicAndPrivate cluster is created. PROGRESS of the cluster is Completed and PROGRESSING is False, nodes are ready, and cluster operators on the guest cluster are Available. The only issue is that the condition of type Available is False because the APIServer endpoint is not healthy.

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters
NAME   VERSION               KUBECONFIG         PROGRESS  AVAILABLE  PROGRESSING  MESSAGE
jz-test  4.14.0-0.nightly-2023-04-30-235516  jz-test-admin-kubeconfig  Completed  False    False     APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com is not healthy

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}'
PublicAndPrivate

jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jz-test
NAME                                                  READY   STATUS    RESTARTS   AGE
aws-cloud-controller-manager-666559d4f-rdsw4          2/2     Running   0          149m
aws-ebs-csi-driver-controller-79fdfb6c76-vb7wr        7/7     Running   0          148m
aws-ebs-csi-driver-operator-7dbd789984-mb9rp          1/1     Running   0          148m
capi-provider-5b7847db9-nlrvz                         2/2     Running   0          151m
catalog-operator-7ccb468d86-7c5j6                     2/2     Running   0          149m
certified-operators-catalog-895787778-5rjb6           1/1     Running   0          149m
cloud-network-config-controller-86698fd7dd-kgzhv      3/3     Running   0          148m
cluster-api-6fd4f86878-hjw59                          1/1     Running   0          151m
cluster-autoscaler-bdd688949-f9xmk                    1/1     Running   0          150m
cluster-image-registry-operator-6f5cb67d88-8svd6      3/3     Running   0          149m
cluster-network-operator-7bc69f75f4-npjfs             1/1     Running   0          149m
cluster-node-tuning-operator-5855b6576b-rckhh         1/1     Running   0          149m
cluster-policy-controller-56d4d6b57c-glx4w            1/1     Running   0          149m
cluster-storage-operator-7cc56c68bb-jd4d2             1/1     Running   0          149m
cluster-version-operator-bd969b677-bh4w4              1/1     Running   0          149m
community-operators-catalog-5c545484d7-hbzb4          1/1     Running   0          149m
control-plane-operator-fc49dcbb4-5ncvf                2/2     Running   0          151m
csi-snapshot-controller-85f7cc9945-n5vgq              1/1     Running   0          149m
csi-snapshot-controller-operator-6597b45897-hqf5p     1/1     Running   0          149m
csi-snapshot-webhook-644d765546-lk9hj                 1/1     Running   0          149m
dns-operator-5b5577d6c7-8dh8d                         1/1     Running   0          149m
etcd-0                                                2/2     Running   0          150m
hosted-cluster-config-operator-5b75ccf55d-6rzch       1/1     Running   0          149m
ignition-server-596fc9d9fb-sb94h                      1/1     Running   0          150m
ingress-operator-6497d476bc-whssz                     3/3     Running   0          149m
konnectivity-agent-6656d8dfd6-h5tcs                   1/1     Running   0          150m
konnectivity-server-5ff9d4b47-stb2m                   1/1     Running   0          150m
kube-apiserver-596fc4bb8b-7kfd8                       3/3     Running   0          150m
kube-controller-manager-6f86bb7fbd-4wtxk              1/1     Running   0          138m
kube-scheduler-bf5876b4b-flk96                        1/1     Running   0          149m
machine-approver-574585d8dd-h5ffh                     1/1     Running   0          150m
multus-admission-controller-67b6f85fbf-bfg4x          2/2     Running   0          148m
oauth-openshift-6b6bfd55fb-8sdq7                      2/2     Running   0          148m
olm-operator-5d97fb977c-sbf6w                         2/2     Running   0          149m
openshift-apiserver-5bb9f99974-2lfp4                  3/3     Running   0          138m
openshift-controller-manager-65666bdf79-g8cf5         1/1     Running   0          149m
openshift-oauth-apiserver-56c8565bb6-6b5cv            2/2     Running   0          149m
openshift-route-controller-manager-775f844dfc-jj2ft   1/1     Running   0          149m
ovnkube-master-0                                      7/7     Running   0          148m
packageserver-6587d9674b-6jwpv                        2/2     Running   0          149m
redhat-marketplace-catalog-5f6d45b457-hdn77           1/1     Running   0          149m
redhat-operators-catalog-7958c4449b-l4hbx             1/1     Running   0          12m
router-5b7899cc97-chs6t                               1/1     Running   0          150m

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                                        STATUS   ROLES    AGE    VERSION
ip-10-0-137-99.us-east-2.compute.internal   Ready    worker   131m   v1.26.2+d2e245f
ip-10-0-140-85.us-east-2.compute.internal   Ready    worker   132m   v1.26.2+d2e245f
ip-10-0-141-46.us-east-2.compute.internal   Ready    worker   131m   v1.26.2+d2e245f
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get co --kubeconfig=hostedcluster.kubeconfig 
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      126m    
csi-snapshot-controller                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
dns                                        4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
image-registry                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      128m    
ingress                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
insights                                   4.14.0-0.nightly-2023-04-30-235516   True        False         False      130m    
kube-apiserver                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-controller-manager                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-scheduler                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-storage-version-migrator              4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
monitoring                                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
network                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
node-tuning                                4.14.0-0.nightly-2023-04-30-235516   True        False         False      131m    
openshift-apiserver                        4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
openshift-controller-manager               4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
openshift-samples                          4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
operator-lifecycle-manager                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
operator-lifecycle-manager-catalog         4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
operator-lifecycle-manager-packageserver   4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
service-ca                                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      130m    
storage                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      131m    
jiezhao-mac:hypershift jiezhao$ 

HC conditions:
==============
  status:
    conditions:
    - lastTransitionTime: "2023-05-01T19:45:49Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidAWSIdentityProvider
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: Cluster version is 4.14.0-0.nightly-2023-04-30-235516
      observedGeneration: 3
      reason: FromClusterVersion
      status: "False"
      type: ClusterVersionProgressing
    - lastTransitionTime: "2023-05-01T19:46:22Z"
      message: Payload loaded version="4.14.0-0.nightly-2023-04-30-235516" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-04-30-235516"
        architecture="amd64"
      observedGeneration: 3
      reason: PayloadLoaded
      status: "True"
      type: ClusterVersionReleaseAccepted
    - lastTransitionTime: "2023-05-01T20:03:14Z"
      message: Condition not found in the CVO.
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ClusterVersionUpgradeable
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: Done applying 4.14.0-0.nightly-2023-04-30-235516
      observedGeneration: 3
      reason: FromClusterVersion
      status: "True"
      type: ClusterVersionAvailable
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: ""
      observedGeneration: 3
      reason: FromClusterVersion
      status: "True"
      type: ClusterVersionSucceeding
    - lastTransitionTime: "2023-05-01T19:47:51Z"
      message: The hosted cluster is not degraded
      observedGeneration: 3
      reason: AsExpected
      status: "False"
      type: Degraded
    - lastTransitionTime: "2023-05-01T19:45:01Z"
      message: ""
      observedGeneration: 3
      reason: QuorumAvailable
      status: "True"
      type: EtcdAvailable
    - lastTransitionTime: "2023-05-01T19:45:38Z"
      message: Kube APIServer deployment is available
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: KubeAPIServerAvailable
    - lastTransitionTime: "2023-05-01T19:44:27Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: InfrastructureReady
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: External DNS is not configured
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ExternalDNSReachable
    - lastTransitionTime: "2023-05-01T19:44:19Z"
      message: Configuration passes validation
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidHostedControlPlaneConfiguration
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: AWS KMS is not configured
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ValidAWSKMSConfig
    - lastTransitionTime: "2023-05-01T19:44:37Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidReleaseInfo
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com
        is not healthy
      observedGeneration: 3
      reason: waitingForAvailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-05-01T19:47:18Z"
      message: All is well
      reason: AWSSuccess
      status: "True"
      type: AWSEndpointAvailable
    - lastTransitionTime: "2023-05-01T19:47:18Z"
      message: All is well
      reason: AWSSuccess
      status: "True"
      type: AWSEndpointServiceAvailable
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: Configuration passes validation
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidConfiguration
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: HostedCluster is supported by operator configuration
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: SupportedHostedCluster
    - lastTransitionTime: "2023-05-01T19:45:39Z"
      message: Ignition server deployment is available
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: IgnitionEndpointAvailable
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: Reconciliation active on resource
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ReconciliationActive
    - lastTransitionTime: "2023-05-01T19:44:12Z"
      message: Release image is valid
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidReleaseImage
    - lastTransitionTime: "2023-05-01T19:44:12Z"
      message: HostedCluster is at expected version
      observedGeneration: 3
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2023-05-01T19:44:13Z"
      message: OIDC configuration is valid
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidOIDCConfiguration
    - lastTransitionTime: "2023-05-01T19:44:13Z"
      message: Reconciliation completed succesfully
      observedGeneration: 3
      reason: ReconciliatonSucceeded
      status: "True"
      type: ReconciliationSucceeded
    - lastTransitionTime: "2023-05-01T19:45:52Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: AWSDefaultSecurityGroupCreated

kube-apiserver log:
==================
E0501 19:45:07.024278       7 memcache.go:238] couldn't get current server API group list: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_config-operator_01_proxy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_quota-openshift_01_clusterresourcequota.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_security-openshift_01_scc.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_securityinternal-openshift_02_rangeallocation.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_apiserver-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_authentication.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_build.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_console.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_dns.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_featuregate.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_image.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagecontentpolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagecontentsourcepolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagedigestmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagetagmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_infrastructure-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_ingress.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_network.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_node.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_oauth.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_project.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_scheduler.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create a PublicAndPrivate cluster

Actual results:

APIServer endpoint is not healthy, and HC condition Type 'Available' is False

Expected results:

APIServer endpoint should be healthy, and Type 'Available' should be True
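
The expected result can be checked against the HostedCluster's `status.conditions` list. The following is a minimal Python sketch, assuming the conditions have already been loaded into a list of dicts; the helper name `get_condition` is hypothetical and the sample data is abridged from the status in this report.

```python
from typing import Dict, List, Optional


def get_condition(conditions: List[Dict[str, str]], cond_type: str) -> Optional[Dict[str, str]]:
    """Return the first condition whose type matches, or None if absent."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


# Two entries abridged from the HostedCluster status shown in this report.
conditions = [
    {"type": "Available", "status": "False", "reason": "waitingForAvailable"},
    {"type": "ValidConfiguration", "status": "True", "reason": "AsExpected"},
]

available = get_condition(conditions, "Available")
is_available = available is not None and available["status"] == "True"
print(is_available)
```

With the status from this report, `is_available` is false; the expected result corresponds to it being true.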

Additional info:

 

Description of problem:

We discovered that we are shipping unnecessary RBAC in https://coreos.slack.com/archives/CC3CZCQHM/p1667571136730989 .

This RBAC was only used in 4.2 and 4.3 for making the switch from configMaps to leases in leader election, and we should remove it.

Version-Release number of selected component (if applicable):

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

when provisioningNetwork is changed from Disabled to Managed/Unmanaged, the ironic-proxy daemonset is not removed

This causes the metal3 pod to be stuck in pending, since both pods are trying to use port 6385 on the host:

0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports. preemption: 0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports

Version-Release number of selected component (if applicable):

4.12rc.4

How reproducible:

Every time for me

Steps to Reproduce:

1. On a multinode cluster, change the provisioningNetwork from Disabled to Unmanaged (I didn't try Managed)
2.
3.

Actual results:

0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports. preemption: 0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports

Expected results:

I believe the ironic-proxy daemonset should be deleted when the provisioningNetwork is set to Managed/Unmanaged

Additional info:

If I manually delete the ironic-proxy Daemonset, the controller does not re-create it.
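
The expected behavior can be stated as a small rule. The sketch below is hypothetical Python, not the operator's actual reconcile code: it assumes, following the expected results in this report, that the ironic-proxy daemonset is only wanted while provisioningNetwork is Disabled. The function names are made up for illustration.

```python
from typing import List


def ironic_proxy_expected(provisioning_network: str) -> bool:
    # Assumption taken from the expected results in this report: the proxy
    # daemonset is only wanted while the provisioning network is Disabled.
    return provisioning_network == "Disabled"


def stale_daemonsets(provisioning_network: str, existing: List[str]) -> List[str]:
    """Return the daemonsets that should be deleted for the given setting."""
    if ironic_proxy_expected(provisioning_network):
        return []
    return [name for name in existing if name == "ironic-proxy"]


print(stale_daemonsets("Unmanaged", ["ironic-proxy"]))
print(stale_daemonsets("Disabled", ["ironic-proxy"]))
```

Under this rule, switching to Managed or Unmanaged marks ironic-proxy for deletion, which would free host port 6385 for the metal3 pod.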

Description of problem:

Etcd's liveness probe should be removed.

Version-Release number of selected component (if applicable):

4.11

Additional info:

When the master hosts hit high CPU load, the failing etcd liveness probes can cause a cascading restart loop for etcd and kube-api. Because of this loop, load on the masters stays high as the API server and controllers restart over and over again.

There is no reason for etcd to have a liveness probe; we removed this probe in 3.11 due to issues like this.

General
Investigate why Cachito failed to build openshift/ose-hypershift:v4.13.0.

Context
ART's brew/OSBS build of OCP image openshift/ose-hypershift:v4.13.0 has failed.

This email is addressed to the owner(s) of this image per ART's build configuration.

Builds may fail for many reasons, some under owner control, some under ART's
control, and some in the domain of other groups. This message is only sent when
the build fails consistently, so it is unlikely this failure will resolve
itself without intervention.

The brew build task https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=50826159
failed with error message:
Exception occurred:
Traceback (most recent call last):
  File "/mnt/workspace/jenkins/working/aos-cd-builds/build%2Focp4/art-tools/doozer/doozerlib/distgit.py", line 1042, in build_container
    task_id, task_url, build_info = asyncio.run(osbs2.build(self.metadata, profile, retries=retries))
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/mnt/workspace/jenkins/working/aos-cd-builds/build%2Focp4/art-tools/doozer/doozerlib/osbs2_builder.py", line 131, in build
    raise OSBS2BuildError(
doozerlib.osbs2_builder.OSBS2BuildError: Giving up after 1 failed attempt(s): Build failed: Fault: <Fault 2001: 'Image build failed. Error in plugin resolve_remote_source: Cachito request is in "failed" state, reason: Processing gomod dependencies failed. Request 646487 (https://cachito.engineering.redhat.com/api/v1/requests/646487/logs) tried to get repo \'https://github.com/openshift-priv/hypershift\' at reference \'6f70b463c6995907c6c0e57ff4fdd7a0bb6e8e37\'.; . OSBS build id: hypershift-rhaos-413-rhel-8-8899142247-20230217042838'>

Unfortunately there were no container build logs; something else about the build failed.

------------------------------------------------------

NOTE: These job links are only available to ART. Please contact us if
you need to see something specific from the logs.

DoD

  1. Understand why Cachito is failing to build openshift/ose-hypershift:v4.13.0.
  2. Possibly add a gate on CI for the same check to fail earlier

This is a clone of issue OCPBUGS-4963. The following is the description of the original issue:

Description of problem:

After further discussion about https://issues.redhat.com/browse/RFE-3383 we have concluded that it needs to be addressed in 4.12 since OVNK will be default there. I'm opening this so we can backport the fix.

The fix for this is simply to alter the logic around enabling nodeip-configuration to handle the VSphere-unique case of platform type == "vsphere" and the VIP field is not populated.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Due to install-config validation, the baremetal platform previously required a number of fields that are actually ignored (OCPBUGS-3278). Similarly, we require values for the following fields in the platform.vsphere section:

  • vCenter
  • username
  • password
  • datacenter
  • defaultDatastore

None of these values are actually used in the agent-based installer at present, and they should not be required.

Users can work around this by specifying dummy values in the platform config (note that the VIP values are required and must be genuine):

platform:
  vsphere:
    apiVIP: 192.168.111.1
    ingressVIP: 192.168.111.2
    vCenter: a
    username: b
    password: c
    datacenter: d
    defaultDatastore: e

The multicluster environment script needs to set the off-cluster managed cluster proxy endpoint in order for local multicluster dev to work.

Description of problem:

Not able to deploy a machine with publicIP: true on an Azure disconnected cluster

Version-Release number of selected component (if applicable):

Cluster version is 4.13.0-0.nightly-2023-02-16-120330

How reproducible:

Always

Steps to Reproduce:

1. Create a machineset with publicIP: true

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-17T09:54:35Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-az17a-vk8wq
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: machineset-36489
  namespace: openshift-machine-api
  resourceVersion: "227215"
  uid: e9213148-0bdf-48f1-84be-1e1a36af43c1
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-az17a-vk8wq
      machine.openshift.io/cluster-api-machineset: machineset-36489
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: huliu-az17a-vk8wq
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: machineset-36489
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/huliu-az17a-vk8wq-rg/providers/Microsoft.Compute/galleries/gallery_huliu_az17a_vk8wq/images/huliu-az17a-vk8wq-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: westus
          managedIdentity: huliu-az17a-vk8wq-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: huliu-az17a-vk8wq-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: true
          publicLoadBalancer: huliu-az17a-vk8wq
          resourceGroup: huliu-az17a-vk8wq-rg
          subnet: huliu-az17a-vk8wq-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4s_v3
          vnet: huliu-az17a-vk8wq-vnet
          zone: ""
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1

The machine is in failed status with the error below:
 Error Message:           failed to reconcile machine "machineset-36489-hhjfc": network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidDomainNameLabel" Message="The domain name label -machineset-36489-hhjfc is invalid. It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$." Details=[]
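
The rejected label can be checked against the regular expression quoted in the error message. This is a minimal Python sketch; the helper name `is_valid_domain_label` is hypothetical, and the second label is only an illustration of a value that would pass, not what the provider generates.

```python
import re

# Pattern copied from the Azure error message in this report.
DOMAIN_LABEL_RE = re.compile(r"^[a-z][a-z0-9-]{1,61}[a-z0-9]$")


def is_valid_domain_label(label: str) -> bool:
    """Return True if the label conforms to the quoted pattern."""
    return DOMAIN_LABEL_RE.match(label) is not None


# The label from the error message starts with a hyphen, so it is rejected.
print(is_valid_domain_label("-machineset-36489-hhjfc"))
# The same label without the leading hyphen conforms to the pattern.
print(is_valid_domain_label("machineset-36489-hhjfc"))
```

The pattern requires the first character to be a lowercase letter, which is why the leading hyphen makes the label invalid.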

 

Actual results:

Machine should be created successfully as publicZone exists in the cluster DNS 
oc edit dns cluster

apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: "2023-02-17T02:26:41Z"
  generation: 1
  name: cluster
  resourceVersion: "529"
  uid: a299c3d8-e8ed-4266-b842-7585d5c0632d
spec:
  baseDomain: huliu-az17a.qe.azure.devcluster.openshift.com
  privateZone:
    id: /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huliu-az17a-vk8wq-rg/providers/Microsoft.Network/privateDnsZones/huliu-az17a.qe.azure.devcluster.openshift.com
  publicZone:
    id: /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com
status: {}

Expected results:

machine should be running successfully 

Additional info:

Must Gather https://drive.google.com/file/d/1cPkFrTh7veO1Ph24GmVAyrs6mI3dmWYR/view?usp=sharing

This is a clone of issue OCPBUGS-8707. The following is the description of the original issue:

Description of problem:

Enabling IPSec doesn't result in IPsec tunnels being created

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Deploy & Enable IPSec

Steps to Reproduce:

1.
2.
3.

Actual results:

000 Total IPsec connections: loaded 0, active 0
000  
000 State Information: DDoS cookies not required, Accepting new IKE connections
000 IKE SAs: total(0), half-open(0), open(0), authenticated(0), anonymous(0)
000 IPsec SAs: total(0), authenticated(0), anonymous(0)

Expected results:

Active connections > 0
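
The expected result can be asserted by parsing the status output quoted under the actual results. A minimal Python sketch, assuming the output has been captured as text; the function name is hypothetical.

```python
import re


def active_ipsec_connections(status_output: str) -> int:
    """Extract the 'active' count from the 'Total IPsec connections' line."""
    match = re.search(
        r"Total IPsec connections: loaded (\d+), active (\d+)", status_output
    )
    if match is None:
        raise ValueError("no 'Total IPsec connections' line found")
    return int(match.group(2))


# Line copied from the actual results in this report.
sample = "000 Total IPsec connections: loaded 0, active 0\n"
print(active_ipsec_connections(sample) > 0)
```

For the output in this report the check is false, which matches the reported failure.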

Additional info:

$ oc -n openshift-ovn-kubernetes -c nbdb rsh ovnkube-master-qw4zv ovn-nbctl --no-leader-only get nb_global . ipsec
true

This is a clone of issue OCPBUGS-7620. The following is the description of the original issue:

Description of problem:
When the user edits a deployment and switches only the rollout "Strategy type", the form cannot be saved because the Save button stays disabled.

Version-Release number of selected component (if applicable):
4.13

How reproducible:
Always

Steps to Reproduce:

  1. Import an application from git
  2. Select action "Edit Deployment"
  3. Change the "Strategy type" value

Actual results:
Save button stays disabled

Expected results:
Save button should enable when changing a value (that doesn't make the form state invalid)

Additional info:

This is a clone of issue OCPBUGS-11946. The following is the description of the original issue:

Description of problem:

Add storage admission plugin "storage.openshift.io/CSIInlineVolumeSecurity"

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP 4.13 cluster
2. Check the kas-config config map

Actual results:

The CM does not include "storage.openshift.io/CSIInlineVolumeSecurity" storage plugin

Expected results:

The plugin should be included

Additional info:

 

This is a clone of issue OCPBUGS-5469. The following is the description of the original issue:

Description of problem:

When changing channels, it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails`, but when the channel is changed to a 4.11 channel, multiple new risks require evaluation, and the evaluation of new risks is throttled at one every 10 minutes. This means that if there are three new risks, it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended, because most users will not wait 30 minutes; they expect immediate feedback.
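
The arithmetic behind the 30-minute figure can be written out as follows. This is a minimal Python sketch of the throttling described above, not code from the cluster-version operator; the function name and parameters are hypothetical.

```python
def worst_case_wait_minutes(new_risks: int, interval_minutes: int = 10) -> int:
    """Upper bound on the wait when one new risk is evaluated per interval."""
    return new_risks * interval_minutes


# Three new risks after a channel change, throttled at one every 10 minutes.
print(worst_case_wait_minutes(3))
```

The bound grows linearly with the number of unevaluated risks, so every extra risk on a path adds up to another full interval before the recommendations settle.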

Version-Release number of selected component (if applicable):

4.10.z, 4.11.z, 4.12, 4.13

How reproducible:

100% 

Steps to Reproduce:

1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
3. 

Actual results:

Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them

Expected results:

Risks are computed in a timely manner for an interactive UX, let's say < 10s

Additional info:

This was intentional in the design: we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack. However, we didn't anticipate that we'd have a long-standing pile of risks, nor did we realize how confusing the user experience would be.

We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always` avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.

Description of problem:

When a HyperShift HostedCluster has endpointAccess: Private, the csi-snapshot-controller is in CrashLoopBackOff because the guest APIServer URL in the admin-kubeconfig isn't reachable in Private mode.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/1656

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:
Currently we are running the VMware CSI Operator in OpenShift 4.10.33. After running vulnerability scans, the operator was discovered to be using a known weak cipher, 3DES. We are attempting to upgrade or modify the operator to customize the available ciphers. We were looking at performing a manual upgrade via Quay.io but can't seem to pull the image, and we are trying to steer away from performing a custom install from scratch. We are looking for any suggestions for mitigating the weak cipher in the kube-rbac-proxy under the VMware CSI Operator.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted.
- We had to scale down the Prometheus operator multiple times so that the upgrade is considered successful.
- This fix is temporary. After some time it appears again and Prometheus operator needs to be scaled down and up again.
- The issue is present on all clusters in this customer environment which are upgraded to 4.10.41.

Version-Release number of selected component (if applicable):

 

How reproducible:

N/A, I wasn't able to reproduce the issue.

Steps to Reproduce:

 

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Clusters created with platform 'vsphere' in the install-config end up as type 'BareMetal' in the infrastructure CR.

Version-Release number of selected component (if applicable):

4.12.3

How reproducible:

100%

Steps to Reproduce:

1. Create a cluster through the agent installer with platform: vsphere in the install-config
2. oc get infrastructure cluster -o jsonpath='{.status.platform}' 

Actual results:

BareMetal

Expected results:

VSphere

Additional info:

The platform type is not being case converted ("vsphere" -> "VSphere") when constructing the AgentClusterInstall CR. When read by the assisted-service client, the platform reads as unknown and therefore the platform field is left blank when the Cluster object is created in the assisted API. Presumably that results in the correct default platform for the topology: None for SNO, BareMetal for everything else, but never VSphere. Since the platform VIPs are passed through a non-platform-specific API in assisted, everything worked but the resulting cluster would have the BareMetal platform.
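
The failure mode described above can be illustrated with a short Python sketch. It is not the installer's or assisted-service's code: the mapping table and function names are hypothetical, and only the platform values named in this report are listed.

```python
from typing import Optional

# Install-config platform names mapped to infrastructure platform types.
# Only the values mentioned in this report are included.
PLATFORM_TYPES = {
    "baremetal": "BareMetal",
    "vsphere": "VSphere",
    "none": "None",
}


def to_platform_type(install_config_name: str) -> Optional[str]:
    """Case-convert an install-config platform name, or None if unknown."""
    return PLATFORM_TYPES.get(install_config_name.lower())


def effective_platform(platform_type: Optional[str], single_node: bool) -> str:
    """Apply the topology default when the platform is blank or unrecognized."""
    if platform_type in PLATFORM_TYPES.values():
        return platform_type
    return "None" if single_node else "BareMetal"


# Passing the unconverted name reproduces the reported result.
print(effective_platform("vsphere", single_node=False))
# Converting the case first yields the expected platform.
print(effective_platform(to_platform_type("vsphere"), single_node=False))
```

The first call falls through to the BareMetal default because "vsphere" is not a recognized platform type; the second call returns VSphere.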

Description of problem:

When creating a pod controller (e.g. a deployment) with a pod spec that will be mutated by SCCs, users might still get a warning about the pod not meeting the given namespace's pod security level.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

100%

Steps to Reproduce:

1. Create a namespace with the restricted PSA warning level (the default)
2. Create a deployment with a pod with an empty security context

Actual results:

You get a warning about the deployment's pod not meeting the NS's pod security admission requirements.

Expected results:

No warning if the pod for the deployment would be properly mutated by SCCs in order to fulfill the NS's pod security requirements.

Additional info:

originally implemented as a part of https://issues.redhat.com/browse/AUTH-337

 

Description of problem:

The current version of openshift/cluster-dns-operator vendors Kubernetes 1.25 packages. OpenShift 4.13 is based on Kubernetes 1.26.   

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/cluster-dns-operator/blob/release-4.13/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/kubectl, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.25

Expected results:

Kubernetes packages are at version v0.26.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.
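
The check in the reproduction step can be automated by scanning go.mod for the four packages named above. This is a minimal Python sketch; the function name and the sample go.mod contents (including the module path and patch versions) are hypothetical and not taken from the repository.

```python
import re
from typing import List

# The Kubernetes packages named in this report.
TRACKED = {"k8s.io/api", "k8s.io/kubectl", "k8s.io/apimachinery", "k8s.io/client-go"}

# Hypothetical go.mod contents used only for illustration.
GO_MOD_SAMPLE = """\
module example.com/hypothetical-operator

require (
    k8s.io/api v0.25.2
    k8s.io/apimachinery v0.25.2
    k8s.io/client-go v0.26.0
)
"""


def outdated_k8s_modules(go_mod_text: str, min_minor: int = 26) -> List[str]:
    """List tracked k8s.io modules pinned below v0.<min_minor>."""
    outdated = []
    pattern = re.compile(r"^\s*(k8s\.io/[\w.\-]+)\s+v0\.(\d+)\.", re.MULTILINE)
    for match in pattern.finditer(go_mod_text):
        module, minor = match.group(1), int(match.group(2))
        if module in TRACKED and minor < min_minor:
            outdated.append(module)
    return outdated


print(outdated_k8s_modules(GO_MOD_SAMPLE))
```

An empty result would correspond to the expected state, with every tracked package at v0.26.0 or later.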

Description of problem:

triggers[].gitlab.secretReference[1] disappears when a 'buildconfig' is edited in ‘Form view’

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Always

Steps to Reproduce:

1. Configure triggers[].gitlab.secretReference[1] as below 

~~~
spec:
 .. 
  triggers:
    - type: ConfigChange
    - type: GitLab
      gitlab:
        secretReference:
          name: m24s40-githook
~~~
2. Open ‘Edit BuildConfig’ for the buildconfig in ‘Form view’:
 - Buildconfigs -> Actions -> Edit Buildconfig

3. Click ‘YAML view’ on top. 

Actual results:

The 'secretReference' configured earlier has disappeared. Clicking the [Reload] button brings the configuration back.

Expected results:

'secretReference' configured in buildconfigs does not disappear.

Additional info:


[1]https://docs.openshift.com/container-platform/4.10/rest_api/workloads_apis/buildconfig-build-openshift-io-v1.html#spec-triggers-gitlab-secretreference

 

Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/276

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

We need to update the operator to be in sync with the Kubernetes API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The word "find" appears twice in the message when kube-apiserver is degraded=true if a webhook matches a virtual resource.

oc get co kube-apiserver -o yaml

status:
  conditions:
  - lastTransitionTime: "2022-01-25T13:45:32Z"
    message: |-
      ValidatingAdmissionWebhookConfigurationDegraded: test.virtual.com: unable to find find service example-service.example-namespace: service "example-service" not found
      VirtualResourceAdmissionDegraded: Validating webhook test.virtual.com matches a virtual resource subjectaccessreviews.authorization.k8s.io/v1
    reason: ValidatingAdmissionWebhookConfiguration_WebhookServiceNotFound::VirtualResourceAdmission_AdmissionWebhookMatchesVirtualResource
    status: "True"
    type: Degraded

This came up a while ago, see https://groups.google.com/u/1/a/redhat.com/g/aos-devel/c/HuOTwtI4a9I/m/nX9mKjeqAAAJ

Basically this MC:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  kernelType: realtime
  osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099 

 

Will degrade the node with

 

E0301 21:25:09.234001    3306 writer.go:200] Marking Degraded due to: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: Could not depsolve transaction; 1 problem detected:
 Problem: package kernel-modules-core-5.14.0-282.el9.x86_64 requires kernel-uname-r = 5.14.0-282.el9.x86_64, but none of the providers can be installed
  - conflicting requests
: exit status 1
 

 

It's kind of annoying here because the packages to remove are now OS version dependent.  A while ago I filed https://github.com/coreos/rpm-ostree/issues/2542 which would push the problem down into rpm-ostree, which is in a better situation to deal with it, and that may be the fix...but it's also pushing the problem down there in a way that's going to be maintenance pain (but, we can deal with that).

 

It's also possible that we may need to explicitly request installation of `kernel-rt-modules-core`...I'll look.

Description of the problem:

On each monitor loop we run ResetAutoAssignRoles and update the same number of hosts every time. If something needs resetting, it should be reset once and then left alone, but currently the reset runs on every loop with the same number of hosts as in the previous one.

time="2022-12-17T03:29:22Z" level=info msg="resetting auto-assign roles on 114 hosts in monitor" func="github.com/openshift/assisted-service/internal/host.(*Manager).resetRoleAssignmentIfNotAllRolesAreSet" file="/assisted-service/internal/host/monitor.go:123" pkg=host-state

time="2022-12-17T03:30:04Z" level=error msg="failed to refresh host 470d5fda-ad2d-f165-6772-d4b8f03de977 state" func="github.com/openshift/assisted-service/internal/host.(*Manager).clusterHostMonitoring" file="/assisted-service/internal/host/monitor.go:164" error="no condition passed to run transition RefreshHost from state disconnected" pkg=host-state request_id=29b3ec18-1ba1-4424-b0d2-5d6ab573b79f

 

How reproducible:

Look at prod logs

Steps to reproduce:

1.

2.

3.

Actual results:

On each monitor loop we reset the hosts' auto-assign roles

Expected results:

Reset once per host
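
The difference between the actual and expected behavior can be sketched as follows. This is hypothetical Python, not the assisted-service implementation: it assumes the monitor can remember which hosts it has already reset, and the function and variable names are made up for illustration.

```python
from typing import List, Set


def monitor_iteration(host_ids: List[str], already_reset: Set[str]) -> int:
    """Reset only hosts not reset before; return how many were reset."""
    pending = [host_id for host_id in host_ids if host_id not in already_reset]
    for host_id in pending:
        # A real implementation would reset the host's role here.
        already_reset.add(host_id)
    return len(pending)


hosts = ["host-a", "host-b", "host-c"]
seen: Set[str] = set()
first = monitor_iteration(hosts, seen)
second = monitor_iteration(hosts, seen)
print(first, second)
```

In this sketch the first loop resets all three hosts and the second resets none, whereas the logs in this report show the same host count being reset on every loop.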

This is a clone of issue OCPBUGS-8220. The following is the description of the original issue:

Description of problem:

[CSI Inline Volume admission plugin] Audit logs and warnings are not recorded correctly when using a deployment/statefulset/daemonset workload with an inline volume

Version-Release number of selected component (if applicable):

4.13.0-0.ci.test-2023-03-02-013814-ci-ln-yd4m4st-latest (nightly build also could be reproduced)

How reproducible:

Always

Steps to Reproduce:

1. Enable feature gate to auto install the csi.sharedresource csi driver

2. Add security.openshift.io/csi-ephemeral-volume-profile: privileged to CSIDriver 'csi.sharedresource.openshift.io'
# scale down the cvo,cso and shared-resource-csi-driver-operator
$ oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version
deployment.apps/cluster-version-operator scaled
$ oc scale --replicas=0 deploy/cluster-storage-operator -n openshift-cluster-storage-operator
deployment.apps/cluster-storage-operator scaled
$ oc scale --replicas=0 deploy/shared-resource-csi-driver-operator -n openshift-cluster-csi-drivers
deployment.apps/shared-resource-csi-driver-operator scaled
# Add security.openshift.io/csi-ephemeral-volume-profile: privileged to CSIDriver
$ oc get csidriver/csi.sharedresource.openshift.io -o yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  annotations:
    csi.openshift.io/managed: "true"
    operator.openshift.io/spec-hash: 4fc61ff54015a7e91e07b93ac8e64f46983a59b4b296344948f72187e3318b33
  creationTimestamp: "2022-10-26T08:10:23Z"
  labels:
    security.openshift.io/csi-ephemeral-volume-profile: privileged

3. Create different workloads with inline volume in a restricted namespace
$ oc apply -f examples/simple 
role.rbac.authorization.k8s.io/shared-resource-my-share-pod created 
rolebinding.rbac.authorization.k8s.io/shared-resource-my-share-pod created
configmap/my-config created
sharedconfigmap.sharedresource.openshift.io/my-share-pod created
Error from server (Forbidden): error when creating "examples/simple/03-pod.yaml": pods "my-csi-app-pod" is forbidden: admission denied: pod my-csi-app-pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged 
Error from server (Forbidden): error when creating "examples/simple/04-deployment.yaml": deployments.apps "mydeployment" is forbidden: admission denied: pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged 
Error from server (Forbidden): error when creating "examples/simple/05-statefulset.yaml": statefulsets.apps "my-sts" is forbidden: admission denied: pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security enforce level that is lower than privileged

4.  Add enforce: privileged label to the test ns and create different workloads with inline volume again 
$ oc label ns/my-csi-app-namespace security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=restricted pod-security.kubernetes.io/warn=restricted --overwrite
namespace/my-csi-app-namespace labeled

$ oc apply -f examples/simple                    
role.rbac.authorization.k8s.io/shared-resource-my-share-pod created
rolebinding.rbac.authorization.k8s.io/shared-resource-my-share-pod created
configmap/my-config created
sharedconfigmap.sharedresource.openshift.io/my-share-pod created
Warning: pod my-csi-app-pod uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security warn level that is lower than privileged
pod/my-csi-app-pod created
Warning: pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security warn level that is lower than privileged
deployment.apps/mydeployment created
daemonset.apps/my-ds created
statefulset.apps/my-sts created

$ oc get po                                               
NAME                            READY   STATUS    RESTARTS   AGE
my-csi-app-pod                  1/1     Running   0          34s
my-ds-cw4k7                     1/1     Running   0          32s
my-ds-sv9vp                     1/1     Running   0          32s
my-ds-v7f9m                     1/1     Running   0          32s
my-sts-0                        1/1     Running   0          31s
mydeployment-664cd95cb4-4s2cd   1/1     Running   0          33s

5. Check the api-server audit logs
$ oc adm node-logs ip-10-0-211-240.us-east-2.compute.internal --path=kube-apiserver/audit.log | grep 'uses an inline volume provided by'| tail -1 | jq . | grep 'CSIInlineVolumeSecurity'
    "storage.openshift.io/CSIInlineVolumeSecurity": "pod  uses an inline volume provided by CSIDriver csi.sharedresource.openshift.io and namespace my-csi-app-namespace has a pod security audit level that is lower than privileged"

Actual results:

In steps 3 and 4: for deployment workloads, the pod name in the warning is empty;
for statefulset/daemonset workloads, no warning is displayed.
In step 5: the pod name in the audit log is empty.

Expected results:

In steps 3 and 4: for deployment workloads, the warning should include the pod name;
for statefulset/daemonset workloads, the warning should be displayed.
In step 5: the pod name in the audit log should not be empty; the log should record the workload type and the specific pod name.
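
One way to read the expected behaviour is that the message should name the owning workload whenever the pod template has no name of its own. The Python sketch below only illustrates that idea; `describe_subject` and `build_message` are hypothetical helpers, not functions of the admission plugin, and the workload wording is an assumption. The rest of the message text is copied from the output in steps 3 to 5.

```python
def describe_subject(kind: str, name: str) -> str:
    """Return a non-empty subject for the warning or audit message.

    Hypothetical helper: a bare Pod is identified by its own name, while a
    pod template inside a workload (which has no pod name yet) is identified
    by the workload kind and name.
    """
    if kind == "Pod":
        return f"pod {name}"
    return f"pod template of {kind} {name}"


def build_message(subject: str, driver: str, namespace: str, level: str) -> str:
    # Wording mirrors the messages shown in the steps; only the subject varies.
    return (
        f"{subject} uses an inline volume provided by CSIDriver {driver} "
        f"and namespace {namespace} has a pod security {level} level "
        f"that is lower than privileged"
    )


if __name__ == "__main__":
    for kind, name in [("Pod", "my-csi-app-pod"), ("Deployment", "mydeployment")]:
        print(build_message(describe_subject(kind, name),
                            "csi.sharedresource.openshift.io",
                            "my-csi-app-namespace", "warn"))
```

With a subject built this way, the string "pod  uses" (with an empty name) can no longer appear for deployment, statefulset, or daemonset workloads.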

Additional info:

Testdata:
https://github.com/Phaow/csi-driver-shared-resource/tree/test-inlinevolume/examples/simple

Implement a config client shim that intercepts requests from a client and replaces requests for static manifests with read-only operations over those manifests. The replacement is a no-op for objects outside the static manifest directory. If a mutation request targets a static manifest, an error is returned.

The static manifest directory is specified through the STATIC_CONFIG_MANIFEST_DIR environment variable. Currently only manifests from [infrastructure|network].config.openshift.io/v1 are supported as static manifests.
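
The behaviour described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual shim: the `get`/`update` client interface, the `StaticManifestShim` and `ReadOnlyManifestError` names, and the one-JSON-object-per-file layout are all hypothetical. Only the STATIC_CONFIG_MANIFEST_DIR variable name comes from the description.

```python
import json
import os
from pathlib import Path


class ReadOnlyManifestError(Exception):
    """Raised when a mutation targets an object backed by a static manifest."""


class StaticManifestShim:
    """Wrap a client and answer requests for static manifests from disk.

    `inner` is any object exposing get(kind, name) and update(kind, name, obj).
    That interface and the JSON file layout are assumptions of this sketch.
    """

    def __init__(self, inner, manifest_dir=None):
        self.inner = inner
        directory = manifest_dir or os.environ.get("STATIC_CONFIG_MANIFEST_DIR", "")
        self.static = {}
        if directory:
            # Load every manifest once; each file holds a single object.
            for path in Path(directory).glob("*.json"):
                obj = json.loads(path.read_text())
                self.static[(obj["kind"], obj["metadata"]["name"])] = obj

    def get(self, kind, name):
        key = (kind, name)
        if key in self.static:
            return self.static[key]  # served read-only from the manifest
        return self.inner.get(kind, name)  # untouched pass-through

    def update(self, kind, name, obj):
        if (kind, name) in self.static:
            raise ReadOnlyManifestError(f"{kind}/{name} is a static manifest")
        return self.inner.update(kind, name, obj)
```

Objects that are not present in the directory never see the shim: both reads and writes go straight to the wrapped client.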

Description of problem:

When editing any pipeline in the OpenShift console, the current content is not shown; the YAML view displays the initial (pipeline creation) content instead.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel ->  Actions -> Edit Pipeline -> YAML view 

Actual results:

The displayed content is incorrect.

Expected results:

Get the content of the current pipeline, not the "pipeline create" content.

Additional info:

If you cancel or save in the "Pipeline Builder" view after "Edit Pipeline", the expected content is shown:
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel -> Actions -> Edit Pipeline -> YAML view: the resource content is displayed normally.

The log appears to be full of this error:

time="2022-12-28T19:18:42Z" level=error msg="failed to list BMHs, error no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\"" func="github.com/openshift/assisted-installer/src/k8s_client.(*k8sClient).ListBMHs" file="/remote-source/app/src/k8s_client/k8s_client.go:495"
time="2022-12-28T19:18:42Z" level=error msg="Failed to list BMH hosts" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.UpdateBMHs.func2 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:665" error="no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\""
I1228 19:19:11.527245       1 request.go:601] Waited for 1.046125539s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/samples.operator.openshift.io/v1?timeout=32s
time="2022-12-28T19:19:12Z" level=error msg="failed to list BMHs, error no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\"" func="github.com/openshift/assisted-installer/src/k8s_client.(*k8sClient).ListBMHs" file="/remote-source/app/src/k8s_client/k8s_client.go:495"
time="2022-12-28T19:19:12Z" level=error msg="Failed to list BMH hosts" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.UpdateBMHs.func2 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:665" error="no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\""
time="2022-12-28T19:19:40Z" level=info msg="Console is disabled, will not wait for the console operator to be available" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.waitingForClusterOperators.func1 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:989"
time="2022-12-28T19:19:40Z" level=info msg="Checking <cvo> operator availability status" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.isOperatorAvailable file="/remote-source/app/src/assisted_installer_controller/operator_handler.go:33"
I1228 19:19:41.528259       1 request.go:601] Waited for 1.047375699s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/cloudcredential.openshift.io/v1?timeout=32s
time="2022-12-28T19:19:42Z" level=error msg="failed to list BMHs, error no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\"" func="github.com/openshift/assisted-installer/src/k8s_client.(*k8sClient).ListBMHs" file="/remote-source/app/src/k8s_client/k8s_client.go:495"
time="2022-12-28T19:19:42Z" level=error msg="Failed to list BMH hosts" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.UpdateBMHs.func2 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:665" error="no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\""
I1228 19:20:11.528796       1 request.go:601] Waited for 1.047656088s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
time="2022-12-28T19:20:12Z" level=error msg="failed to list BMHs, error no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\"" func="github.com/openshift/assisted-installer/src/k8s_client.(*k8sClient).ListBMHs" file="/remote-source/app/src/k8s_client/k8s_client.go:495"
time="2022-12-28T19:20:12Z" level=error msg="Failed to list BMH hosts" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.UpdateBMHs.func2 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:665" error="no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\""
time="2022-12-28T19:20:40Z" level=info msg="Console is disabled, will not wait for the console operator to be available" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.waitingForClusterOperators.func1 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:989"
time="2022-12-28T19:20:40Z" level=info msg="Checking <cvo> operator availability status" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.isOperatorAvailable file="/remote-source/app/src/assisted_installer_controller/operator_handler.go:33"
time="2022-12-28T19:20:40Z" level=info msg="Operator <cvo> updated, status: progressing -> progressing, message: Working towards 4.12.0-rc.6: 365 of 827 done (44% complete) -> Working towards 4.12.0-rc.6: 570 of 827 done (68% complete)." func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.isOperatorAvailable file="/remote-source/app/src/assisted_installer_controller/operator_handler.go:47"
I1228 19:20:41.528821       1 request.go:601] Waited for 1.048071706s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v1alpha1?timeout=32s
time="2022-12-28T19:20:42Z" level=error msg="failed to list BMHs, error no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\"" func="github.com/openshift/assisted-installer/src/k8s_client.(*k8sClient).ListBMHs" file="/remote-source/app/src/k8s_client/k8s_client.go:495"
time="2022-12-28T19:20:42Z" level=error msg="Failed to list BMH hosts" func=github.com/openshift/assisted-installer/src/assisted_installer_controller.controller.UpdateBMHs.func2 file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:665" error="no matches for kind \"BareMetalHost\" in version \"metal3.io/v1alpha1\""
I1228 19:21:11.529143       1 request.go:601] Waited for 1.047127898s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/samples.operator.openshift.io/v1?timeout=32s
 

Installing OKD via Assisted Installer currently requires an additional configuration option, `OKD_RPMS`. This image was previously built manually and uploaded to quay.

It would be useful to include it in the payload and teach Assisted Service to extract it automatically, so that this configuration change would not be required. As a result, the same Assisted Installer can be used to install both OCP and OKD versions. Implementing this would also simplify agent-based cluster 0 installation.

Description of problem:

It looks like the ODC doesn't register the KNATIVE_SERVING and KNATIVE_EVENTING flags. These flags are based on the KnativeServing and KnativeEventing CRs, but the console looks for the v1alpha1 version of them: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
The PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, which breaks the ODC discovery.
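
The mismatch can be illustrated with a small Python sketch. It is not the console's detection code: `flag_enabled` and the `served` mapping are hypothetical, the `operator.knative.dev` group name is an assumption not stated in this report, and the example assumes that only v1beta1 is discoverable after the change.

```python
def flag_enabled(served, group, kind, accepted_versions):
    """Return True if `kind` in `group` is served in any accepted version.

    `served` maps (group, kind) to the set of versions the cluster serves;
    this data shape is an assumption made for the sketch.
    """
    return bool(served.get((group, kind), set()) & set(accepted_versions))


# Assumed cluster state after the serverless-operator change.
served = {
    ("operator.knative.dev", "KnativeServing"): {"v1beta1"},
    ("operator.knative.dev", "KnativeEventing"): {"v1beta1"},
}

if __name__ == "__main__":
    # A lookup pinned to v1alpha1 no longer matches, so the flag stays unset.
    print(flag_enabled(served, "operator.knative.dev", "KnativeServing", ["v1alpha1"]))
    # Accepting both versions matches again.
    print(flag_enabled(served, "operator.knative.dev", "KnativeServing",
                       ["v1alpha1", "v1beta1"]))
```

Accepting more than one version keeps discovery working across the migration, at the cost of having to retire the old version from the list later.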

Version-Release number of selected component (if applicable):

OpenShift 4.8, Serverless Operator 1.27

Additional info:

https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019

 

This is a clone of issue OCPBUGS-8447. The following is the description of the original issue:

Description of problem:

The MCO must have compatibility in place one OCP version in advance if we want to bump the Ignition spec version; otherwise, downgrades will fail.

This is NOT needed in 4.14, only 4.13

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. None at the moment; this is a preventative change for the future.

Actual results:

N/A

Expected results:

N/A

Additional info:

 

Description of problem:

$ git --no-pager grep MachineConfigControllerPausedPoolKubeletCA.md origin/release-4.11
origin/release-4.11:install/0000_90_machine-config-operator_01_prometheus-rules.yaml:            runbook_url: https://github.com/openshift/blob/master/alerts/machine-config-operator/MachineConfigControllerPausedPoolKubeletCA.md
origin/release-4.11:install/0000_90_machine-config-operator_01_prometheus-rules.yaml:            runbook_url: https://github.com/openshift/blob/master/alerts/machine-config-operator/MachineConfigControllerPausedPoolKubeletCA.md
$ git --no-pager grep MachineConfigControllerPausedPoolKubeletCA.md origin/release-4.10
...no hits...

But that URI is broken. It should be https://github.com/openshift/runbooks/blob/master/alerts/machine-config-operator/MachineConfigControllerPausedPoolKubeletCA.md (with an additional runbooks/).
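
The correction amounts to a prefix rewrite, sketched below in Python. `fix_runbook_url` is a hypothetical helper written for illustration; the two prefixes are taken from the broken and corrected URIs quoted above.

```python
BROKEN_PREFIX = "https://github.com/openshift/blob/"
FIXED_PREFIX = "https://github.com/openshift/runbooks/blob/"


def fix_runbook_url(url: str) -> str:
    """Insert the missing `runbooks/` path segment; leave other URLs alone."""
    if url.startswith(BROKEN_PREFIX):
        return FIXED_PREFIX + url[len(BROKEN_PREFIX):]
    return url


if __name__ == "__main__":
    print(fix_runbook_url(
        "https://github.com/openshift/blob/master/alerts/"
        "machine-config-operator/MachineConfigControllerPausedPoolKubeletCA.md"
    ))
```

The function is idempotent: a URL that already contains `runbooks/` does not match the broken prefix and is returned unchanged.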

Version-Release number of selected component (if applicable):

4.11 and later, per the grep above.

How reproducible:

100%

Steps to Reproduce:

1. Follow the runbook URI for MachineConfigControllerPausedPoolKubeletCA alerts.

Actual results:

404.

Expected results:

A page walking cluster admins through a response to the alert.

Description of problem:

When trying to delete an unmanaged BMH object, Metal3 cannot complete the deletion. The BMH object is unmanaged because it does not provide BMC information (neither address nor credentials).

In this case, Metal3 attempts the deletion but fails and never finalizes it, so the BMH deletion gets stuck.
This is the log from Metal3:

{"level":"info","ts":1676531586.4898946,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.4980938,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5050912,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5105371,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.51569,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                            
{"level":"info","ts":1676531586.5191178,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.525755,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                 
{"level":"info","ts":1676531586.5356712,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676532186.5117555,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5195107,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.526355,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                           
{"level":"info","ts":1676532186.5317476,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5361836,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5404322,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5482726,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.555394,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532532.3448665,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532532.344922,"logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}
{"level":"info","ts":1676532532.3656478,"logger":"controllers.BareMetalHost","msg":"Initiating host deletion","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged"}
{"level":"error","ts":1676532532.3656952,"msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","bareMetalHost":{"name":"worker-1.el8k-ztp-1.hpecloud.org","namespace":"openshift-machine-api"},
"namespace":"openshift-machine-api","name":"worker-1.el8k-ztp-1.hpecloud.org","reconcileID":"525a5b7d-077d-4d1e-a618-33d6041feb33","error":"action \"unmanaged\" failed: failed to determine current provisioner capacity: failed to parse BMC address informa
tion: missing BMC address","errorVerbose":"missing BMC address\ngithub.com/metal3-io/baremetal-operator/pkg/hardwareutils/bmc.NewAccessDetails\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/metal3-io/baremetal-operator/pkg/hardwareu
tils/bmc/access.go:145\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:112\ngithub.com/metal3-io/baremetal-operator/pkg/pro
visioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/githu
b.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/meta
l3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal
3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareM
etalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremet
al-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/contr
oller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/contro
ller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\
n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to parse BMC address information\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/iro
nic/ironic.go:114\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controlle
rs/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n
\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator
/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithu
b.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controll
er.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/sr
c/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-
operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-
runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to determine current provisioner capacity\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensur
eCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:85\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal
-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machin
e.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/contr
ollers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/gi
thub.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operato
r/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-r
untime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controll
er.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\naction \"unmanaged\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operato
r/controllers/metal3.io/baremetalhost_controller.go:230\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/contr
oller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller
-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.
(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/cont
roller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/contr
oller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Provide a BMH object with no BMC credentials. The BMH is set unmanaged.

Steps to Reproduce:

1. Delete the BMH object.
2. The deletion gets stuck.

Actual results:

The deletion gets stuck.

Expected results:

Metal3 detects that the BMH is unmanaged and does not attempt deprovisioning.
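
The expected decision can be sketched in Python as follows. This is not the baremetal-operator state machine: the `Host` type, the `next_action` function, and the action names are hypothetical and only illustrate skipping deprovisioning (and the capacity check that fails in the log above) for an unmanaged host.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    provisioning_state: str
    bmc_address: str = ""
    deletion_requested: bool = False


def next_action(host: Host) -> str:
    """Pick the next reconcile step for a host (illustrative only)."""
    if not host.deletion_requested:
        return "reconcile"
    if host.provisioning_state == "unmanaged":
        # Nothing was provisioned through a BMC, so there is nothing to
        # deprovision and no provisioner capacity to check.
        return "remove-finalizer"
    return "deprovision"


if __name__ == "__main__":
    stuck = Host("worker-1", "unmanaged", deletion_requested=True)
    print(next_action(stuck))
```

Checking the provisioning state before any step that needs BMC access keeps a missing BMC address from blocking the finalizer.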

Additional info:

 

Description of problem:

It is possible to change some of the fields in default catalogSource specs and the Marketplace Operator will not revert the changes 

Version-Release number of selected component (if applicable):

4.13.0 and back

How reproducible:

Always

Steps to Reproduce:

1. Create a 4.13.0 OpenShift cluster
2. Set the redhat-operator catalogSource.spec.grpcPodConfig.SecurityContextConfig field to `legacy`.

Actual results:

The field remains set to `legacy` mode.

Expected results:

The field is reverted to `restricted` mode.

Additional info:
This code needs to be updated to account for new fields in the catalogSource spec.
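
The gap can be illustrated with a Python sketch that contrasts a field-by-field check with a whole-spec comparison. It is not the Marketplace Operator's code: the checked-field list, the function names, the `securityContextConfig` key casing, and the placeholder image are assumptions made for illustration.

```python
import copy

# Fields a hand-written comparison might check (hypothetical list).
LEGACY_CHECKED_FIELDS = ("sourceType", "image")


def drifted_legacy(default_spec, live_spec):
    """Field-by-field check: misses any field not listed above."""
    return any(default_spec.get(f) != live_spec.get(f) for f in LEGACY_CHECKED_FIELDS)


def drifted_full(default_spec, live_spec):
    """Whole-spec check: also catches fields added to the spec later."""
    return default_spec != live_spec


def reconcile(default_spec, live_spec):
    """Return the spec to apply: the default whenever the live spec drifted."""
    if drifted_full(default_spec, live_spec):
        return copy.deepcopy(default_spec)
    return live_spec


default = {
    "sourceType": "grpc",
    "image": "registry.example/catalog:v4.13",  # placeholder, not a real image
    "grpcPodConfig": {"securityContextConfig": "restricted"},
}
live = copy.deepcopy(default)
live["grpcPodConfig"]["securityContextConfig"] = "legacy"

if __name__ == "__main__":
    print(drifted_legacy(default, live), drifted_full(default, live))
```

A whole-spec comparison would also undo values that the API server fills in by default, so a real implementation would likely need to normalise both sides before comparing.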

 

 

 

Description of the problem:

This validation fails when adding nodes to day-2 clusters (HyperShift in this case):

        {
          "id": "service-has-sufficient-spoke-kube-api-access",
          "message": "Could not create the spoke k8s client connection using kubeconfig: could not load kubeconfig from internal storage with cluster id 2dfb8854-6cdd-48fb-8a2e-79d7a26d4c8c and filename kubeconfig: object 2dfb8854-6cdd-48fb-8a2e-79d7a26d4c8c/kubeconfig was not found",
          "status": "error"
        } 

 

There's a reference to the kubeconfig on the ClusterDeployment (CD):

oc get clusterdeployment -n clusters-test-infra-cluster-7776174e   test-infra-cluster-7776174e -ojsonpath='{.spec.clusterMetadata.adminKubeconfigSecretRef}'
{"name":"admin-kubeconfig"} 

The kubeconfig exists:

[root@edge-22 assisted-test-infra]# oc get secret -n clusters-test-infra-cluster-7776174e admin-kubeconfig -ojsonpath='{.data.*}' | base64 -dapiVersion: v1clusters:- cluster:    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJUUwxR29vcDg0eVl3RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl5TVRFd01qRTBNREUxTWxvWApEVE15TVRBek1ERTBNREUxTWxvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUF3bTN1UitMNGNLbHMKUWNEeGR2RXFSRVNDdWFuOWVTNDNoZUtUZmUyMldDdVVVMHZNMnFuRVlDeXdUSEJ6bTJiY20yRXZjU05ZaExFQQoxK2NrdTZ1bi9yQzF4bzFQMlBSUnFUSURoZUF5cGk2bE1acXJEN1JRLzlqaDMzZk5IU21HRDREODlUdTJTVFFaClJMQ1M4L0I5Z0J2U0F2V2NIWkc0Mkt4ZGVFTlBqNEFad2o0ZFl6NFh0M3AyRVkyMHgrM25Gam40OEgzT0trVmMKLzJPeC9EbVgrV3R1YnNwUkNHWktjbWova29vS0Z2MzZIQ2V1dWtrWUFoUXNUZDRzZ3VacjZjR2U5WGVidlRQWgpYNk4zZXE3QlV6M2V1TGc1MGdoTmVLN3FHbko4c2EvZllkYUV4MStZNlNsbkhlSk5TMUV2ZWVZMldTMFhVT3RyCnBHVUpERnRnMlFJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVVJFakNtdTBiTjVETkVCUUFTREZaeHYwQ0F2VXdEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUVpeVJFU1U0VkdndGtwVVY2Yi9tNXNZamxCcHNXaWo0cU0yT2dsMEd2RVFCbzJwUHlsN1VmaS90ZE1HCkV5OWN3N1RGUTVIbSsvaGpMak9GVCtzUDB3WlZzbXRLdWtoT2ZIWGZjb1NPbUszbXJTV1QwckVsZlRYR21VbE0KbnpBVjUyTlNjZllPSGo1Z2NvTEtscldVeXhTKzFJOGV4Umx4RVY3N01aSDRYZmxERlY0T2g3eE5hMjlFN25xYgpjaW1yRFZuRzh0SURXZVdBWEpNWnBaL3c1YndRQVJZQ0VMYVhPemNuaGkrdjVWak5OaUNLMitzcmQ3b29YNTExCkZpMHFIWGpJbWRuMmVYUm1NYitHUDNjQWU2T0FZcnRrZmdFcnJpb0t1TU1RVDhxangwbXNITkFGWHN1anhic1IKd3BzelQvdmpXaldxVUpjWi9xMXpYNGNIOUZnPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==    server: https://192.168.39.6:30356  name: clustercontexts:- context:    cluster: cluster    namespace: default    user: admin  name: admincurrent-context: adminkind: Configpreferences: {}users:- name: admin  user:    client-certificate-data: 
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURWekNDQWorZ0F3SUJBZ0lJQ0crMkVCNHpTNmd3RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl5TVRFd01qRTBNREUxTWxvWApEVEl6TVRFd01qRTBNREUxTkZvd01ERVhNQlVHQTFVRUNoTU9jM2x6ZEdWdE9tMWhjM1JsY25NeEZUQVRCZ05WCkJBTVRESE41YzNSbGJUcGhaRzFwYmpDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUIKQU5MQnRBNG1heHB5SnVZM3BWWWhpc3VlUXlMbjN0WE9SdzNkTHBPSHM3SGN4ZTM5MFhUVXFybCtUSyt5dnFycgpwbmlOYnhZNlFvM3grQ0hhTnFCbmJ6b0RYU1RKa09MejAzSVBsTnJnYVFZTkUydTh5cmgyMmpoQjhyOGdyc1BNCnpuTE92VEtSQ0hIL0I0TEJOclFxdGI5c3NPT1NyWTNyalFWa2pWN3pCclpwanlPN0dMdGpFakhBZmZMMDh4OFIKMHJXQm9mMzd1N0xZSkljNHhEWnd0andNWlorZkFnSUV1Y1hqb0Y2U0g3RGtIOWRuSWJ1YkQ1MGJMNUo5d3R5RgpGanRPRUFnSU13MDZVZkhuWDcvTVRrRFA2QzQ5YXFRSDlaVGdDVlpsQ0M1cE9oQ2czblgvY1BqdVBzT09sbVBZCnF0TmJmRmdJWnFwVlVmUEh1cGxhY1pjQ0F3RUFBYU4vTUgwd0RnWURWUjBQQVFIL0JBUURBZ1dnTUIwR0ExVWQKSlFRV01CUUdDQ3NHQVFVRkJ3TUNCZ2dyQmdFRkJRY0RBVEFNQmdOVkhSTUJBZjhFQWpBQU1CMEdBMVVkRGdRVwpCQlJFU01LYTdSczNrTTBRRkFCSU1WbkcvUUlDOVRBZkJnTlZIU01FR0RBV2dCUkVTTUthN1JzM2tNMFFGQUJJCk1WbkcvUUlDOVRBTkJna3Foa2lHOXcwQkFRc0ZBQU9DQVFFQU5iRDhlRHNENVZtT1UvdnViV3FqVXlhT3J5M0YKUGFFT1lIelBERjZvdHR0UUl2bjNxbkpSMXVOMWdUS3RoS3M4NDZJU1dmK0lLeG1RZjRLL3J1U1VHQmZjeEROcApHNlVsWVpzK1F1MUFsQkJMTk5KbXE3U1hNR1BuOGRVKzZneVNuWW9LWm1IRW5WdUdraVprU1l0MmxyalVqa1hYCm0xTEpVUDhNUXhuRnBXalRZWUI0bkNKTzNPcDhjQ1RwME4xZTRDUTNmZUJtbzh2MHB6cFJmUWZBV3JvcTg3SWgKc1JFRUY4QmVaVk5YNDk5WWMxUHpBM3g5blp2RlRBQXFTblBybDhCUzFGTjEyUGdIanVVMVBNajdxb2pvRkg0dgpvaWxIKzdLMHBCN0MwenNZakg5bzdoQjlDVkpxVDVIOEp6WTg1MHlUZTlOQWNoUDB2R1BQVzVVMmRnPT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=    client-key-data: 
LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS0NBUUVBMHNHMERpWnJHbkltNWplbFZpR0t5NTVESXVmZTFjNUhEZDB1azRlenNkekY3ZjNSCmROU3F1WDVNcjdLK3F1dW1lSTF2RmpwQ2pmSDRJZG8yb0dkdk9nTmRKTW1RNHZQVGNnK1UydUJwQmcwVGE3eksKdUhiYU9FSHl2eUN1dzh6T2NzNjlNcEVJY2Y4SGdzRTJ0Q3ExdjJ5dzQ1S3RqZXVOQldTTlh2TUd0bW1QSTdzWQp1Mk1TTWNCOTh2VHpIeEhTdFlHaC9mdTdzdGdraHpqRU5uQzJQQXhsbjU4Q0FnUzV4ZU9nWHBJZnNPUWYxMmNoCnU1c1BuUnN2a24zQzNJVVdPMDRRQ0FnekRUcFI4ZWRmdjh4T1FNL29MajFxcEFmMWxPQUpWbVVJTG1rNkVLRGUKZGY5dytPNCt3NDZXWTlpcTAxdDhXQWhtcWxWUjg4ZTZtVnB4bHdJREFRQUJBb0lCQUcxQzdzM0hMUTl3enFuYgpmMk8veit6d0IyNDVOMVV3czdXRVRYaytpUEpVdW1nL2hpOURjWjdvMDJqakNlWWlkUk5hZjVUT2IyS1haMFJsCmxKeGtBMDNZSUpuSnhjdGpET085SURhNDBMbktYWjhsS1JPb3lra1FKNERldUx2Wm1jMzdVQ3ErOWRuamxVazgKVWRmbHJJT3BIYXRkaDR4ajZhQTZHUEI0bmFwQzdYem03cStiaTNyd3ZIdEYxdUZrejNsYmRsWkNyU01VRDV3eQpubGg5TDEwbFJJbWdiWjRlUkxkdUdRLy83R0UxVXdmNWFWc2gxTys3N3Z1WU1KY2ZORUdCTDBSVWpKL0dHSUVxCm1OT0V0ejFQb2FOcW94MXpDKzNiV3J4MjF4eW01cXNXK20weklJZWZ6alAyaXdMcVJCeUJDN3plN2VCeHgvME8KM2ZNczFvRUNnWUVBLzJiTEdqTUNIa1c2Qll3SWwyZTRoL3VkeGgrQk02YVpNWnc4Q1JhZmx6dTkxY0xTcWpZSQplTFdlKzJwWXI5WGdYcmFGeHJTRjVGa3pJbm1hNTVZSWtmbW9qaDJ6eURFaXZidVpuT0s1czltaE9CMURpYS8xCm96RUZnQXk2WTluYTJaaXlsc3JLdkYxQnpaNTUrSVVEcmpnNUd1OVdYa2FZWmx6bGJsWkN1Q2tDZ1lFQTAwQWgKQ0l6REgwWHRsSFlZaHBSY0hsQ3U0SDlhUWRkUUc1cC80RnBDZ1R5TXdZb3VIb3MzdWg1cGg5cW92TE9GRkNiQgpDV0NTUHJrajZxNmZYOFZ1M2s1aWFaVHlsRXhlaGh6WDE3dmd5dmpEQ2hrenNqekh6dHRqMkVETE9hd25vMndTCjR5ZDArcEhuODl2dGdRdnd1TTE4MnhDNjMzYzExWmpDM3Rlb0U3OENnWUFVYktzSFlGYm1CdHQyZ0JsYnlrNm0KaVVlM3hXTTJ0emIvRWFoM2JaaTdwbXByQXRhSUNDUXJTeEw0dGl0N2ZGWWlIT1NiM0duc3RmbHg0MW13OVgyZgo5dUEwNVVrd1ZFV2IrTG16SXlxSXFIbk5IQUgvcTlPd0JrYVRVL0UvOVBjY2VhcW1obmNxRXljbEEyeHJwRytECjZqa2UzMDcvNFJOazlEN2cwUU1xNlFLQmdRQ0p5aXVCd3liR2NEc3QrZlczV20vWGlqTDIxYnFPZFoyWDA2ZVQKTSt4ckZZNk44czV3TjhocWlzbTB4a2dIaFdUSkp4b0VQc3hGUTBlTkhNZHhsWHJpWCtoTEM4OUtNYUg2QWpnNwpUQjJzNXFONUk4VVhmaE9wOW1uaXRTaVpmcFFBUVU3MGdWa0kwMENqVEJGWGVlMVM3UjJDV2lBNkFDektITEVHCjMwMlBTd0tCZ1FENDFxSURtM3hhL29zZlI0ZXdZOHl4Q3A3eHVn
WENzcVlmcFpHL3g1SVg0R1orazVZV3hrbHYKQnp4aHVYVS83MitRZVBKY3BvN1owdTB0NW10eE9qYXBsNW5PVmtPa0ZsMnJDV0JRR3NBemxhWUM3Z09QSUhJMQp0TWUwZjZoVnc0a2pvRFBuS3VPTmErZUtheGZReHJNbkthV3NRMzFZMWxKdnlsRUZqUStQWWc9PQotLS0tLUVORCBSU0EgUFJJVkFURSBLRVktLS0tLQo=
 

 

How reproducible:

100%  if the cluster wasn't created by assisted

Steps to reproduce:

1. Create a hypershift cluster with agent platform

2. Add worker to it.

3. This is a soft validation (at least it seems to be since the agent installation will succeed) 

Actual results:

The validation is failing

Expected results:

Expected the validation to pass since there's a reference to the cluster kubeconfig on the ACI.

Description of the problem:

If an agent upgrade is requested and for some reason fails or isn't executed properly, an event is created every 3 minutes or so.

How reproducible:

In this triage ticket https://issues.redhat.com/browse/AITRIAGE-4203

Steps to reproduce:

1. Have an old agent

2. Fail to pull new agent

Actual results:

Tons of events

Expected results:

Some kind of backoff or maybe just not sending a new event each time.
Maybe this should be a validation (agent upgrade is failing) rather than a new event each time the agent attempts to upgrade.

There are lots of possible solutions, but this ticket had over 100 duplicate events all saying the host would download a new agent image. That's not helpful.
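
As an illustration of the backoff idea only, the sketch below suppresses repeats of the same event by doubling the quiet period after each emission. `EventThrottle`, its parameters, and the event key are hypothetical names chosen for this example and are not part of assisted-service; the 180-second base interval mirrors the "every 3 minutes or so" cadence described above, and the one-hour cap is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class EventThrottle:
    """Hypothetical helper: rate-limit identical events with exponential backoff."""

    base_interval: float = 180.0  # seconds; assumed from the ~3 minute cadence
    max_interval: float = 3600.0  # arbitrary cap chosen for this sketch
    # key -> (earliest time the next event may be emitted, next quiet period)
    _state: dict = field(default_factory=dict, init=False)

    def should_emit(self, key: tuple, now: float) -> bool:
        next_allowed, interval = self._state.get(key, (0.0, self.base_interval))
        if now < next_allowed:
            return False  # still inside the quiet period: drop the duplicate
        # Emit, then double the quiet period for the next repeat (up to the cap).
        self._state[key] = (now + interval, min(interval * 2, self.max_interval))
        return True


def simulate(attempts: int, period: float = 180.0) -> int:
    """Count how many events are emitted when an upgrade fails `attempts` times."""
    throttle = EventThrottle()
    key = ("host-1", "agent_upgrade_failed")  # hypothetical host and event names
    return sum(throttle.should_emit(key, i * period) for i in range(attempts))
```

With this scheme, 100 failed attempts at a fixed 3-minute cadence would produce only a handful of events instead of 100. Reporting the failure as a validation, as suggested above, would avoid repeated events altogether.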

We have noticed that assisted-service uses a non-trivial amount of disk space.

To better control its usage and lifecycle, we should write to emptyDir-mounted volumes instead of to the container layer.

Description of problem:
Pipelines Repository support is Tech Preview. This is shown when searching for repositories or checking the details page.

But the tabbed Pipelines page (in the admin and dev perspectives) doesn't show this. Also, the "Add Git Repository" form page doesn't mention this.

Version-Release number of selected component (if applicable):
4.11 - 4.13 (master)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Pipelines operator
  2. Navigate to Pipelines > Repository tab
  3. Select Create > Repository

Actual results:
The Repository tab and the "Add Git Repository" form page don't show a Tech Preview badge.

Expected results:
The Repository tab and the "Add Git Repository" form page should show a Tech Preview badge.

Additional info:
Check how the Shipwright Builds show this Tech Preview badge for the tab.

Description of problem:

A customer is raising security concerns about using port 80 for bootstrap

Version-Release number of selected component (if applicable):

4.13

RFE-3577

Description of problem:

After upgrading an SNO cluster from 4.12 to 4.13, the csi-snapshot-controller is degraded with the following message (the same message appears in the csi-snapshot-controller-operator log):
E1122 09:02:51.867727       1 base_controller.go:272] StaticResourceController reconciliation failed: ["csi_controller_deployment_pdb.yaml" (string): poddisruptionbudgets.policy "csi-snapshot-controller-pdb" is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator" cannot delete resource "poddisruptionbudgets" in API group "policy" in the namespace "openshift-cluster-storage-operator", "webhook_deployment_pdb.yaml" (string): poddisruptionbudgets.policy "csi-snapshot-webhook-pdb" is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator" cannot delete resource "poddisruptionbudgets" in API group "policy" in the namespace "openshift-cluster-storage-operator"]

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-19-191518 to 4.13.0-0.nightly-2022-11-19-182111

How reproducible:

1/1

Steps to Reproduce:

Upgrade SNO cluster from 4.12 to 4.13 

Actual results:

csi-snapshot-controller is degraded

Expected results:

csi-snapshot-controller should be healthy

Additional info:

It also happened on a cluster installed from scratch on 4.13: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-aws-ovn-arm64-single-node/1594946128904720384

Description of problem:

Custom Victory-Core components in the monitoring UI code are causing build issues when updating PatternFly to release 2022.13. PatternFly updated its Victory version in this release, which causes an API mismatch with the custom monitoring code. The following is the output of the build:

ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(49,50):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(79,34):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(94,35):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(134,42):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(135,42):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(162,43):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(163,43):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/graphs/tooltip.tsx(250,50):
TS2554: Expected 2 arguments, but got 1.


ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/monitoring/query-browser.tsx
ERROR in /Users/jcaiani/src/github.com/openshift/console-pf-updates/console/frontend/public/components/monitoring/query-browser.tsx(267,7):
TS2605: JSX element type 'VictoryPortal' is not a constructor function for JSX elements.
  Type 'VictoryPortal' is missing the following properties from type 'ElementClass': setState, forceUpdate, props, state, refs
Child html-webpack-plugin for "index.html":

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
The "Add Git Repository" has a "Show configuration options" expandable section that shows the required permissions for a webhook setup, and provides a link to "read more about setting up webhook".

But the permission section shows nothing when this second expandable section is opened, and the link doesn't do anything until the user enters a "supported" GitHub, GitLab or BitBucket URL.

Version-Release number of selected component (if applicable):
4.11-4.13

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines operator
  2. Navigate to the Developer perspective > Pipelines
  3. Press "Create" and select "Repository"
  4. Click on "Show configuration options"
  5. Click on "See Git permissions"
  6. Click on "Read more about setting up webhook"

Actual results:

  1. The Git permission section shows no git permissions.
  2. The Read more link doesn't open any new page.

Expected results:

  1. The Git permission section should show some info or must not be disabled.
  2. The Read more link should open a page or must not be displayed as well.

Additional info:

  1. None

Description of problem:

metal3 pod does not come up on SNO when creating Provisioning with provisioningNetwork set to Disabled

The issue is that on SNO there is no Machine and no BareMetalHost, yet the Provisioning reconciler looks for Machine objects to populate the provisioningMacAddresses field. However, when provisioningNetwork is Disabled, provisioningMacAddresses is not used anyway.

You can work around this issue by populating provisioningMacAddresses with a dummy address, like this:

apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningMacAddresses:
  - aa:aa:aa:aa:aa:aa
  provisioningNetwork: Disabled
  watchAllNamespaces: true
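
To make the expected behaviour concrete, here is a minimal sketch of the decision the report argues for: only fall back to a Machine lookup when the provisioning network is actually in use. It is not the cluster-baremetal-operator code (which is written in Go); `resolve_provisioning_macs` and the `list_master_machine_macs` callback are hypothetical names used only for illustration.

```python
def resolve_provisioning_macs(spec, list_master_machine_macs):
    """Return the MAC addresses to use for a Provisioning spec (illustrative only).

    spec: dict mirroring the `spec` section of the Provisioning resource.
    list_master_machine_macs: hypothetical callback that returns the MACs
        found on master Machine objects (empty on SNO, where none exist).
    """
    # Explicitly configured addresses always win (this is what the workaround relies on).
    configured = spec.get("provisioningMacAddresses")
    if configured:
        return list(configured)

    # With the provisioning network disabled, the addresses are never used,
    # so there is no reason to require Machine objects at all.
    if spec.get("provisioningNetwork") == "Disabled":
        return []

    macs = list_master_machine_macs()
    if not macs:
        raise LookupError("machines with cluster-api-machine-role=master not found")
    return macs
```

Under this logic, the configuration without provisioningMacAddresses would reconcile on SNO without the dummy address.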

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Try to bring up Provisioning on SNO in 4.11.17 with provisioningNetwork set to Disabled

apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningNetwork: Disabled
  watchAllNamespaces: true

Steps to Reproduce:

1.
2.
3.

Actual results:

controller/provisioning "msg"="Reconciler error" "error"="machines with cluster-api-machine-role=master not found" "name"="provisioning-configuration" "namespace"="" "reconciler group"="metal3.io" "reconciler kind"="Provisioning"

Expected results:

metal3 pod should be deployed

Additional info:

This issue is a result of this change: https://github.com/openshift/cluster-baremetal-operator/pull/307
See this Slack thread: https://coreos.slack.com/archives/CFP6ST0A3/p1670530729168599

This is a clone of issue OCPBUGS-7906. The following is the description of the original issue:

Description of problem:

node-driver-registrar and hostpath containers in pod shared-resource-csi-driver-node-xxxxx under openshift-cluster-csi-drivers namespace are not pinned to reserved management cores.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Deploy SNO via ZTP with workload partitioning enabled
2. Check mgmt pods affinity
3.

Actual results:

pods do not have workload partitioning annotation, and are not pinned to mgmt cores

Expected results:

All management pods should be pinned to reserved cores

Pod should be annotated with: target.workload.openshift.io/management: '{"effect":"PreferredDuringScheduling"}'
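
A check for the expected annotation could look like the sketch below. It operates on pod metadata already loaded into a dict (for example from `oc get pod -o json`); the function name `has_management_annotation` is an assumption made for this example and does not come from the test suite referenced in this report.

```python
import json

MGMT_ANNOTATION = "target.workload.openshift.io/management"


def has_management_annotation(pod_metadata):
    """Return True if the pod carries the workload partitioning annotation
    with effect PreferredDuringScheduling, as expected in this report."""
    annotations = pod_metadata.get("annotations") or {}
    raw = annotations.get(MGMT_ANNOTATION)
    if raw is None:
        return False
    try:
        value = json.loads(raw)  # the annotation value is a JSON string
    except json.JSONDecodeError:
        return False
    return isinstance(value, dict) and value.get("effect") == "PreferredDuringScheduling"
```

Applied to the pod metadata in this report, which only carries network-status and SCC annotations, the check would return False.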

Additional info:

pod metadata

metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["fd01:0:0:1::5f/64"],"mac_address":"0a:58:97:51:ad:31","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:1::5f/64","gateway_ip":"fd01:0:0:1::1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "fd01:0:0:1::5f"
          ],
          "mac": "0a:58:97:51:ad:31",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "fd01:0:0:1::5f"
          ],
          "mac": "0a:58:97:51:ad:31",
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: privileged
/var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/tests/workload_partitioning.go:113


SNO management workload partitioning [It] should have management pods pinned to reserved cpus
/var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/tests/workload_partitioning.go:113

  [FAILED] Expected
      <[]ranwphelper.ContainerInfo | len:3, cap:4>: [
          {
              Name: "hostpath",
              Cpus: "2-55,58-111",
              Namespace: "openshift-cluster-csi-drivers",
              PodName: "shared-resource-csi-driver-node-vzvtc",
              Shares: 10,
              Pid: 41650,
          },
          {
              Name: "cluster-proxy-service-proxy",
              Cpus: "2-55,58-111",
              Namespace: "open-cluster-management-agent-addon",
              PodName: "cluster-proxy-service-proxy-66599b78bf-k2dvr",
              Shares: 2,
              Pid: 35093,
          },
          {
              Name: "node-driver-registrar",
              Cpus: "2-55,58-111",
              Namespace: "openshift-cluster-csi-drivers",
              PodName: "shared-resource-csi-driver-node-vzvtc",
              Shares: 10,
              Pid: 34782,
          },
      ]
  to be empty
  In [It] at: /var/lib/jenkins/workspace/ocp-far-edge-vran-deployment/cnf-gotests/test/ran/workloadpartitioning/ranwphelper/ranwphelper.go:172 @ 02/22/23 01:05:00.268

cluster-proxy-service-proxy is reported in https://issues.redhat.com/browse/OCPBUGS-7652

This is a clone of issue OCPBUGS-11046. The following is the description of the original issue:

Description of problem:

The following test is permafailing in Prow CI:
[tuningcni] sysctl allowlist update [It] should start a pod with custom sysctl only after adding sysctl to allowlist

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn-periodic/1640987392103944192


[tuningcni]
/go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:26
  sysctl allowlist update
  /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:141
    should start a pod with custom sysctl only after adding sysctl to allowlist
    /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156
  > Enter [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855
  < Exit [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855 (0s)
  > Enter [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.855
  < Exit [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.896 (41ms)
  > Enter [It] should start a pod with custom sysctl only after adding sysctl to allowlist - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156 @ 03/29/23 10:08:49.896
  [FAILED] Unexpected error:
      <*errors.errorString | 0xc00044eec0>: {
          s: "timed out waiting for the condition",
      }
      timed out waiting for the condition
  occurred
  In [It] at: /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:186 @ 03/29/23 10:09:53.377

Version-Release number of selected component (if applicable):

master (4.14)

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Test fails

Expected results:

Test passes

Additional info:

PR https://github.com/openshift-kni/cnf-features-deploy/pull/1445 adds some useful information to the reported archive.

Description of problem:

When more than 80 routes are created, SSL connections between the OAuth Proxy container and HAProxy are disconnected with the following error message:
2022/12/15 21:37:01 server.go:3120: http: TLS handshake error from 10.128.18.27:47142: write tcp 10.128.10.57:8443->10.128.18.27:47142: write: connection reset by peer 
With model serving, once 100 connections were made the OAuth Proxy container failed, so the model serving pod failed too.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

git clone https://github.com/Jooho/jhouse_openshift.git
cd jhouse_openshift/OAuthProxy/https-reencrypt 

oc new-project oauth-proxy
oc new-app -S php -n oauth-proxy
oc new-app --template=cakephp-mysql-example -n oauth-proxy
oc apply -f ./
oc replace -f ./svc-cakephp-mysql-example.yaml
oc scale dc/cakephp-mysql-example --replicas=2

# Wait until all pods are running. 
export Token=$(oc sa new-token user-one)
export URL=$(oc get route cakephp-mysql-example -ojsonpath='{.spec.host}')
curl -o /dev/null -I -w "%{http_code}"  --silent --location --fail --show-error --insecure https://${URL}/ -H "Authorization: Bearer ${Token}"

# Start reproducing the error
cat <<EOF> /tmp/cakephp.yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    app: cakephp-mysql-example
    template: cakephp-mysql-example
  name: cakephp-mysql-example
  namespace: oauth-proxy
spec:
  port:
    targetPort: oauth-https
  tls:
    insecureEdgeTerminationPolicy: Redirect
    termination: reencrypt
  to:
    kind: Service
    name: cakephp-mysql-example
    weight: 100
  wildcardPolicy: None
EOF

for i in {1..100} ; do sed "7s/name:.*/name: cakephp-mysql-example-$i/g" /tmp/cakephp.yaml |oc apply -f - ; done

# Check the error
oc logs dc/cakephp-mysql-example  -c oauth-proxy

Actual results:

Disconnected connections between OAuth Proxy and HAProxy

Expected results:

No errors happen

Additional info:

When I set the Router replica count to 1, the issue was gone. However, when I increased it to 3, the issue was still present. So I don't think it is a resource limitation issue.

 

 

We have recently seen regular failures, mostly due to openshift-e2e-loki ErrImagePull issues caused by Akamai caching error pages. If we know the cause isn't a product issue, we don't want to fail the payload because of it.

Kubelet logs contain the recent failures, in which an error page is returned and causes a corrupt-signature error. Each log entry contains the locator, including the namespace. We can count these occurrences and, when they exceed a specific threshold, filter out the alerting errors we see later on for those namespaces.

7758: Feb 01 05:37:45.731611 ci-op-vyccmv3h-4ef92-xs5k5-master-0 kubenswrapper[2213]: E0201 05:37:45.730879 2213 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"oauth-proxy\" with ErrImagePull: \"rpc error: code = Unknown desc = copying system image from manifest list: reading signatures: parsing signature https://registry.redhat.io/containers/sigstore/openshift4/ose-oauth-proxy@sha256=f968922564c3eea1c69d6bbe529d8970784d6cae8935afaf674d9fa7c0f72ea3/signature-9: unrecognized signature format, starting with binary 0x3c\"" pod="openshift-e2e-loki/loki-promtail-plm74" podUID=59b26cbf-3421-407c-98ee-986b5a091ef4

 

We can extract the namespace from

pod="openshift-e2e-loki/loki-promtail-plm74"

Then, when evaluating the alerts to check for failures, filter them out if we know that X errors have been seen in the kubelet logs.
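
The counting and filtering described above could be sketched roughly as follows. This is an illustration rather than the origin test code: the function names, the dict shape used for alerts, and the threshold value (the "X" above, which this ticket leaves open) are all assumptions. The sample line is an abbreviated copy of the kubelet entry quoted earlier.

```python
import re
from collections import Counter

# Matches the locator at the end of the kubelet entry, e.g. pod="namespace/pod-name".
POD_RE = re.compile(r'pod="(?P<namespace>[^/"]+)/(?P<pod>[^"]+)"')
SIGNATURE_ERROR = "unrecognized signature format"

# Abbreviated copy of the kubelet log entry quoted in this ticket.
SAMPLE_LINE = (
    r'E0201 05:37:45.730879 2213 pod_workers.go:965] "Error syncing pod, skipping" '
    r'err="failed to \"StartContainer\" for \"oauth-proxy\" with ErrImagePull: '
    r'\"... unrecognized signature format, starting with binary 0x3c\"" '
    r'pod="openshift-e2e-loki/loki-promtail-plm74" '
    r'podUID=59b26cbf-3421-407c-98ee-986b5a091ef4'
)


def count_signature_errors(log_lines):
    """Count corrupt-signature ErrImagePull entries per namespace."""
    counts = Counter()
    for line in log_lines:
        if "ErrImagePull" not in line or SIGNATURE_ERROR not in line:
            continue
        match = POD_RE.search(line)
        if match:
            counts[match.group("namespace")] += 1
    return counts


def filter_alerts(alerts, error_counts, threshold):
    """Drop alerts for namespaces that reached the error threshold.

    alerts: list of dicts with at least a "namespace" key (assumed shape).
    threshold: the "X" from the text; its value is an open question.
    """
    return [a for a in alerts if error_counts.get(a.get("namespace"), 0) < threshold]
```

Alerts in namespaces with fewer matching log entries than the threshold are kept, so genuine product failures elsewhere would still fail the payload.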

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620652883970101248

: [bz-Unknown][invariant] alert/KubePodNotReady should not be at or above info in all the other namespaces
0s
{  KubePodNotReady was at or above info for at least 2h47m30s on platformidentification.JobType{Release:"4.13", FromRelease:"4.12", Platform:"gcp", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 3m52s, firing for 2h47m30s:

Feb 01 06:10:48.338 - 2398s W alert/KubePodNotReady ns/openshift-e2e-loki pod/loki-promtail-ld26r ALERTS{alertname="KubePodNotReady", alertstate="firing", namespace="openshift-e2e-loki", pod="loki-promtail-ld26r", prometheus="openshift-monitoring/k8s", severity="warning"}

 

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Run #0: Failed (1m2s)
{  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:522]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        <*errors.errorString | 0xc0014a0900>{
            s: "promQL query returned unexpected results:\nALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|KubeJobFailed|Watchdog|KubePodNotReady|...\",alertstate=\"firing\",severity!=\"info\"} >= 1\n[\n  {\n    \"metric\": {\n      \"__name__\": \"ALERTS\",\n      \"alertname\": \"KubeContainerWaiting\",\n      \"alertstate\": \"firing\",\n      \"container\": \"oauth-proxy\",\n      \"namespace\": \"openshift-e2e-loki\",\n      \"pod\": \"loki-promtail-tfrnc\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"severity\": \"warning\"\n    },\n    \"value\": [\n      1675236853.465,\n      \"1\"\n    ]\n  },\n  {\n    \"metric\": {\n      \"__name__\": \"ALERTS\",\n      \"alertname\": \"KubeDaemonSetRolloutStuck\",\n      \"alertstate\": \"firing\",\n      \"container\": \"kube-rbac-proxy-main\",\n      \"daemonset\": \"loki-promtail\",\n      \"endpoint\": \"https-main\",\n      \"job\": \"kube-state-metrics\",\n      \"namespace\": \"openshift-e2e-loki\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"service\": \"kube-state-metri...
 

 

 

 

This is a clone of issue OCPBUGS-3505. The following is the description of the original issue:

Description of problem:

When installing clusters with the assisted installer, we have lately seen cases where one of the masters joins very quickly and starts all the pods needed for cluster bootstrap to finish, while the second master joins only after that.
Keepalived can't start if only one master has joined, as it doesn't have enough data to build its configuration files.
In HA mode, cluster bootstrap should wait for at least 2 joined masters before removing the bootstrap control plane; otherwise the installation will fail.
 

Version-Release number of selected component (if applicable):

 

How reproducible:

Start bm installation and start one master, wait till it starts all required pods and then add others.

Steps to Reproduce:

1. Start bm installation 
2. Start one master 
3. Wait till it starts all required pods.
4. Add others

Actual results:

no vip, installation fails

Expected results:

installation succeeds, vip moves to master

Additional info:

 

This is a clone of issue OCPBUGS-9985. The following is the description of the original issue:

Description of problem:

DNS local endpoint preference is not working for TCP DNS requests in OpenShift SDN.

Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012

This is where the DNS request is short-circuited to the local DNS endpoint if it exists. This is important because DNS local preference protects against another outstanding bug, in which daemonset pods go stale for a few seconds upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for the graceful node shutdown fix). This appears to be contributing to DNS issues in our internal CI clusters. https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows large amounts of "dns_tcp_lookup" failures, which I attribute to this bug.

UDP DNS local preference is working fine in OpenShift SDN. Both UDP and TCP local preference work fine in OVN. It's just TCP DNS local preference that is not working in OpenShift SDN.

Version-Release number of selected component (if applicable):

4.13, 4.12, 4.11

How reproducible:

100%

Steps to Reproduce:

1. oc debug -n openshift-dns
2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind
# Retry multiple times, and you should always get the same local DNS pod.
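
The "always the same local DNS pod" condition from the step above can also be checked mechanically. The sketch below assumes the answers from repeated `dig` runs have already been collected into a list; `answers_are_local` is a hypothetical helper name, and the two sample lists reuse the pod names reported in this bug.

```python
def answers_are_local(answers):
    """Return True when every TCP lookup was answered by the same DNS pod."""
    return len(answers) > 0 and len(set(answers)) == 1


# Pod names as reported in this bug for OpenShift SDN (actual) and as expected.
observed = ["dns-default-glgr8", "dns-default-gzlhm", "dns-default-dnbsp", "dns-default-gzlhm"]
expected = ["dns-default-glgr8"] * 4

print("observed is local:", answers_are_local(observed))
print("expected is local:", answers_are_local(expected))
```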

Actual results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-dnbsp"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"

Expected results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8" 

Additional info:

https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working.

iptables-save from a 4.13 vanilla cluster bot AWS,SDN: https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing 

Description of problem:
`oc-mirror` hits an error when mirroring in OCI format to a docker destination without a namespace

How reproducible:
always

Steps to Reproduce:
1. Copy the operator image with OCI format to localhost:

cat copy.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.11
    packages:
    - name: multicluster-engine
      minVersion: '2.1.1'
      maxVersion: '2.1.2'

`oc-mirror --config copy.yaml oci:///home/ocmirrortest/noo --use-oci-feature --oci-feature-action=copy --continue-on-error`

2. Mirror the operator image with OCI format to a registry without a namespace:

cat mirror.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: oci:///home/ocmirrortest/noo/redhat-operator-index
    packages:
    - name: multicluster-engine
      minVersion: '2.1.1'
      maxVersion: '2.1.2'

`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000`

Actual results:
2. Hit error:
`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000`
……
info: Mirroring completed in 30ms (0B/s)
error: mirroring images "localhost:5000//multicluster-engine/mce-operator-bundle@sha256:e7519948bbcd521390d871ccd1489a49aa01d4de4c93c0b6972dfc61c92e0ca2" is not a valid image reference: invalid reference format

Expected results:
2. No error

Additional info:
`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000/ocmir` works well.
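
The double slash in the failing reference (`localhost:5000//multicluster-engine/...`) is what an empty namespace segment would produce if it were joined unconditionally. The sketch below shows one way to build the reference that skips empty segments. It is only an illustration of the idea, not the oc-mirror implementation (which is written in Go), and `join_image_ref` is a hypothetical name.

```python
def join_image_ref(registry, namespace, repository, digest):
    """Build registry[/namespace]/repository@digest, skipping empty segments."""
    parts = [part.strip("/") for part in (registry, namespace, repository)]
    path = "/".join(part for part in parts if part)
    return f"{path}@{digest}"


digest = "sha256:e7519948bbcd521390d871ccd1489a49aa01d4de4c93c0b6972dfc61c92e0ca2"

# Destination without a namespace (docker://localhost:5000): no double slash.
print(join_image_ref("localhost:5000", "", "multicluster-engine/mce-operator-bundle", digest))

# Destination with a namespace (docker://localhost:5000/ocmir).
print(join_image_ref("localhost:5000", "ocmir", "multicluster-engine/mce-operator-bundle", digest))
```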

This is a clone of issue OCPBUGS-10647. The following is the description of the original issue:

Description of problem:

Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations.

When CNO is managed by Hypershift, multus-admission-controller must run with a non-root security context. If Hypershift runs the control plane on a Kubernetes (as opposed to OpenShift) management cluster, it adds a pod or container security context with a runAsUser clause to most deployments.

In Hypershift CPO, the security context of deployment containers, including CNO, is set when it detects that SCC's are not available, see https://github.com/openshift/hypershift/blob/9d04882e2e6896d5f9e04551331ecd2129355ecd/support/config/deployment.go#L96-L100. In such a case CNO should do the same, set security context for its managed deployment multus-admission-controller to meet Hypershift standard.

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster with Hypershift on a Kubernetes management cluster
2. Check the pod security context of multus-admission-controller

Actual results:

no pod security context is set

Expected results:

pod security context is set with runAsUser: xxxx

Additional info:

This is the highest priority item from https://issues.redhat.com/browse/OCPBUGS-7942 and it needs to be fixed ASAP as it is a security issue preventing IBM from releasing Hypershift-managed Openshift service.

Description of problem:

When the cluster is configured with a proxy, the Swift client in the image registry operator does not use the proxy to authenticate with OpenStack, so it is unable to reach the OpenStack API. The issue became evident after support was recently added to not fall back to Cinder when Swift is available [1].

[1]https://github.com/openshift/cluster-image-registry-operator/pull/819

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Deploy a cluster with proxy and restricted installation
2. 
3.

Actual results:

 

Expected results:

 

Additional info:

 

{{

Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: panic: runtime error: invalid memory address or nil pointer dereference
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x144b69b]
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: goroutine 1 [running]:
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/openshift/assisted-installer-agent/src/session.createBmInventoryClient.func1.1({0x0, 0xc00017b100, 0x0, {0x19feca0, 0xc000916460}})
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/src/session/inventory_session.go:159 +0x19b
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/PuerkitoBio/rehttp.toRetryFn.func1({0x0, 0xc00017b100, 0x0, {0x19feca0, 0xc000916460}})
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/vendor/github.com/PuerkitoBio/rehttp/rehttp.go:122 +0x76
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/PuerkitoBio/rehttp.(*Transport).RoundTrip(0xc00031d120, 0xc00017b100)
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/vendor/github.com/PuerkitoBio/rehttp/rehttp.go:312 +0x37e
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: net/http.send(0xc00017b100, {0x19fb840, 0xc00031d120}, {0x16fcd00, 0x175fd01, 0x0})
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /usr/lib/golang/src/net/http/client.go:252 +0x5d8
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: net/http.(*Client).send(0xc000ae5da0, 0xc00017b100, {0x1782b81, 0x0, 0x0})
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /usr/lib/golang/src/net/http/client.go:176 +0x9b
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: net/http.(*Client).do(0xc000ae5da0, 0xc00017b100)
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /usr/lib/golang/src/net/http/client.go:725 +0x908
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: net/http.(*Client).Do(...)
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /usr/lib/golang/src/net/http/client.go:593
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/go-openapi/runtime/client.(*Runtime).Submit(0xc0004a02a0, 0xc00044a9c0)
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/vendor/github.com/go-openapi/runtime/client/runtime.go:471 +0x465
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/openshift/assisted-service/client/installer.(*Client).V2RegisterHost(0xc000ae57a0, {0x1a274d0, 0xc000ae58c0}, 0xc000451b00)
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/vendor/github.com/openshift/assisted-service/client/installer/installer_client.go:1296 +0x2b4
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/openshift/assisted-installer-agent/src/commands.(*v2ServiceAPI).RegisterHost(0xc0002a7cc0, 0xc000ae5890)
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/src/commands/service_api.go:43 +0x20b
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/openshift/assisted-installer-agent/src/commands.RegisterHostWithRetry(0xc0004a01c0, {0x1a69460, 0xc000386e00})
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/src/commands/register_node.go:23 +0x14e
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: github.com/openshift/assisted-installer-agent/src/agent.RunAgent(0xc0004a01c0, {0x19fdbe0, 0x2707710}, {0x1a69460, 0xc000386e00})
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/src/agent/agent.go:42 +0x76
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: main.main()
Nov 30 12:27:48 wsfd-advnetlab52.anl.lab.eng.bos.redhat.com agent[8125]: /remote-source/app/src/agent/main/main.go:15 +0xac
}}

With the CSISnapshot capability disabled, all CSI driver operators are Degraded. For example, the AWS EBS CSI driver operator during installation:

18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource
AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: )
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):
4.12.nightly

The reason is that cluster-csi-snapshot-controller-operator does not create the VolumeSnapshotClass CRD, which the AWS EBS CSI driver operator expects to exist.

CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.
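
The rule above can be sketched as a filter over the operator's static assets. This is an illustration only, not the operator's code: `assets_to_apply`, the (file name, kind) pairs, and the set of served kinds are hypothetical stand-ins for whatever discovery mechanism the operator actually uses.

```python
def assets_to_apply(assets, available_kinds):
    """Return the asset file names whose kind is served by the API server.

    assets: iterable of (file_name, kind) pairs (hypothetical structure).
    available_kinds: set of kinds for which a CRD or built-in API exists.
    """
    return [name for name, kind in assets if kind in available_kinds]


static_assets = [
    ("storageclass.yaml", "StorageClass"),
    ("volumesnapshotclass.yaml", "VolumeSnapshotClass"),
]

# With the CSISnapshot capability disabled, the VolumeSnapshotClass CRD is
# absent, so the snapshot class asset is skipped instead of failing the sync.
print(assets_to_apply(static_assets, {"StorageClass"}))
```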

While the installer binary is statically linked, the terraform binaries shipped with it are dynamically linked.

This can cause problems when running the installer on Linux, depending on the glibc version installed by the specific distribution. It becomes a risk when switching the builders' base image from ubi8 to ubi9 and then running the installer on cs8 or rhel8.

For example, building the installer on cs9 and trying to run it in a cs8 distribution leads to:

time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform version -json"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform version -json"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform init -no-color -force-copy -input=false -backend=true -get=true -upgrade=false -plugin-dir=/root/test/terraform/plugins"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failure applying terraform for \"cluster\" stage: failed to create cluster: failed doing terraform init: exit status 1\n/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)\n/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)\n"

How reproducible: Always

Steps to Reproduce:
1. Build the installer on cs9
2. Run the installer on cs8 until the terraform binaries are started
3. Inspect the terraform binary with ldd or file: it is not statically linked, so the error above may occur depending on the glibc version of the host

Actual results:

 

Expected results:

The terraform and provider binaries should be statically linked, as the installer is.

Additional info:

This comes from a build of OKD/SCOS that is happening outside of Prow on a cs9-based builder image.

One can use the Dockerfile at images/installer/Dockerfile.ci and replace the builder image with one like https://github.com/okd-project/images/blob/main/okd-builder.Dockerfile

Description of problem:

In at least 4.12.0-rc.0, a user with read-only access to ClusterVersion can see a "Control plane is hosted" banner (despite the control plane not being hosted), because hasPermissionsToUpdate is false, so canPerformUpgrade is false.

Version-Release number of selected component (if applicable):

4.12.0-rc.0. Likely more. I haven't traced it out.

How reproducible:

Always.

Steps to Reproduce:

1. Install 4.12.0-rc.0
2. Create a user with cluster-wide read-only permissions. For me, it's via binding to a sudoer ClusterRole. I'm not sure where that ClusterRole comes from, but it's:

$ oc get -o yaml clusterrole sudoer
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2020-05-21T19:39:09Z"
  name: sudoer
  resourceVersion: "7715"
  uid: 28eb2ffa-dccd-47e8-a2d5-6a95e0e8b1e9
rules:
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:admin
  resources:
  - systemusers
  - users
  verbs:
  - impersonate
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:masters
  resources:
  - groups
  - systemgroups
  verbs:
  - impersonate

3. View /settings/cluster

Actual results:

See the "Control plane is hosted" banner.

Expected results:

Possible cases:

  • For me in my impersonate group, I can trigger updates via the command-line by using --as system:admin. I don't know if the console supports impersonation, or wants to mention the option if it does not.
  • For users with read-only access in stand-alone clusters, telling the user they are not authorized to update makes sense. Maybe mention that their cluster admins may be able to update, or just leave that unsaid.
  • For users with managed/dedicated branding, possibly point out that updates in that environment happen via OCM. And leave it up to OCM to decide if that user has access.
  • For users with externally-hosted control planes, possibly tell them this regardless of whether they have the ability to update via some external interface or not. For externally-hosted, Red-Hat-managed clusters, the interface will presumably be OCM. For externally-hosted, customer-managed clusters, there may be some ACM or other interface? I'm not sure. But the message of "this in-cluster web console is not where you configure this stuff, even if you are one of the people who can make these decisions for this cluster" will apply for all hosted situations.
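
The cases above amount to keeping "control plane is hosted" and "not authorized to update" as separate conditions rather than deriving one from the other. A minimal sketch of that separation follows. It is not the console's code; `update_banner` and its return values are hypothetical names chosen for illustration.

```python
def update_banner(control_plane_hosted: bool, has_permission_to_update: bool) -> str:
    """Pick which banner to show on the cluster settings page.

    Hypothetical decision logic: the hosted banner depends only on whether
    the control plane is hosted, never on the viewer's permissions.
    """
    if control_plane_hosted:
        return "hosted"
    if not has_permission_to_update:
        return "not-authorized"
    return "none"


# A read-only user on a stand-alone cluster, as in this report.
print(update_banner(control_plane_hosted=False, has_permission_to_update=False))
```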

This is a clone of issue OCPBUGS-9969. The following is the description of the original issue:

Description of problem:

An OCP cluster born on 4.1 fails to scale up nodes because of the older podman version 1.0.2 present in the 4.1 bootimage. This was observed while testing bug https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21889975&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21889975

Journal log:
- Unit machine-config-daemon-update-rpmostree-via-container.service has finished starting up.
--
-- The start-up result is RESULT.
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: flag provided but not defined: -authfile
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: See 'podman run --help'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Main process exited, code=exited, status=125/n/a
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Failed with result 'exit-code'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Consumed 24ms CPU time

Version-Release number of selected component (if applicable):

OCP 4.12 and later

Steps to Reproduce:

1. Upgrade a 4.1-based cluster to 4.12 or a later version
2. Try to scale up a node
3. The node fails to join

 

Additional info:  https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21890647&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21890647

The install_type field in telemetry data is not automatically set from the installer invoker value. Any values we wish to appear must be explicitly converted to the corresponding install_type value.

Currently this makes clusters installed with the agent-based installer (invoker agent-installer) invisible in telemetry.

The kube-state-metrics pod in the openshift-monitoring namespace is not running as expected.

The logs show a nil pointer dereference panic:

~~~
2022-11-22T09:57:17.901790234Z I1122 09:57:17.901768 1 main.go:199] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
2022-11-22T09:57:17.901975837Z I1122 09:57:17.901951 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.902389844Z I1122 09:57:17.902291 1 main.go:210] Starting metrics server: 127.0.0.1:8081
2022-11-22T09:57:17.903191857Z I1122 09:57:17.903133 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.906272505Z I1122 09:57:17.906224 1 builder.go:191] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
2022-11-22T09:57:17.917758187Z E1122 09:57:17.917560 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-11-22T09:57:17.917758187Z goroutine 24 [running]:
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1635600, 0x2696e10})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
2022-11-22T09:57:17.917758187Z panic({0x1635600, 0x2696e10})
2022-11-22T09:57:17.917758187Z /usr/lib/golang/src/runtime/panic.go:1038 +0x215
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x40)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1({0x17fe520, 0xc00063b590})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:175 +0x49
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x17fe520, 0xc00063b590})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
~~~

Logs are attached to the support case

Description of problem:

We have a pipeline that checks whether the sample-operator works well. However, I found that the `oc import-image` command always returns 0, even when the import fails.

MacBook-Pro:~ jianzhang$ oc import-image mytestimage --from=quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f --confirm -n sample-test
error: tag  failed: Internal error occurred: quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: name unknown: repository not found
imagestream.image.openshift.io/mytestimage imported with errors


Name:			mytestimage
Namespace:		sample-test
Created:		4 minutes ago
Labels:			<none>
Annotations:		openshift.io/image.dockerRepositoryCheck=2022-11-30T02:53:29Z
Image Repository:	image-registry.openshift-image-registry.svc:5000/sample-test/mytestimage
Image Lookup:		local=false
Unique Images:		0
Tags:			1


latest
  tagged from quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f


  ! error: Import failed (InternalError): Internal error occurred: quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: name unknown: repository not found
      4 minutes ago


MacBook-Pro:~ jianzhang$ echo $?
0

Version-Release number of selected component (if applicable):

4.12

How reproducible:

always

Steps to Reproduce:

1. Install OCP 4.12
MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-29-131548   True        False         3h22m   Cluster version is 4.12.0-0.nightly-2022-11-29-131548

2. Import a non-existent image.
MacBook-Pro:~ jianzhang$ oc import-image mytestimage --from=quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f --confirm -n sample-test
error: tag  failed: Internal error occurred: quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: name unknown: repository not found
imagestream.image.openshift.io/mytestimage imported with errors


Name:			mytestimage
Namespace:		sample-test
Created:		4 minutes ago
Labels:			<none>
Annotations:		openshift.io/image.dockerRepositoryCheck=2022-11-30T02:53:29Z
Image Repository:	image-registry.openshift-image-registry.svc:5000/sample-test/mytestimage
Image Lookup:		local=false
Unique Images:		0
Tags:			1


latest
  tagged from quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f


  ! error: Import failed (InternalError): Internal error occurred: quay.io/openshifttest/busybox2@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: name unknown: repository not found
      4 minutes ago

3. Check the return code.

Actual results:

It returns the success code 0 even though the import failed. Why? Any reason? Thanks!

MacBook-Pro:~ jianzhang$ echo $?
0

Expected results:

MacBook-Pro:~ jianzhang$ echo $?
1
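
While the exit code does not reflect the failure, a pipeline cannot rely on `$?` alone. One possible workaround, sketched here under the assumption that the output keeps the wording shown in this report, is to scan the captured output for the failure markers. `import_failed` and `FAILURE_MARKERS` are hypothetical names, not part of `oc`.

```python
# Strings taken from the command output quoted in this report.
FAILURE_MARKERS = ("imported with errors", "error: Import failed")


def import_failed(output: str) -> bool:
    """Return True if captured `oc import-image` output reports a failed import."""
    return any(marker in output for marker in FAILURE_MARKERS)


captured = (
    "error: tag  failed: Internal error occurred: name unknown: repository not found\n"
    "imagestream.image.openshift.io/mytestimage imported with errors\n"
)
print(import_failed(captured))
```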

Additional info:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/34311/rehearse-34311-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.11-installer-rehearse-aws-c2s/1597474073825251328/artifacts/installer-rehearse-aws-c2s/set-sample-operator-disconnected/build-log.txt 

Running Command: oc import-image mytestimage --from=quay.io/openshifttest/busybox@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f --confirm -n sample-test
error: tag  failed: Internal error occurred: [ec2-54-162-188-130.compute-1.amazonaws.com:6001/openshifttest/busybox@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: Get "https://ec2-54-162-188-130.compute-1.amazonaws.com:6001/v2/": dial tcp 10.143.0.208:6001: connect: connection refused, quay.io/openshifttest/busybox@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 172.30.0.10:53: server misbehaving]
imagestream.image.openshift.io/mytestimage imported with errors

Name:			mytestimage
Namespace:		sample-test
Created:		Less than a second ago
Labels:			<none>
Annotations:		openshift.io/image.dockerRepositoryCheck=2022-11-29T07:31:50Z
Image Repository:	image-registry.openshift-image-registry.svc:5000/sample-test/mytestimage
Image Lookup:		local=false
Unique Images:		0
Tags:			1

latest
  tagged from quay.io/openshifttest/busybox@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f

  ! error: Import failed (InternalError): Internal error occurred: [ec2-54-162-188-130.compute-1.amazonaws.com:6001/openshifttest/busybox@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: Get "https://ec2-54-162-188-130.compute-1.amazonaws.com:6001/v2/": dial tcp 10.143.0.208:6001: connect: connection refused, quay.io/openshifttest/busybox@sha256:c5439d7db88ab5423999530349d327b04279ad3161d7596d2126dfb5b02bfd1f: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 172.30.0.10:53: server misbehaving]
      Less than a second ago

The fix for OCPBUGS-3382 ensures that we pass the proxy settings from the install-config through to the final cluster. However, nothing in the agent ISO itself uses proxy settings (at least until bootstrapping starts).

It is probably less likely for the agent-based installer that proxies will be needed than e.g. for assisted (where agents running on-prem need to call back to assisted-service in the cloud), but we should be consistent about using any proxy config provided. There may certainly be cases where the registry is only reachable via a proxy.

This can be easily set system-wide by configuring default environment variables in the systemd config. An example (from the bootstrap ignition) is: https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/etc/systemd/system.conf.d/10-default-env.conf.template
Note that the agent service currently overrides these environment variables explicitly to be empty, so that override will have to be removed.

Description of problem:

the service ca controller start func seems to return that error as soon as its context is cancelled (which seems to happen the moment the first signal is received): https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f558246b8025584056/pkg/controller/starter.go#L24

that apparently triggers os.Exit(1) immediately https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f55824[…]om/openshift/library-go/pkg/controller/controllercmd/builder.go

the lock release doesn't happen until the periodic renew tick breaks out https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f55824[…]/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go

It seems unlikely that the call to le.release() is reached before the call to os.Exit(1) in the other goroutine.

Version-Release number of selected component (if applicable):

4.13.0

How reproducible:

~always

Steps to Reproduce:

1. oc delete -n openshift-service-ca pod <service-ca pod>

Actual results:

the old pod logs show:

W1103 09:59:14.370594       1 builder.go:106] graceful termination failed, controllers failed with error: stopped

and when a new pod comes up to replace it, it has to wait for a while before acquiring the leader lock

I1103 16:46:00.166173       1 leaderelection.go:248] attempting to acquire leader lease openshift-service-ca/service-ca-controller-lock...
 .... waiting ....
I1103 16:48:30.004187       1 leaderelection.go:258] successfully acquired lease openshift-service-ca/service-ca-controller-lock

Expected results:

new pod can acquire the leader lease without waiting for the old pod's lease to expire

Additional info:

 

Description of problem:

When we detect a refs/heads/branchname we should show the label as what we have now:

- Branch: branchname

And when we detect a refs/tags/tagname we should instead show the label as:

- Tag: tagname

I haven't implemented this in the CLI, but there is an old issue for it here: openshift-pipelines/pipelines-as-code#181
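
The labeling rule described above can be sketched as follows. This is not the console's implementation; `ref_label` is a hypothetical helper, and treating any other ref as a branch is an assumption made to preserve the current behavior.

```python
def ref_label(ref: str) -> str:
    """Build the display label for a Git ref.

    refs/tags/<name>  -> "Tag: <name>"
    refs/heads/<name> -> "Branch: <name>"
    Anything else is shown as a branch (assumption: matches current behavior).
    """
    tag_prefix = "refs/tags/"
    branch_prefix = "refs/heads/"
    if ref.startswith(tag_prefix):
        return "Tag: " + ref[len(tag_prefix):]
    if ref.startswith(branch_prefix):
        return "Branch: " + ref[len(branch_prefix):]
    return "Branch: " + ref


print(ref_label("refs/heads/branchname"))
print(ref_label("refs/tags/tagname"))
```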

Version-Release number of selected component (if applicable):

4.11.z

How reproducible:

 

Steps to Reproduce:

1. Create a repository
2. Trigger the pipelineruns with a push or pull request event on GitHub

Actual results:

We do not show the tag name even when a tag is present instead of a branch.

Expected results:

We should show the tag if a tag is detected and the branch if a branch is detected.

Additional info:

https://github.com/openshift/console/pull/12247#issuecomment-1306879310

openshift-azure-routes.path has the following [Path] section:

[Path]
PathExistsGlob=/run/cloud-routes/*
PathChanged=/run/cloud-routes/
MakeDirectory=true

 

There was a change in systemd that re-checks the files watched with PathExistsGlob once the service finishes:

With this commit, systemd rechecks all paths specs whenever the triggered unit deactivates. If any PathExists=, PathExistsGlob= or DirectoryNotEmpty= predicate passes, the triggered unit is reactivated

 

This means that openshift-azure-routes will be triggered continuously as long as there are files in /run/cloud-routes.

With the CSISnapshot capability disabled, the CSI driver operators are Degraded. For example:

18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource
AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: )
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):
4.12.nightly

The reason is that cluster-csi-snapshot-controller-operator does not create the VolumeSnapshotClass CRD, which the AWS EBS CSI driver operator expects to exist.

CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.

Description of problem: This is a follow-up to OCPBUGS-3933.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of the problem:

When creating an SNO cluster with LVM enabled on a pre-release 4.12 OCP version, the subscription name is wrong (odf-lvm-operator instead of lvms-operator).

 

Steps to reproduce:

1. Create SNO cluster with pre-release OCP version (e.g. 4.12.0-rc.4)

2. Select "Install Logical Volume Manager Storage"

3. Get cluster info and check the "subscription_name" under LVM operator

 

Actual results:

 

{            
    "cluster_id": "e0bceccf-1023-4ac0-986b-17fc9363a059",
    "name": "lvm",
    "namespace": "openshift-storage",
    "operator_type": "olm",
    "status_updated_at": "0001-01-01T00:00:00.000Z",
    "subscription_name": "odf-lvm-operator",
    "timeout_seconds": 1800
} 

 

 

Expected results:

{            
    "cluster_id": "0180b501-b0aa-4413-a55c-72ace8cc0915",
    "name": "lvm",
    "namespace": "openshift-storage",
    "operator_type": "olm",
    "status_updated_at": "0001-01-01T00:00:00.000Z",
    "subscription_name": "lvms-operator",
    "timeout_seconds": 1800
} 
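
The symptom is consistent with a version comparison that ranks a pre-release such as 4.12.0-rc.4 below 4.12, although the report does not state the cause. A minimal sketch that compares only the major and minor components follows. It assumes that lvms-operator applies from 4.12 onward and odf-lvm-operator before that; `major_minor` and `lvm_subscription_name` are hypothetical helpers, not the assisted-service code.

```python
import re


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version such as '4.12.0-rc.4'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def lvm_subscription_name(ocp_version: str) -> str:
    """Choose the LVM subscription name, ignoring any pre-release suffix.

    Assumption: the 4.12 threshold is inferred from this report.
    """
    if major_minor(ocp_version) >= (4, 12):
        return "lvms-operator"
    return "odf-lvm-operator"


print(lvm_subscription_name("4.12.0-rc.4"))
```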

Description of problem:

Image registry pods panic while deploying OCP in ap-southeast-4 AWS region

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Deploy OCP in AWS ap-southeast-4 region

Steps to Reproduce:

Deploy OCP in AWS ap-southeast-4 region 

Actual results:

panic: Invalid region provided: ap-southeast-4

Expected results:

Image registry pods should come up with no errors

Additional info:

 

 

 

 

At some point in the mtu-migration development, a configuration file was generated at /etc/cno/mtu-migration/config and used as a flag to tell configure-ovs that a migration procedure was in progress. When that file was missing, configure-ovs assumed the migration procedure was over and did some cleanup on its behalf.

That has since changed, and /etc/cno/mtu-migration/config is never set. As a result, configure-ovs removes the mtu-migration information while the procedure is still in progress. This makes it use incorrect MTU values, which either taints nodes with "ovn.k8s.org/mtu-too-small", blocking the procedure itself, or causes network disruption until the procedure is over.

This was not a problem for the CI job, because it does not follow the migration procedure as documented, in order to save the limited time available to run CI jobs. The CI merges two steps of the procedure into one, so there is never a reboot while the procedure is in progress, which hides this issue.

QE probably did not detect this either, for the same reason as CI.

Description of problem:

Creating the install-config file for vSphere IPI against 4.12.0-0.nightly-2022-09-02-194931 fails because apiVIP and ingressVIP are not in the machine CIDR.

$ ./openshift-install create install-config --dir ipi                
? Platform vsphere
? vCenter xxxxxxxx
? Username xxxxxxxx
? Password [? for help] ********************
INFO Connecting to xxxxxxxx
INFO Defaulting to only available datacenter: SDDC-Datacenter 
INFO Defaulting to only available cluster: Cluster-1 
INFO Defaulting to only available datastore: WorkloadDatastore 
? Network qe-segment
? Virtual IP Address for API 172.31.248.137
? Virtual IP Address for Ingress 172.31.248.141
? Base Domain qe.devcluster.openshift.com 
? Cluster Name jimavmc       
? Pull Secret [? for help] ****************************************************************************************************************************************************************************************
FATAL failed to fetch Install Config: failed to generate asset "Install Config": invalid install config: [platform.vsphere.apiVIPs: Invalid value: "172.31.248.137": IP expected to be in one of the machine networks: 10.0.0.0/16, platform.vsphere.ingressVIPs: Invalid value: "172.31.248.141": IP expected to be in one of the machine networks: 10.0.0.0/16] 

Because the user cannot define the CIDR for machineNetwork when creating the install-config file interactively, the default value 10.0.0.0/16 is used, so creating the install-config fails when the apiVIP and ingressVIP entered are outside the default machineNetwork.
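
The check that produces the error can be illustrated with Python's standard `ipaddress` module. This is a sketch of the membership test only, not the installer's Go validation; `vips_outside_machine_networks` is a hypothetical name.

```python
import ipaddress


def vips_outside_machine_networks(vips, machine_cidrs):
    """Return the VIPs that are not contained in any machine network CIDR."""
    networks = [ipaddress.ip_network(cidr) for cidr in machine_cidrs]
    return [
        vip
        for vip in vips
        if not any(ipaddress.ip_address(vip) in network for network in networks)
    ]


# Values from this report: both VIPs fall outside the default 10.0.0.0/16.
print(vips_outside_machine_networks(
    ["172.31.248.137", "172.31.248.141"], ["10.0.0.0/16"]
))
```

With the default machine network both VIPs are reported, which matches the FATAL message above. With a machine network that actually covers the VIPs, the list would be empty.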

Error is thrown from https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L655-L666, seems new function introduced from PR https://github.com/openshift/installer/pull/5798

The issue should also impact Nutanix platform.
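
The check that rejects the VIPs can be illustrated with a minimal sketch. This is not the installer's Go code; the helper name `vips_outside_machine_networks` is hypothetical, and only the addresses and the 10.0.0.0/16 default come from the report above.

```python
import ipaddress


def vips_outside_machine_networks(vips, machine_networks):
    """Return the VIPs that are not contained in any machine network CIDR."""
    networks = [ipaddress.ip_network(cidr) for cidr in machine_networks]
    return [
        vip
        for vip in vips
        if not any(ipaddress.ip_address(vip) in net for net in networks)
    ]


# Values from the report: the default machine network and the two VIPs entered.
default_networks = ["10.0.0.0/16"]
rejected = vips_outside_machine_networks(
    ["172.31.248.137", "172.31.248.141"], default_networks
)
print(rejected)
```

With only the default machine network available, both VIPs are reported as outside it, which matches the FATAL message shown above.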
 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-02-194931

How reproducible:

Always

Steps to Reproduce:

1. Create the install-config.yaml file by running "./openshift-install create install-config --dir ipi"
2. The command fails with the error above
3.

Actual results:

Creating the install-config.yaml file fails.

Expected results:

The install-config.yaml file is created successfully.

Additional info:

 

As a user of the HyperShift CLI, I would like to be able to set the NodePool UpgradeType through a flag when either creating a new cluster or creating a new NodePool.


DoD:

  • A flag has been added to the create new cluster command allowing the NodePool UpgradeType to be set to either Replace or InPlace
  • A flag has been added to the create new NodePool command allowing the NodePool UpgradeType to be set to either Replace or InPlace
  • If either flag is not set, the default will be Replace as that is the current default
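
The flag behaviour described in the DoD can be sketched as follows. This is an illustration only, not the HyperShift CLI (which is written in Go); the flag name `--node-upgrade-type` and the parser name are assumptions.

```python
import argparse

# The two UpgradeType values named in the DoD.
UPGRADE_TYPES = ("Replace", "InPlace")


def build_parser():
    """Build a parser with a single, hypothetically named upgrade-type flag."""
    parser = argparse.ArgumentParser(prog="create-nodepool-sketch")
    parser.add_argument(
        "--node-upgrade-type",  # assumed flag name, for illustration only
        choices=UPGRADE_TYPES,
        default="Replace",  # the current default, per the DoD
        help="NodePool UpgradeType",
    )
    return parser


default_args = build_parser().parse_args([])
inplace_args = build_parser().parse_args(["--node-upgrade-type", "InPlace"])
print(default_args.node_upgrade_type, inplace_args.node_upgrade_type)
```

Restricting the flag to the two allowed values means any other input is rejected at parse time, before any cluster or NodePool is created.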

This is a clone of issue OCPBUGS-10478. The following is the description of the original issue:

Description of problem:

The VM serial log is not collected by 'openshift-install gather bootstrap'.

Version-Release number of selected component (if applicable):

 4.13.0-0.nightly-2023-03-14-053612

How reproducible:

Always

Steps to Reproduce:

1. Install a private cluster with IPI. Once the bootstrap node has booted, and before it is terminated:
2. SSH to the bastion, then try to gather the bootstrap log:
$ openshift-install gather bootstrap --key openshift-qe.pem --bootstrap 10.0.0.5 --master 10.0.0.7 --loglevel debug
3.

Actual results:

The VM serial logs are not collected. The output shows:
…
DEBUG Gather remote logs                           
DEBUG Collecting info from 10.0.0.6                
DEBUG scp: ./installer-masters-gather.sh: Permission denied 
DEBUG Warning: Permanently added '10.0.0.6' (ECDSA) to the list of known hosts.
…
DEBUG Waiting for logs ...
DEBUG Log bundle written to /var/home/core/log-bundle-20230317033401.tar.gz 
WARNING Unable to stat /var/home/core/serial-log-bundle-20230317033401.tar.gz, skipping 
INFO Bootstrap gather logs captured here "/var/home/core/log-bundle-20230317033401.tar.gz"

Expected results:

The VM serial log is collected, and the output does not contain the "WARNING Unable to stat…" message above.

Additional info:

A local IPI install has the same issue:
INFO Pulling VM console logs                     
DEBUG attemping to download                       
…                       
INFO Failed to gather VM console logs: unable to download file: /root/temp/4.13.0-0.nightly-2023-03-14-053612/ipi/serial-log-bundle-20230317042338

Description of problem:

When an ImageManifestVuln is manually created, some properties referenced on the ImageManifestVuln list and details pages are undefined, which causes those components to throw a runtime error.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Install Red Hat Quay Container Security operator
2. From the operator details page, Under the "Provided APIs" section, click "Create instance" in the ImageManifestVuln card.
3. Fill in the creation form with any "name" and "image" value and click "Create"

Actual results:

A runtime error is encountered on the ImageManifestVuln list and details pages for the created resource.

Expected results:

The ImageManifestVuln list and details pages should render the newly created resource without runtime errors.

Additional info:

 

Description of the problem:

In recent prow jobs, git fails to apply configuration on git repositories whose files do not have the "right" user ownership. This happens because those files are owned by root, whereas the actual runtime user is usually a different one.

Examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-service/4770/pull-ci-openshift-assisted-service-master-edge-unit-test/1605545643449782272
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-service/4788/pull-ci-openshift-assisted-service-release-ocm-2.6-verify-generated-code/1605544917893910528

The phenomenon seen is either:

fatal: detected dubious ownership in repository at '/assisted-service' 

or:

fatal: not in a git directory 

How reproducible:

100%
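
The ownership mismatch behind the "dubious ownership" message can be checked with a minimal sketch like the one below. It only approximates the idea (git compares the repository owner with the user running it, unless the path is listed in safe.directory); the helper name is hypothetical and the check is Unix-only.

```python
import os
import tempfile


def owned_by_current_user(path):
    """True when the owner uid of path matches the effective uid of this process."""
    return os.stat(path).st_uid == os.geteuid()


# A directory created by this process is owned by the current user, so the
# check passes; a root-owned checkout inspected by another user would not.
with tempfile.TemporaryDirectory() as repo_dir:
    check_result = owned_by_current_user(repo_dir)

print(check_result)
```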

Description of problem:

Custom manifest files can be placed in the /openshift folder so that they will be applied during cluster installation.
However, if a file contains more than one manifest, all but the first are ignored.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create the following custom manifest file in the /openshift folder:

```
apiVersion: v1
kind: ConfigMap
metadata:  
  name: agent-test  
  namespace: openshift-config
data:  
  value: agent-test
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-test-2
  namespace: openshift-config
data:
  value: agent-test-2
```
2. Create the agent ISO image and deploy a cluster

Actual results:

ConfigMap agent-test-2 does not exist in the openshift-config namespace

Expected results:

ConfigMap agent-test-2 must exist in the openshift-config namespace
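
The expected behaviour, where every document in a manifest file is handled rather than only the first, can be sketched as below. This is a simplified illustration using only the standard library, not the installer's implementation: it treats any line consisting solely of `---` as a separator and does not parse the YAML itself. The function name is hypothetical.

```python
def split_yaml_documents(text):
    """Split a multi-document YAML string on '---' separator lines."""
    documents, current = [], []
    for line in text.splitlines():
        if line.rstrip() == "---":
            # Close the current document, skipping empty ones.
            if any(part.strip() for part in current):
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        documents.append("\n".join(current))
    return documents


manifest_file = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-test
  namespace: openshift-config
data:
  value: agent-test
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-test-2
  namespace: openshift-config
data:
  value: agent-test-2
"""
docs = split_yaml_documents(manifest_file)
print(len(docs))
```

Iterating over all entries of `docs`, instead of reading only the first, is what makes the second ConfigMap visible.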

Additional info:

 

This is a clone of issue OCPBUGS-9964. The following is the description of the original issue:

Description of problem:

An EgressIP cannot be assigned to a node of a HyperShift hosted cluster.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-09-162945

How reproducible:

100%

Steps to Reproduce:

1. setup hypershift env


2. Label the egress IP nodes on the hosted cluster
% oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-175.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-129-244.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-141-41.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-142-54.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae

% oc label node/ip-10-0-129-175.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-175.us-east-2.compute.internal labeled
% oc label node/ip-10-0-129-244.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-244.us-east-2.compute.internal labeled
% oc label node/ip-10-0-141-41.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-141-41.us-east-2.compute.internal labeled
% oc label node/ip-10-0-142-54.us-east-2.compute.internal  k8s.ovn.org/egress-assignable=""
node/ip-10-0-142-54.us-east-2.compute.internal labeled


3. create egressip
% cat egressip.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs: [ "10.0.129.180" ]
  namespaceSelector:
    matchLabels:
      env: ovn-tests
% oc apply -f egressip.yaml 
egressip.k8s.ovn.org/egressip-1 created


4. check egressip assignment
             

Actual results:

The EgressIP is not assigned to any node:
% oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   10.0.129.180

Expected results:

The EgressIP can be assigned to one of the hosted cluster nodes.

Additional info:

 

Description of problem:

On a GCP private cluster, deleting the controlplanemachineset hangs forever. The logs report "Required value: targetPools is required for control plane machines".

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-27-165107

How reproducible:

Always

Steps to Reproduce:

1. Delete controlplanemachineset
$ oc delete controlplanemachineset cluster -n openshift-machine-api     
controlplanemachineset.machine.openshift.io "cluster" deleted
^
2. Check log
E0129 04:37:31.940682       1 controller.go:326]  "msg"="Reconciler error" "error"="error reconciling control plane machine set: failed to update control plane machine set: admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.targetPools: Required value: targetPools is required for control plane machines" "controller"="controlplanemachineset" "reconcileID"="6acec245-1d2d-4643-b45c-69517d8ce93e" 
3.

Actual results:

The CPMS cannot be deleted.

Expected results:

The CPMS is deleted successfully.

Additional info:

There is no targetPools field in the master machine YAML file for a private cluster, so it seems there is no need to check targetPools for private clusters.
template: versioned-installer-restricted_network-private_cluster

Description of problem:

The current version of openshift/cluster-ingress-operator vendors Kubernetes 1.25 packages. OpenShift 4.13 is based on Kubernetes 1.26.   

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.13/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.25

Expected results:

Kubernetes packages are at version v0.26.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.
Also, Gateway-API is dependent on v0.26, so we are required to bump in order to support our Enhanced Dev Preview activities.
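
A quick way to spot stale k8s.io modules in a go.mod file is sketched below. The helper name and the sample go.mod content are illustrative assumptions (the versions shown are not copied from the repository), and the regular expression only handles simple `module version` lines.

```python
import re

# Matches lines such as "k8s.io/api v0.25.2", optionally prefixed by "require".
REQUIRE_LINE = re.compile(
    r"^\s*(?:require\s+)?(k8s\.io/[\w\-/.]+)\s+v(\d+)\.(\d+)\.(\d+)"
)


def outdated_k8s_modules(go_mod_text, min_minor=26):
    """Return k8s.io modules pinned at v0.<minor> with minor below min_minor."""
    outdated = {}
    for line in go_mod_text.splitlines():
        match = REQUIRE_LINE.match(line)
        if match and int(match.group(2)) == 0 and int(match.group(3)) < min_minor:
            outdated[match.group(1)] = "v{}.{}.{}".format(*match.group(2, 3, 4))
    return outdated


# Illustrative go.mod fragment, not the actual file contents.
sample_go_mod = """\
require (
	k8s.io/api v0.25.2
	k8s.io/apimachinery v0.25.2
	k8s.io/client-go v0.26.0
)
"""
stale = outdated_k8s_modules(sample_go_mod)
print(stale)
```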

This is a clone of issue OCPBUGS-12729. The following is the description of the original issue:

Description of problem:

This came out of the investigation of https://issues.redhat.com/browse/OCPBUGS-11691 . The nested node configs used to support dual stack VIPs do not correctly respect the EnableUnicast setting. This is causing issues on EUS upgrades where the unicast migration cannot happen until all nodes are on 4.12. This is blocking both the workaround and the eventual proper fix.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Deploy 4.11 with unicast explicitly disabled (via MCO patch)
2. Write /etc/keepalived/monitor-user.conf to suppress unicast migration
3. Upgrade to 4.12

Actual results:

Nodes come up in unicast mode

Expected results:

Nodes remain in multicast mode until monitor-user.conf is removed

Additional info:

 

In 4.12.0-rc.0 some API-server components declare flowcontrol/v1beta1 release manifests:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.12.0-rc.0-x86_64
$ grep -r flowcontrol.apiserver.k8s.io manifests
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_etcd-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-controller-manager-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1

The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. This ticket tracks removing those manifests, or replacing them with a more modern resource type, or some such. Definition of done is that new 4.13 (and with backports, 4.12) nightlies no longer include flowcontrol.apiserver.k8s.io/v1beta1 manifests.
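
The definition of done can be checked against an extracted manifests directory with a sketch along these lines, which mirrors the `grep -r` above. The function name is hypothetical, and the demo runs on a throwaway directory with made-up file names rather than a real release payload.

```python
import os
import tempfile

DEPRECATED_API = "flowcontrol.apiserver.k8s.io/v1beta1"


def find_deprecated_manifests(root, api_version=DEPRECATED_API):
    """Map each file (relative to root) to its count of matching apiVersion lines."""
    needle = "apiVersion: " + api_version
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                count = sum(1 for line in handle if line.strip() == needle)
            if count:
                hits[os.path.relpath(path, root)] = count
    return hits


# Demo on a temporary directory; the file names here are invented.
with tempfile.TemporaryDirectory() as manifests_dir:
    with open(os.path.join(manifests_dir, "old_flowschema.yaml"), "w") as handle:
        handle.write(
            "apiVersion: flowcontrol.apiserver.k8s.io/v1beta1\n---\n"
            "apiVersion: flowcontrol.apiserver.k8s.io/v1beta1\n"
        )
    with open(os.path.join(manifests_dir, "configmap.yaml"), "w") as handle:
        handle.write("apiVersion: v1\nkind: ConfigMap\n")
    demo_hits = find_deprecated_manifests(manifests_dir)

print(demo_hits)
```

An empty result for a nightly's extracted manifests would correspond to the definition of done stated above.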

This can be noticed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27560/pull-ci-openshift-origin-master-e2e-gcp-ovn/1593697975584952320/artifacts/e2e-gcp-ovn/openshift-e2e-test/build-log.txt:

[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:159
flake: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4

This is required to unblock https://github.com/openshift/origin/pull/27561

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/862

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/coredns/pull/83

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-12780. The following is the description of the original issue:

Description of problem:

2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health [-] Component KuryrPortHandler is dead. Last caught exception below: openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 169, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self.k8s.get(f"{constants.K8S_API_NAMESPACES}"
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 121, in get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._raise_from_response(response)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 99, in _raise_from_response
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     raise exc.K8sResourceNotFound(response.text)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \\"mygov-tuo-microservice-dev2-59fffbc58c-l5b79\\" not found","reason":"NotFound","details":{"name":"mygov-tuo-microservice-dev2-59fffbc58c-l5b79","kind":"pods"},"code":404}\n'
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health During handling of the above exception, another exception occurred:
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 38, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 85, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, retry_info=info, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 98, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self.on_finalize(obj, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 184, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self._mock_cleanup_pod(kuryrport_crd)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 160, in _mock_cleanup_pod
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     host_ip = utils.get_parent_port_ip(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/utils.py", line 830, in get_parent_port_ip
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     parent_port = os_net.get_port(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 1987, in get_port
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return self._get(_port.Port, port)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 48, in check
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return method(self, expected, actual, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 513, in _get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     resource_type=resource_type.__name__, value=value))
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1472, in fetch
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_base.py", line 26, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path, params=params)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1156, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     "Request requires an ID but none was found")
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.918 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping
2023-04-20 02:08:09.919 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworks'
2023-04-20 02:08:10.026 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/machine.openshift.io/v1beta1/machines'
2023-04-20 02:08:10.152 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/pods'
2023-04-20 02:08:10.174 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/networking.k8s.io/v1/networkpolicies'
2023-04-20 02:08:10.857 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/namespaces'
2023-04-20 02:08:10.877 1 WARNING kuryr_kubernetes.controller.drivers.utils [-] Namespace dev-health-air-ids not yet ready: kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnetworks.openstack.org \\"dev-health-air-ids\\" not found","reason":"NotFound","details":{"name":"dev-health-air-ids","group":"openstack.org","kind":"kuryrnetworks"},"code":404}\n'
2023-04-20 02:08:11.024 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/services'
2023-04-20 02:08:11.078 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/endpoints'
2023-04-20 02:08:11.170 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrports'
2023-04-20 02:08:11.344 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworkpolicies'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrloadbalancers'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] No remaining active watchers, Exiting...
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Create a pod.
2. Stop kuryr-controller.
3. Delete the pod and the finalizer on it.
4. Delete pod's subport.
5. Start the controller.

Actual results:

Crash

Expected results:

Port cleaned up normally.
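
The traceback above ends in a port lookup made without an ID. One way to tolerate an already deleted subport is sketched below. This is not the actual kuryr-kubernetes fix: `InvalidRequest` and `get_parent_port_ip` here are local stand-ins for the real classes and helpers, and `host_ip_for_cleanup` is a hypothetical name.

```python
class InvalidRequest(Exception):
    """Local stand-in for openstack.exceptions.InvalidRequest."""


def get_parent_port_ip(port_id):
    """Stand-in lookup that fails like the traceback when no ID is given."""
    if not port_id:
        raise InvalidRequest("Request requires an ID but none was found")
    return "10.0.0.10"  # placeholder address, for illustration only


def host_ip_for_cleanup(port_id):
    """Return the host IP, or None when the port is already gone."""
    if not port_id:
        # Nothing to look up: the subport was deleted while the controller was down.
        return None
    try:
        return get_parent_port_ip(port_id)
    except InvalidRequest:
        return None


print(host_ip_for_cleanup(None), host_ip_for_cleanup("port-1"))
```

Returning None lets the caller continue with finalizer cleanup instead of letting the exception mark the handler as dead.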

Additional info:


Description of problem:

The image provided as an example for deploying a StatefulSet from the UI is amd64 only, so on other architectures the pods of the example StatefulSet end up in CrashLoopBackOff.

Version-Release number of selected component (if applicable):

4.11 4.12 (also having other issues currently), needs verification on 4.10

How reproducible:

Always

Steps to Reproduce:

1. Go to the Administrator Console
2. Open the Workloads Pane
3. Click StatefulSets
4. Create a new StatefulSet with the provided example image

Actual results:

Exec format error ## (the image, 'gcr.io/google_containers/nginx-slim:0.8', is amd64 only)

Expected results:

The StatefulSet is deployed, and the pods are ready

Additional info:

Should we set it to be image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest or another sample?

Please review the following PR: https://github.com/openshift/kubernetes/pull/1435

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/1656

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-8444. The following is the description of the original issue:

Description of problem:

In OCP 4.13, we are moving to RHEL9 but explicitly setting cgroups to v1 by default for all clusters. The original PR that did so generates an extra MachineConfig, which may cause unnecessary issues in the future.

We should set it via the base configs, as also noted in https://issues.redhat.com/browse/OCPNODE-1495

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install 4.13
2. 
3.

Actual results:

Observe that 97-generated-kubelet MCs exist for master/worker

Expected results:

No extra MC is needed

Additional info:

 

Description of problem:

In cluster-ingress-operator's ensureNodePortService, when there is a conflicting Ingress Controller loadbalancer, it states:

a conflicting load balancer service exists that is not owned by the ingress controller: openshift-ingress/router-loadbalancer

Technically that is the service name, not the ingress controller name. The IC name is openshift-ingress/loadbalancer in this example.

So the error message wording is incorrect.

Version-Release number of selected component (if applicable):

4.13
4.12
4.11

How reproducible:

Easy

Steps to Reproduce:

# Create a service that will conflict with a new ingress controller
oc create svc nodeport router-nodeport-test --tcp=80 -n openshift-ingress
DOMAIN=$(oc get ingresses.config/cluster -o jsonpath={.spec.domain})
oc apply -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: test
  namespace: openshift-ingress-operator
spec:
  domain: reproducer.$DOMAIN
  endpointPublishingStrategy:
    type: NodePortService
  replicas: 1
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
EOF

# Look for log message that is incorrect
oc logs -n openshift-ingress-operator $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}') -c ingress-operator | grep conflicting 

# The results provide service name, not ingress controller name
# "error": "a conflicting nodeport service exists that is not owned by the ingress controller: openshift-ingress/router-test"  

Actual results:

"error": "a conflicting nodeport service exists that is not owned by the ingress controller: openshift-ingress/router-test"

Expected results:

"error": "a conflicting nodeport service exists that is not owned by the ingress controller: openshift-ingress/router-nodeport-test"

Additional info:

 

Description of problem:

The default upgrade channel on an OCP 4.13 cluster is stable-4.12. It should be stable-4.13.

# oc adm upgrade 
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.13.0-ec.3 not found in the "stable-4.12" channel

Cluster version is 4.13.0-ec.3

Upstream is unset, so the cluster will use an appropriate default.

Channel: stable-4.12

Version-Release number of selected component (if applicable):

4.13.0-ec.3

How reproducible:

1/1

Steps to Reproduce:

1. Install a 4.13 cluster
2. Check clusterversion.spec.channel

Actual results:

The default channel on 4.13 cluster is stable-4.12

Expected results:

Expect the default channel to be stable-4.13
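
As a rough illustration of the expected behavior, the default channel can be derived from the cluster's own major.minor version. This is only a minimal Python sketch: the `default_channel` helper is hypothetical and is not how the installer or the cluster-version operator actually sets the default.

```python
def default_channel(cluster_version: str, prefix: str = "stable") -> str:
    """Derive a channel name such as 'stable-4.13' from a version string.

    Hypothetical helper, for illustration only.
    """
    # Keep only the major and minor components, e.g. "4" and "13".
    major, minor = cluster_version.split(".")[:2]
    return f"{prefix}-{major}.{minor}"


print(default_channel("4.13.0-ec.3"))
```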

Additional info:

 

This is a clone of issue OCPBUGS-8523. The following is the description of the original issue:

Description of problem:

Due to an rpm-ostree regression (OKD-63), MCO was copying /var/lib/kubelet/config.json into /run/ostree/auth.json on FCOS and SCOS. This breaks the Assisted Installer flow, which starts with a Live ISO and doesn't have /var/lib/kubelet/config.json.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-8676. The following is the description of the original issue:

When implementing support for IPv6-primary dual-stack clusters, we have extended the available IP families to

const (
	IPFamiliesIPv4                 IPFamiliesType = "IPv4"
	IPFamiliesIPv6                 IPFamiliesType = "IPv6"
	IPFamiliesDualStack            IPFamiliesType = "DualStack"
	IPFamiliesDualStackIPv6Primary IPFamiliesType = "DualStackIPv6Primary"
)

At the same time definitions of kubelet.service systemd unit still contain the code

{{- if eq .IPFamilies "DualStack"}}
        --node-ip=${KUBELET_NODE_IPS} \
{{- else}}
        --node-ip=${KUBELET_NODE_IP} \
{{- end}}

which only matches the "old" dual-stack family. Because of this, an IPv6-primary dual-stack cluster renders the node-ip parameter with only 1 IP address instead of the 2 required for dual-stack.
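
The intended selection logic can be sketched as follows. This is a minimal Python illustration rather than the Go template used in the unit file; the `node_ip_flag` function and the `DUAL_STACK_FAMILIES` set are hypothetical names, and the sketch assumes both dual-stack families should use the multi-address variable.

```python
# Both dual-stack families need the multi-address variable (assumption
# drawn from the description above).
DUAL_STACK_FAMILIES = {"DualStack", "DualStackIPv6Primary"}


def node_ip_flag(ip_families: str) -> str:
    """Return the kubelet --node-ip argument for the given IP family."""
    if ip_families in DUAL_STACK_FAMILIES:
        return "--node-ip=${KUBELET_NODE_IPS}"
    return "--node-ip=${KUBELET_NODE_IP}"


for family in ("IPv4", "IPv6", "DualStack", "DualStackIPv6Primary"):
    print(family, node_ip_flag(family))
```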

This is a clone of issue OCPBUGS-8381. The following is the description of the original issue:

Description of problem:

On a hypershift cluster that has public certs for OAuth configured, the console reports an x509 certificate error when attempting to display a token.

Version-Release number of selected component (if applicable):

4.12.z

How reproducible:

always

Steps to Reproduce:

1. Create a hosted cluster configured with a letsencrypt certificate for the oauth endpoint.
2. Go to the console of the hosted cluster. Click on the user icon and get token.

Actual results:

The console displays an oauth cert error

Expected results:

The token displays

Additional info:

The hcco reconciles the oauth cert into the console namespace. However, it is only reconciling the self-signed one and not the one that was configured through .spec.configuration.apiserver of the hostedcluster. It needs to detect the actual cert used for oauth and send that one.

 

Description of the problem:

In Staging (UI 2.13.2 + BE v2.13.4), when creating a new SNO cluster with 4.12 (4.12.0-rc.7) and selecting the arm64 CPU architecture, OLM operators should be disabled, but LVMS is enabled.

How reproducible:

100%

Steps to reproduce:

1. create new cluster.

2. Select SNO and OCP 4.12

3. select arm64

4. In operators page, LVMS is enabled

Actual results:

 

Expected results:

Description of the problem:
Currently, NICs are filtered from connectivity checks if they don't have any addresses. For ARPING checks, any NIC that doesn't have a valid IPv4 address should not be used. For the ARPING check, a link-local IPv4 address is not considered valid.
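
The filtering rule described above can be sketched with Python's standard `ipaddress` module. This is a minimal illustration, not the agent's actual code; the `usable_for_arping` function and its input format (a list of address strings, optionally with a prefix length) are assumptions.

```python
import ipaddress


def usable_for_arping(addresses):
    """Return True if the NIC has at least one IPv4 address that is not link-local."""
    for addr in addresses:
        try:
            ip = ipaddress.ip_interface(addr).ip
        except ValueError:
            # Skip strings that are not valid IP addresses.
            continue
        if ip.version == 4 and not ip.is_link_local:
            return True
    return False


print(usable_for_arping(["192.168.1.5/24", "fe80::1/64"]))
print(usable_for_arping(["169.254.10.2/16"]))
```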
 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

 Sometimes the controller can fail to stream the tar.gz of must-gather, and no files will be sent because we close the write pipe and do not try to stream another tar entry.

This is a clone of issue OCPBUGS-10570. The following is the description of the original issue:

What happens:

When deploying OpenShift 4.13 with Failure Domains, the PrimarySubnet in the ProviderSpec of the Machine is set to the MachinesSubnet defined in install-config.yaml.

 

What is expected:

Machines in failure domains with a control-plane port target should not use the MachinesSubnet as a primary subnet in the provider spec. It should be the ID of the subnet that is actually used for the control plane on that domain.

 

How to reproduce:

install-config.yaml:

apiVersion: v1
baseDomain: shiftstack.com
compute:
- name: worker
  platform:
    openstack:
      type: m1.xlarge
  replicas: 1
controlPlane:
  name: master
  platform:
    openstack:
      type: m1.xlarge
      failureDomains:
      - portTargets:
        - id: control-plane
          network:
            id: fb6f8fea-5063-4053-81b3-6628125ed598
          fixedIPs:
          - subnet:
              id: b02175dd-95c6-4025-8ff3-6cf6797e5f86
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
      - portTargets:
        - id: control-plane
          network:
            id: 9a5452a8-41d9-474c-813f-59b6c34194b6
          fixedIPs:
          - subnet:
              id: 5fe5b54a-217c-439d-b8eb-441a03f7636d
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
      - portTargets:
        - id: control-plane
          network:
            id: 3ed980a6-6f8e-42d3-8500-15f18998c434
          fixedIPs:
          - subnet:
              id: a7d57db6-f896-475f-bdca-c3464933ec02
        computeAvailabilityZone: nova-az1
        storageAvailabilityZone: cinder-az1
  replicas: 3
metadata:
  name: mycluster
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.10.0/24
  - cidr: 192.168.20.0/24
  - cidr: 192.168.30.0/24
  - cidr: 192.168.72.0/24
  - cidr: 192.168.100.0/24
platform:
  openstack:
    cloud: foch_openshift
    machinesSubnet: b02175dd-95c6-4025-8ff3-6cf6797e5f86
    apiVIPs:
    - 192.168.100.240
    ingressVIPs:
    - 192.168.100.250
    loadBalancer:
      type: UserManaged
featureSet: TechPreviewNoUpgrade

Machine spec:

  Provider Spec:
    Value:
      API Version:  machine.openshift.io/v1alpha1
      Cloud Name:   openstack
      Clouds Secret:
        Name:       openstack-cloud-credentials
        Namespace:  openshift-machine-api
      Flavor:       m1.xlarge
      Image:        foch-bgp-2fnjz-rhcos
      Kind:         OpenstackProviderSpec
      Metadata:
        Creation Timestamp:  <nil>
      Networks:
        Filter:
        Subnets:
          Filter:
            Id:        5fe5b54a-217c-439d-b8eb-441a03f7636d
        Uuid:          9a5452a8-41d9-474c-813f-59b6c34194b6
      Primary Subnet:  b02175dd-95c6-4025-8ff3-6cf6797e5f86
      Security Groups:
        Filter:
        Name:  foch-bgp-2fnjz-master
        Filter:
        Uuid:             1b142123-c085-4e14-b03a-cdf5ef028d91
      Server Group Name:  foch-bgp-2fnjz-master
      Server Metadata:
        Name:                  foch-bgp-2fnjz-master
        Openshift Cluster ID:  foch-bgp-2fnjz
      Tags:
        openshiftClusterID=foch-bgp-2fnjz
      Trunk:  true
      User Data Secret:
        Name:  master-user-data
Status:
  Addresses:
    Address:  192.168.20.20
    Type:     InternalIP
    Address:  foch-bgp-2fnjz-master-1
    Type:     Hostname
    Address:  foch-bgp-2fnjz-master-1
    Type:     InternalDNS 

The machine is connected to the right subnet, but has a wrong PrimarySubnet configured.
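
The expected selection can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `primary_subnet` function is hypothetical, the failure domain is represented as a plain dict mirroring the install-config.yaml above, and taking the first fixed IP of the control-plane port target is an assumption rather than confirmed installer behavior.

```python
def primary_subnet(failure_domain: dict, machines_subnet: str) -> str:
    """Pick the primary subnet for a machine in the given failure domain.

    Prefer the subnet of the control-plane port target; fall back to the
    platform-level machinesSubnet when no such target is defined.
    """
    for target in failure_domain.get("portTargets", []):
        if target.get("id") == "control-plane":
            fixed_ips = target.get("fixedIPs", [])
            if fixed_ips:
                return fixed_ips[0]["subnet"]["id"]
    return machines_subnet


# Second failure domain from the install-config.yaml above.
domain = {
    "portTargets": [
        {
            "id": "control-plane",
            "network": {"id": "9a5452a8-41d9-474c-813f-59b6c34194b6"},
            "fixedIPs": [{"subnet": {"id": "5fe5b54a-217c-439d-b8eb-441a03f7636d"}}],
        }
    ],
}
print(primary_subnet(domain, "b02175dd-95c6-4025-8ff3-6cf6797e5f86"))
```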

Description of problem:

When trying to add a Cisco UCS Rackmount server as a `baremetalhost` CR the following error comes up in the metal3 container log in the openshift-machine-api namespace.

'TransferProtocolType' property which is mandatory to complete the action is missing in the request body

Full log entry:

{"level":"info","ts":1677155695.061805,"logger":"provisioner.ironic","msg":"current provision state","host":"ucs-rackmounts~ocp-test-1","lastError":"Deploy step deploy.deploy failed with BadRequestError: HTTP POST https://10.5.4.78/redfish/v1/Managers/CIMC/VirtualMedia/0/Actions/VirtualMedia.InsertMedia returned code 400. Base.1.4.0.GeneralError: 'TransferProtocolType' property which is mandatory to complete the action is missing in the request body. Extended information: [{'@odata.type': 'Message.v1_0_6.Message', 'MessageId': 'Base.1.4.0.GeneralError', 'Message': "'TransferProtocolType' property which is mandatory to complete the action is missing in the request body.", 'MessageArgs': [], 'Severity': 'Critical'}].","current":"deploy failed","target":"active"}

Version-Release number of selected component (if applicable):

    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:30328143480d6598d0b52d41a6b755bb0f4dfe04c4b7aa7aefd02ea793a2c52b
    imagePullPolicy: IfNotPresent
    name: metal3-ironic

How reproducible:

Adding a Cisco UCS Rackmount with Redfish enabled as a baremetalhost to metal3

Steps to Reproduce:

1. The address to use: redfish-virtualmedia://10.5.4.78/redfish/v1/Systems/WZP22100SBV

Actual results:

[baelen@baelen-jumphost mce]$ oc get baremetalhosts.metal3.io  -n ucs-rackmounts  ocp-test-1
NAME         STATE          CONSUMER   ONLINE   ERROR                AGE
ocp-test-1   provisioning              true     provisioning error   23h

Expected results:

For the provisioning to be successful.

Additional info:

 

This is a clone of issue OCPBUGS-14125. The following is the description of the original issue:

Description of problem:

Since registry.centos.org is closed, tests relying on this registry in the e2e-agnostic-ovn-cmd job are failing.

Version-Release number of selected component (if applicable):

all

How reproducible:

Trigger e2e-agnostic-ovn-cmd job

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Test located at github.com/openshift/origin/test/extended/apiserver/api_requests.go:449 is failing '[It] operators should not create watch channels very often [Suite:openshift/conformance/parallel]':

'"Operator \"cluster-monitoring-operator\" produces more watch requests than expected: watchrequestcount=115, upperbound=112, ratio=1.0267857142857142",'


Version-Release number of selected component (if applicable):

4.13

How reproducible:

Found in 0.02% of runs (0.11% of failures) across 23061 total runs and 2252 jobs (15.65% failed) in 11.807s [1]

[1] https://search.ci.openshift.org/?search=cluster-monitoring-operator%5C%5C%22+produces+more+watch+requests+than+expected&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:

1. Unknown
2.
3.

Actual results:

CMO operator exceeds watch request limit

Expected results:

CMO operator doesn't exceed watch request limit

Additional info:

 

This is a clone of issue OCPBUGS-5940. The following is the description of the original issue:

Description of problem:

Tests Failed. logs in as 'test' user via htpasswd identity provider: Auth test logs in as 'test' user via htpasswd identity provider

 CI-search
Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

During the upgrade of build02, a worker node was unavailable.  One of the monitoring operator's daemonsets failed to fully rollout as a result (one of the pods never started running, since the node wasn't available).  This meant the monitoring operator never achieved the new level, thereby blocking the upgrade.

see:
https://coreos.slack.com/archives/C03G7REB4JV/p1663698229312909?thread_ts=1663676443.155839&cid=C03G7REB4JV

and the full upgrade post mortem:
https://docs.google.com/document/d/1N5ulciLzGHq09ouEWObGXz7iDmPmhdM6walZur1ZRbs/edit#

 

Version-Release number of selected component (if applicable):

4.12 ec to ec upgrade

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster with an unavailable node (shut down the node in the cloud provider; the Machine API, at least for now (this is being addressed), will end up reporting the node as unavailable but will not remove or restart it)
2. Upgrade the cluster
3. See that the upgrade gets stuck on the monitoring operator

Actual results:

upgrade gets stuck until the unavailable node is deleted or fixed

Expected results:

upgrade completes

Additional info:

Miciah Masters had some suggestions on how the operator can better determine whether it has achieved the new level in the face of these sorts of situations. The DNS operator appears to handle this properly (it also runs a daemonset with pods expected on all nodes in the cluster).

DoD:

Let the HO export a metric with its own version so that, as an SRE, I can easily understand which version is running where by looking at a Grafana dashboard.

Since we no longer support installing 4.6, there's no need to check the version and networkType, and no need to patch etcd and the number of control plane nodes.

This is a clone of issue OCPBUGS-8713. The following is the description of the original issue:

Description of problem:

Currently, the hypershift namespace servicemonitor has the api group rhobs.com, which results in the hypershift operator metrics not being scraped by CMO. 
Additionally, the hypershift deployment comes with recording rules that require metrics from the CMO. 

The servicemonitor's apigroup needs to be changed back to `coreos.com` for the following reasons:
- future observability of the hypershift operator by the CMO, as we do for other operators
- functional recording rules (https://github.com/openshift/hypershift/blob/main/cmd/install/assets/recordingrules/hypershift.yaml)

Version-Release number of selected component (if applicable):

4.13.4

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

See slack thread: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1678289562185509?thread_ts=1677690924.397839&cid=C04EUL1DRHC

Description of problem:
Service list has a Location column that shows the used ports of the service. But it shows undefined:80 (port from the resource) when the Service has type: ExternalName or type: LoadBalancer

Version-Release number of selected component (if applicable):
4.6-4.12

How reproducible:
Always

Steps to Reproduce:
Create a Service with type: ExternalName or type: LoadBalancer, for example:

apiVersion: v1
kind: Service
metadata:
  name: external-service
spec:
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  selector:
    app: nodeinfo-d
  type: ExternalName
---
kind: Service
apiVersion: v1
metadata:
  name: loadbalancer-service
spec:
  clusterIP: 10.217.4.147
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort: 31799
  type: LoadBalancer
  selector:
    app: nodeinfo-d

Open Administrator perspective > Networking > Services

Actual results:
Service list shows undefined:80 (port from the resource)

Expected results:
Service list should show just the port or more information for other Service types.

Additional info:
The Service details page uses a switch for .spec.type to show different information. See https://github.com/openshift/console/blob/7b741dd8898454cded10f6572fbebb1c4df1c8f9/frontend/public/components/service.jsx#L97-L125
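
One way to avoid the undefined prefix is to branch on .spec.type, as the details page does. The following Python sketch is only an illustration of that idea, not the console's actual TypeScript/JSX code: the `service_location` function is hypothetical, and the choice to fall back to the bare port when no host is known is an assumption based on the expected results above.

```python
def service_location(service: dict) -> list:
    """Return display strings for a Service's ports without an 'undefined' host."""
    spec = service.get("spec", {})
    ports = [str(p["port"]) for p in spec.get("ports", [])]
    svc_type = spec.get("type", "ClusterIP")

    if svc_type == "ExternalName":
        # ExternalName services have no cluster IP.
        host = spec.get("externalName")
    else:
        host = spec.get("clusterIP")

    if not host:
        # No usable host: show just the port.
        return ports
    return [f"{host}:{port}" for port in ports]


external = {"spec": {"type": "ExternalName", "ports": [{"port": 8000}]}}
print(service_location(external))
```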

Description of problem:

documentationBaseURL still points to 4.12 URL on a 4.13 cluster

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-12-07-193721

How reproducible:

Always

Steps to Reproduce:

1. check documentationBaseURL on a 4.13 cluster
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-07-193721   True        False         37m     Cluster version is 4.13.0-0.nightly-2022-12-07-193721 
$ oc get cm console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.12/

Actual results:

it still points to 4.12 

Expected results:

documentationBaseURL should be updated to  https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/

Additional info:

 

 

Description of problem:

The dependency on library-go from openshift/kubernetes:release-4.13 must be recent enough to include https://github.com/openshift/library-go/commit/d679fe6f824818b04acb36524917c7362de6b81e.

Version-Release number of selected component (if applicable):

4.13

Additional info:

This issue tracks only the necessary dependency bump and no behavior impact.

The failure says failed to generate asset "Image": multiple "disk" artifacts found. When digging around (thanks @prashanths for the help), we discovered that the error is thrown from this function (link: https://github.com/openshift/installer/blame/master/pkg/rhcos/builds.go#L80) now that the rhcos.json file includes the secure execution qemu image (link: https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json#L411). Now that there are two images, we need to choose the correct image for the installer.

Description of the problem:
If the host install progress can't be updated because the current state and the new state are incompatible in some way, the service should return a 4xx error (409 - conflict, perhaps) rather than a 500 error.

We recently saw a spike in 500 errors in metrics while investigating a cluster failure (https://issues.redhat.com/browse/AITRIAGE-4419). It turns out that this was not the service malfunctioning (as we had originally thought), but was a few hosts trying to update their install status from error to rebooting.
 

How reproducible:
100%
 

Steps to reproduce:

1. Have a cluster in error

2. Attempt to update a host from error state to rebooting

Actual results:

Service returns a 500 error.

Expected results:

Service returns a 409 error.
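
The distinction between a client-side conflict and a genuine server fault can be sketched as follows. This is a minimal Python illustration, not the assisted-service implementation (which is written in Go): the transition table, the `InvalidTransitionError` class, and both functions are hypothetical.

```python
from http import HTTPStatus

# Hypothetical transition table, for illustration only.
ALLOWED_TRANSITIONS = {
    "installing": {"rebooting", "error"},
    "rebooting": {"done", "error"},
}


class InvalidTransitionError(Exception):
    """Raised when the requested state change is not allowed."""


def update_install_progress(current: str, new: str) -> str:
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransitionError(f"cannot move host from {current!r} to {new!r}")
    return new


def handle_update(current: str, new: str):
    """Map an incompatible transition to 409 and unexpected failures to 500."""
    try:
        update_install_progress(current, new)
    except InvalidTransitionError as exc:
        return HTTPStatus.CONFLICT, str(exc)
    except Exception as exc:
        return HTTPStatus.INTERNAL_SERVER_ERROR, str(exc)
    return HTTPStatus.OK, ""


print(handle_update("error", "rebooting")[0])
```

With this split, a spike in 500 responses indicates a service malfunction, while hosts attempting an invalid update show up as 409 responses.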

Description of problem:

GCP XPN is in tech preview. There are two features which are affected:
1. selecting a DNS zone from a different project should only be allowed if tech preview is enabled in the install config. (Using a DNS zone from a different project will fail to install due to outstanding work in the cluster ingress operator). 
2. GCP XPN passes through the installer host service account for control plane nodes. This should only happen if XPN (networkProjectID) is enabled. It should not happen during normal installs.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

For install config fields:
1. Specify a project ID for a DNS zone without featureSet: TechPreviewNoUpgrade
2. Run openshift-install create manifests
====
For service accounts:
1. perform normal (not XPN) install
2. Check service account on control plane VM

 

Actual results:

For install config fields: you can specify project id without an error
For service accounts: the control plane vm will have same service account used for install

Expected results:

For install config fields: installer should complain that tech preview is not enabled
For service accounts: should have a new service account, created during install

Additional info:

 

Description of problem:

Cannot scale up worker nodes after deploying an OCP 4.12 cluster via UPI on Azure Stack Hub

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-12-27-111646

How reproducible:

Always

Steps to Reproduce:

1. Follow the steps described in https://github.com/openshift/installer/blob/master/docs/user/openstack/install_upi.md to create a cluster.
2. In https://github.com/openshift/installer/blob/master/docs/user/openstack/install_upi.md#remove-machines-and-machinesets, only the control plane machine manifests were removed; the worker machine manifests remained untouched.
3. After three masters and three worker nodes were created by ARM templates, an additional worker was added using a machine set via the command:
$ oc scale --replicas=1 machineset maxu-stack0-nsqhm-worker-mtcazs -n openshift-machine-api

Actual results:

$oc describe machine  maxu-stack0-nsqhm-worker-mtcazs-2gjsl -n openshift-machine-api
Message:               failed to create vm maxu-stack0-nsqhm-worker-mtcazs-2gjsl: failure sending request for machine maxu-stack0-nsqhm-worker-mtcazs-2gjsl: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="NotFound" Message="The Availability Set '/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/maxu-stack0-nsqhm-rg/providers/Microsoft.Compute/availabilitySets/maxu-stack0-nsqhm-cluster' cannot be found."

Expected results:

The installer should be able to create and manage machineset

Additional info:

1. https://issues.redhat.com/browse/OCPBUGS-4405
2. The following changes to the ARM template files resolve this issue:
"imageName" : "[parameters('baseName')]", 
"masterAvailabilitySetName" : "[concat(parameters('baseName'), '-cluster')]",

 

We need to see how many 4.13 clusters opted in for vSphere CSI migration in telemetry, so we can predict issues in 4.13 -> 4.14 upgrade (where CSI migration will be forced without opt-out).

 

SaaS Staging: Cluster preparation for installation was stuck for 40 minutes.
It happened after an assisted-service update on Staging.
Installation resumed after 40 minutes.

Events:
11/7/2022, 11:28:24 AM Updated status of the cluster to preparing-for-installation
11/7/2022, 11:50:10 AM warning Preparing for installation was timed out for the cluster
11/7/2022, 12:10:10 PM warning Preparing for installation was timed out for the cluster
11/7/2022, 12:12:19 PM Updated status of the cluster to installing

assisted-service log:

time="2022-11-07T16:28:24Z" level=info msg="cluster f63981b1-88bc-4cec-ab9f-8ff2a6787383 has been updated with the following updates [status preparing-for-installation status_info Preparing cluster for installation install_started_at 2022-11-07T16:28:24.593Z installation_preparation_completion_status  logs_info  controller_logs_started_at 0001-01-01T00:00:00.000Z controller_logs_collected_at 0001-01-01T00:00:00.000Z status_updated_at 2022-11-07T16:28:24.594Z trigger_monitor_timestamp 2022-11-07 16:28:24.59401028 +0000 UTC m=+1491.787907904]" func=github.com/openshift/assisted-service/internal/cluster.updateClusterStatus file="/assisted-service/internal/cluster/common.go:73" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=11780 pkg=cluster-state request_id=ca7c1574-f385-47d8-93ca-85b3012d6bac

time="2022-11-07T16:28:24Z" level=info msg="No ImageContentSources in install-config to build ICSP file" func=github.com/openshift/assisted-service/internal/ignition.getIcspFileFromInstallConfig file="/assisted-service/internal/ignition/ignition.go:1608" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=11953 request_id=

time="2022-11-07T16:50:42Z" level=info msg="cluster f63981b1-88bc-4cec-ab9f-8ff2a6787383 has been updated with the following updates [status preparing-for-installation status_info Preparing cluster for installation install_started_at 2022-11-07T16:50:42.232Z installation_preparation_completion_status  logs_info  controller_logs_started_at 0001-01-01T00:00:00.000Z controller_logs_collected_at 0001-01-01T00:00:00.000Z status_updated_at 2022-11-07T16:50:42.232Z trigger_monitor_timestamp 2022-11-07 16:50:42.232163698 +0000 UTC m=+2829.426061322]" func=github.com/openshift/assisted-service/internal/cluster.updateClusterStatus file="/assisted-service/internal/cluster/common.go:73" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=26070 pkg=cluster-state request_id=b8bdbf64-0f86-4013-9c7f-4fe769b64769

time="2022-11-07T16:50:42Z" level=info msg="No ImageContentSources in install-config to build ICSP file" func=github.com/openshift/assisted-service/internal/ignition.getIcspFileFromInstallConfig file="/assisted-service/internal/ignition/ignition.go:1608" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=26429 request_id=


time="2022-11-07T17:03:16Z" level=warning msg="No bootstrap found in cluster f63981b1-88bc-4cec-ab9f-8ff2a6787383" func=github.com/openshift/assisted-service/internal/network.GetPrimaryMachineCidrForUserManagedNetwork file="/assisted-service/internal/network/machine_network_cidr.go:242" pkg=provider

time="2022-11-07T17:08:20Z" level=info msg="extracting openshift-install binary to /data/install-config-generate/installercache/quay.io/openshift-release-dev/ocp-release:4.12.0-ec.5-x86_64" func="github.com/openshift/assisted-service/internal/oc.(*release).extractFromRelease" file="/assisted-service/internal/oc/release.go:302" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=11953 request_id=

time="2022-11-07T17:12:01Z" level=info msg="No ImageContentSources in install-config to build ICSP file" func=github.com/openshift/assisted-service/internal/ignition.getIcspFileFromInstallConfig file="/assisted-service/internal/ignition/ignition.go:1608" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=43713 request_id=
time="2022-11-07T17:12:01Z" level=info msg="Listing objects by with prefix f63981b1-88bc-4cec-ab9f-8ff2a6787383/manifests/manifests" func="github.com/openshift/assisted-service/pkg/s3wrapper.(*S3Client).ListObjectsByPrefix" file="/assisted-service/pkg/s3wrapper/client.go:397" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=43713 request_id=
time="2022-11-07T17:12:01Z" level=info msg="Listing objects by with prefix f63981b1-88bc-4cec-ab9f-8ff2a6787383/manifests/openshift" func="github.com/openshift/assisted-service/pkg/s3wrapper.(*S3Client).ListObjectsByPrefix" file="/assisted-service/pkg/s3wrapper/client.go:397" cluster_id=f63981b1-88bc-4cec-ab9f-8ff2a6787383 go-id=43713 request_id=

time="2022-11-07T17:13:13Z" level=info msg="Host 958b214d-d81a-4d75-bcde-72ae3114aa1e in cluster f63981b1-88bc-4cec-ab9f-8ff2a6787383: reached installation stage Starting installation: bootstrap" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2UpdateHostInstallProgressInternal" file="/assisted-service/internal/bminventory/inventory.go:4975" go-id=44776 host_id=958b214d-d81a-4d75-bcde-72ae3114aa1e infra_env_id=7603028e-5767-4e00-9024-4d7613950d87 pkg=Inventory request_id=a8153e78-b415-4d6d-baad-374f7e0c602c

time="2022-11-07T17:24:07Z" level=info msg="Host 958b214d-d81a-4d75-bcde-72ae3114aa1e in cluster f63981b1-88bc-4cec-ab9f-8ff2a6787383: reached installation stage Done" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2UpdateHostInstallProgressInternal" file="/assisted-service/internal/bminventory/inventory.go:4975" go-id=53118 host_id=958b214d-d81a-4d75-bcde-72ae3114aa1e infra_env_id=7603028e-5767-4e00-9024-4d7613950d87 pkg=Inventory request_id=68b8a157-1ce2-4c6e-92f1-6e747e3ff54f

This is a clone of issue OCPBUGS-10836. The following is the description of the original issue:

Description of problem:

As a user, when I select the All projects option from the Projects dropdown on the Pipelines pages in the Dev perspective, the selected option shows as undefined.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. Navigate to Pipelines page in the Dev perspective
2. Select the All projects option from the Projects dropdown

Actual results:

Selected option shows as undefined and all Projects list is not shown

Expected results:

Selected option should be All projects and open All projects list page

Additional info:

Tracker issue for bootimage bump in 4.13. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-10739.

Description of problem:

[Hypershift] default KAS PSA config should be consistent with OCP 
 enforce: privileged 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-21-084440

How reproducible:

Always

Steps to Reproduce:

1. Install OCP cluster and hypershift operator
2. Create hosted cluster
3. Check the default KAS config of the hosted cluster

Actual results:

The hosted cluster default KAS PSA config enforce is 'restricted'
$ jq '.admission.pluginConfig.PodSecurity' < `oc extract cm/kas-config -n clusters-9cb7724d8bdd0c16a113 --confirm`
{
  "location": "",
  "configuration": {
    "kind": "PodSecurityConfiguration",
    "apiVersion": "pod-security.admission.config.k8s.io/v1beta1",
    "defaults": {
      "enforce": "restricted",
      "enforce-version": "latest",
      "audit": "restricted",
      "audit-version": "latest",
      "warn": "restricted",
      "warn-version": "latest"
    },
    "exemptions": {
      "usernames": [
        "system:serviceaccount:openshift-infra:build-controller"
      ]
    }
  }
}

Expected results:

The hosted cluster default KAS PSA config enforce should be 'privileged' in

https://github.com/openshift/hypershift/blob/release-4.13/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L93

Additional info:

References: OCPBUGS-8710

ipv6 upgrade job has been failing for months

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-upgrade-ovn-ipv6

 

Looking at a few of the most recent runs, the failing test common to them all is

disruption_tests: [sig-network-edge] Verify DNS availability during and after upgrade success 

e.g. from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-upgrade-ovn-ipv6/1620652913376366592

{Feb  1 07:08:57.228: too many pods were waiting: ns/e2e-check-for-dns-availability-7828 pod/dns-test-ec5a79ee-0091-4081-ae69-fa6a4a6ed3ee-7s48w,ns/e2e-check-for-dns-availability-7828 pod/dns-test-ec5a79ee-0091-4081-ae69-fa6a4a6ed3ee-94rq4,ns/e2e-check-for-dns-availability-7828 pod/dns-test-ec5a79ee-0091-4081-ae69-fa6a4a6ed3ee-t9wnk

github.com/openshift/origin/test/e2e/upgrade/dns.(*UpgradeTest).validateDNSResults(0x8793c91?, 0xc005f646e0)
	github.com/openshift/origin/test/e2e/upgrade/dns/dns.go:142 +0x2f4
github.com/openshift/origin/test/e2e/upgrade/dns.(*UpgradeTest).Test(0xc005f646e0?, 0x9407e78?, 0xcb34730?, 0x0?)
	github.com/openshift/origin/test/e2e/upgrade/dns/dns.go:48 +0x4e
github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc000c6f8c0, 0xc0005eebb8)
	github.com/openshift/origin/test/extended/util/disruption/disruption.go:201 +0x4a2
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
	k8s.io/kubernetes@v1.25.0/test/e2e/chaosmonkey/chaosmonkey.go:94 +0x6a
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	k8s.io/kubernetes@v1.25.0/test/e2e/chaosmonkey/chaosmonkey.go:91 +0x8b  }

Description of problem:

The tech preview was merged in 4.13 (https://issues.redhat.com/browse/WRKLDS-657), but the tests only made it into 4.14, so we need to backport them.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

As discovered in MGMT-10863, for a single-stack cluster with hosts living in multiple subnets (not dual-stack scenario here) it should be forbidden to set multiple machine networks.

In principle

  • single-stack - single machine network
  • dual-stack - exactly 2 machine networks

Apparently we don't have a validation for that in the ValidateIPAddresses function - https://github.com/openshift/assisted-service/blob/master/internal/cluster/validations/validations.go#L308-L406.

We should add a simple statement there like

if !reqDualStack && len(machineNetworks) > 1 {
    err = errors.Errorf("Single-stack cluster cannot be created with multiple Machine Networks")
    return err
}
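
The full rule stated above (single-stack: a single machine network; dual-stack: exactly 2) can be sketched in Python as follows. This is a minimal illustration, not the assisted-service code: the `validate_machine_networks` function is hypothetical, and the dual-stack error message is invented for the example.

```python
def validate_machine_networks(machine_networks, dual_stack):
    """Return an error message if the machine network count is invalid, else None."""
    if dual_stack:
        if len(machine_networks) != 2:
            # Message text is an assumption made for this sketch.
            return "Dual-stack cluster requires exactly 2 Machine Networks"
    elif len(machine_networks) > 1:
        return "Single-stack cluster cannot be created with multiple Machine Networks"
    return None


print(validate_machine_networks(["192.168.10.0/24", "192.168.20.0/24"], dual_stack=False))
```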

It is important to note that it does not obviously indicate an issue with our code responsible for autogeneration of machine networks. This is because the issue has been observed independently with SNO and multi-node clusters. Code for respective paths where all of them seem to only take a single entry

Nodes in Ironic are created following pattern <namespace>~<host name>.

However, when creating nodes in Ironic, baremetal-operator first creates them without a namespace, and only prepends the namespace prefix later. This opens up the possibility of node name clashes, especially in the ACM context.

Description of problem:

The system:openshift:openshift-controller-manager:leader-locking-ingress-to-route-controller role and role binding should not be present in the openshift-route-controller-manager namespace. They are not needed since the leader-locking responsibility was moved to route-controller-manager, which is managed by leader-locking-openshift-route-controller-manager.

This was added in and used by https://github.com/openshift/openshift-controller-manager/pull/230/files#diff-2ddbbe8d5a13b855786852e6dc0c6213953315fd6e6b813b68dbdf9ffebcf112R20

Description of problem:

Agent-based installation fails in a disconnected environment because a pull secret is required for registry.ci.openshift.org. Since the cluster is being installed in a disconnected environment, the mirror registry secrets alone should be enough to pull the image.

Version-Release number of selected component (if applicable):

registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-18-041406

How reproducible:

Always

Steps to Reproduce:

1. Setup mirror registry with this registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-18-041406 release. 
2. Add the ICSP information in the install-config file
3. Create agent.iso using install-config.yaml and agent-config.yaml
4. SSH to node zero to see the error in create-cluster-and-infraenv.service.

Actual results:

create-cluster-and-infraenv.service is failing with below error:
 
time="2022-10-18T09:36:13Z" level=fatal msg="Failed to register cluster with assisted-service: AssistedServiceError Code: 400 Href:  ID: 400 Kind: Error Reason: pull secret for new cluster is invalid: pull secret must contain auth for \"registry.ci.openshift.org\""

Expected results:

create-cluster-and-infraenv.service should be successfully started.

Additional info:

Refer to this similar bug: https://bugzilla.redhat.com/show_bug.cgi?id=1990659
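
The expected behavior can be illustrated with a small check of which registries a pull secret covers. This is only a sketch in Python: the helper name missing_registry_auths is hypothetical, and it assumes the usual pull secret layout with a top-level "auths" object keyed by registry host. In a disconnected environment the list of required registries would presumably contain only the mirror registry.

```python
import json


def missing_registry_auths(pull_secret_json, required_registries):
    """Return the required registries that have no entry under "auths"."""
    auths = json.loads(pull_secret_json).get("auths", {})
    return [registry for registry in required_registries if registry not in auths]
```

With a secret that only holds credentials for the mirror, requiring registry.ci.openshift.org reproduces the reported rejection, while requiring only the mirror host passes.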

Description of problem:

The vSphere status health item is misleading.

More info: https://coreos.slack.com/archives/CUPJTHQ5P/p1672829660214369

 

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Have OCP 4.12 on vSphere
2. On the Cluster Dashboard (landing page), check the vSphere Status Health (static plugin)
3.

Actual results:

The icon shows progress, but nothing is progressing when the modal dialog is open.

Expected results:

No misleading message and icon are rendered.

Additional info:

Since the Problem detector is not a reliable source, and modifying the HealthItem in the OCP Console is too complex a task at this stage of the release, a non-misleading text is good enough.

Description of problem:

An update from 4.13.0-ec.2 to 4.13.0-ec.3 got stuck on:

$ oc get clusteroperator machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.13.0-ec.2   True        True          True       30h     Unable to apply 4.13.0-ec.3: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 105, ready 105, updated: 105, unavailable: 0)]

The worker MachineConfigPool status included:

      type: NodeDegraded
    - lastTransitionTime: "2023-02-16T14:29:21Z"
      message: 'Failed to render configuration for pool worker: Ignoring MC 99-worker-generated-containerruntime
        generated by older version 8276d9c1f574481043d3661a1ace1f36cd8c3b62 (my version:
        c06601510c0917a48912cc2dda095d8414cc5182)'

Version-Release number of selected component (if applicable):

4.13.0-ec.3. The behavior was apparently introduced as part of OCPBUGS-6018, which has been backported, so the following update targets are expected to be vulnerable: 4.10.52+, 4.11.26+, 4.12.2+, and 4.13.0-ec.3.

How reproducible:

100%, when updating into a vulnerable release, if you happen to have a leaked MachineConfig.

Steps to Reproduce:

1. 4.12.0-ec.1 dropped cleanUpDuplicatedMC. Run a later release, like 4.13.0-ec.2.
2. Create more than one KubeletConfig or ContainerRuntimeConfig targeting the worker pool (or any pool other than master). The number of clusters that have had redundant configuration objects like this is expected to be small.
3. (Optionally?) delete the extra KubeletConfig and ContainerRuntimeConfig.
4. Update to 4.13.0-ec.3.

Actual results:

Update sticks on the machine-config ClusterOperator, as described above.

Expected results:

Update completes without issues.

Description of problem:

The no-capabilities job is currently failing because of some storage tests that seemingly can't pass when all optional capabilities are disabled.  See the results in:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-e2e-aws-ovn-no-capabilities/1628034290317004800

There was a previous attempt[1] to filter out the storage tests that can't pass in this environment, but it seems like some were missed.  The remaining failures (7 or so) need to be evaluated to determine if the tests should also be skipped in the no-caps job, or if the tests themselves should be modified to tolerate the no-caps cluster.

[1] https://github.com/openshift/origin/pull/27654 

Failing tests:

[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic Snapshot (delete policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion (persistent) [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Ephemeral Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works, check deletion (ephemeral) [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-storage] Mounted volume expand [Feature:StorageProvider] Should verify mounted devices can be resized [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Ephemeral Snapshot (delete policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works, check deletion (ephemeral) [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Pre-provisioned Snapshot (delete policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion (persistent) [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Pre-provisioned Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion (persistent) [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic Snapshot (retain policy)] snapshottable[Feature:VolumeSnapshotDataSource] volume snapshot controller should check snapshot fields, check restore correctly works after modifying source data, check deletion (persistent) [Suite:openshift/conformance/parallel] [Suite:k8s]

For most (but not all) of the tests, the failure seems to be:
{  fail [k8s.io/kubernetes@v1.25.0/test/e2e/storage/framework/snapshot_resource.go:65]: Feb 21 14:54:23.209: the server could not find the requested resource
Ginkgo exit error 1: exit with code 1}

So there is probably a resource type that is not registered when the storage capability is disabled, and which these tests rely on. Again, either the tests need to be changed to work without that resource, or the tests should be skipped in a no-caps cluster because they are not relevant.


Version-Release number of selected component (if applicable):

v4.13

How reproducible:

Always

Steps to Reproduce:

1. Run the no-caps job, or bring up a cluster with capabilitySet: None
2. run the storage tests in question

Actual results:

tests fail

Expected results:

tests pass or are skipped

Description of problem:

Now that the bug to include libnmstate.2.2.x has been resolved (https://issues.redhat.com/browse/OCPBUGS-11659), we are seeing a boot issue in which agent-tui can't start. It looks like it is failing to find the symlink libnmstate.2, since when it is run directly we see:
$ /usr/local/bin/agent-tui
/usr/local/bin/agent-tui: error while loading shared libraries: libnmstate.so.2: cannot open shared object file: No such file or directory

As a result, neither the console nor SSH is available during bootstrap, which makes debugging difficult. However, it does not affect the installation, as we still get a successful install. The bootstrap screenshots are attached.

 

Description of the problem:

When the CNV change https://github.com/openshift/assisted-service/pull/4434/files#diff-65564e0a228289a9ea8a5a502f0290402a5b2ab619d5ddd24be51bf6c37838a8R153 was introduced, a specific version was hardcoded, and this was missed in review.

As a result, each time the monitor runs it passes 4.12 as the version to downstream functions.

How reproducible:

Install any cluster and check monitor in code

Steps to reproduce:

1.

2.

3.

Actual results:

4.12 is hardcoded as version in host monitor

Expected results:

Each cluster uses its own version; if it has none, the CNV operator should return the result for the default version, which is 4.11 right now.
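
The expected fallback can be sketched as below. This is a Python illustration only; version_for_monitor and DEFAULT_OPENSHIFT_VERSION are hypothetical names, and the default value is the one named in this report.

```python
# Default named in the report above; the constant name is an assumption.
DEFAULT_OPENSHIFT_VERSION = "4.11"


def version_for_monitor(cluster_version):
    """Pick the version the host monitor passes on (hypothetical helper)."""
    # Use the cluster's own version when set, otherwise fall back to the default.
    return cluster_version if cluster_version else DEFAULT_OPENSHIFT_VERSION
```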

Description of problem:

Grant the monitoring-alertmanager-edit role to a user:

# oc adm policy add-cluster-role-to-user cluster-monitoring-view testuser-11

# oc adm policy add-role-to-user monitoring-alertmanager-edit testuser-11 -n openshift-monitoring --role-namespace openshift-monitoring

As the monitoring-alertmanager-edit user, go to the administrator console: the "Observe - Alerting - Silences" page stays pending when listing silences. Debugging in the console produced no findings.

 

Create a silence for the Watchdog alert as the monitoring-alertmanager-edit user: the Silences page is also pending. When checked with the kubeadmin user, the "Observe - Alerting - Silences" page shows that the Watchdog alert is silenced, but when checked with the monitoring-alertmanager-edit user, the Watchdog alert is not silenced.

This appears to be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1947005 since 4.9: there was no such issue then, but there is a similar issue with 4.9.0-0.nightly-2022-09-05-125502 now.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-08-114806

How reproducible:

always

Steps to Reproduce:

1. see the description
2.
3.

Actual results:

In the administrator console, when the monitoring-alertmanager-edit user lists or creates a silence, the "Observe - Alerting - Silences" page stays pending.

Expected results:

should not be pending

Additional info:

 

Description of the problem:

When installing SNO via assisted-installer with the EC version of 4.13, the local dnsmasq process in the node is not listening on all interfaces, and only listens for localhost loopback.

It makes kubelet and kube-apiserver unable to resolve the fqdn and api/api-int by locally requesting dns resolution from the dnsmasq process.

How reproducible:

100%

Steps to reproduce:

with test-infra:

  1. export OPENSHIFT_INSTALL_RELEASE_IMAGE=registry.ci.openshift.org/rhcos-devel/ocp-4.13-9.0:4.13.0-ec.1
  2. Get a token for registry-ci and embed it in the pull-secret. Make sure PULL_SECRET is defined and has valid credentials for registry-ci
  3. make run deploy_nodes_with_install OPENSHIFT_VERSION=4.13 NUM_MASTERS=1

Actual results:

SNO installation is stuck and does not proceed.

Expected results:

Successful SNO installation.

If all the add actions on the Add page are disabled in customization, the details on/off switch should also be disabled, since it is of no use when all actions are disabled.

Test Setup:
1) In cluster configuration of Console CRD, disable all the add page options 

Description of problem:

The workaround for BZ 1854355 seems to be permanent now since BZ 1854355 was closed based on this workaround from BZ 1869886 below.

https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/configure-ovs-network.yaml#L766

# if the interface is of type vmxnet3 add multicast capability for that driver
# REMOVEME: Once BZ:1854355 is fixed, this needs to get removed.
function configure_driver_options {
  intf=$1
  if [ ! -f "/sys/class/net/${intf}/device/uevent" ]; then
    echo "Device file doesn't exist, skipping setting multicast mode"
  else
    driver=$(cat "/sys/class/net/${intf}/device/uevent" | grep DRIVER | awk -F "=" '{print $2}')
    echo "Driver name is" $driver
    if [ "$driver" = "vmxnet3" ]; then
      ifconfig "$intf" allmulti
    fi
  fi
}

If this is permanent then we also need to update the workaround for ifconfig deprecation and future RHEL9 support

ifconfig "$intf" allmulti  -> ip link set dev "${intf}" allmulticast on

We should update the comment and ensure the workaround moves into requirements and OPNET-10
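
The driver check and the proposed ip replacement can be sketched as follows. This is a Python illustration under stated assumptions, not the configure-ovs script itself: the helper names are hypothetical, and the uevent text is passed in as a string rather than read from /sys.

```python
def driver_from_uevent(uevent_text):
    """Return the DRIVER value from a sysfs uevent file's text, or None."""
    for line in uevent_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DRIVER":
            return value.strip()
    return None


def allmulticast_command(interface, uevent_text):
    """Return the ip argv that enables allmulticast for vmxnet3, else None."""
    if driver_from_uevent(uevent_text) != "vmxnet3":
        return None
    return ["ip", "link", "set", "dev", interface, "allmulticast", "on"]
```

Returning an argument list rather than a shell string keeps the interface name from being interpreted by a shell if the command is later executed.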

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-01-23-085522
 

How reproducible:

On vSphere, always.

Steps to Reproduce:

1. install vSphere OVN cluster with vmxnet3

Actual results:

Install succeeds

Expected results:

Install succeeds

Detected several duplicates:

  • update pull-secret
  • Update SchedulableMasters
  • Update Proxy
  • Day2 api vip dnsname/ip
  • ssh key

Description of problem:

Single node jobs are failing on this error message:

{  fail [github.com/openshift/origin/test/extended/single_node/topology.go:97]: pod-identity-webhook in openshift-cloud-credential-operator namespace expected to have 1 replica but got 2
Expected
    <int>: 2
to equal
    <int>: 1
Ginkgo exit error 1: exit with code 1} 

Example run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-ovn-single-node/1620827236984688640

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Additional info:

This started happening after https://github.com/openshift/cloud-credential-operator/pull/492 merged. The number of replicas should depend on the cluster topology (HA or single node).
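
A minimal sketch of a topology-dependent replica count, in Python for illustration. The function name webhook_replicas is hypothetical; the strings "SingleReplica" and "HighlyAvailable" are the topology values reported in the Infrastructure status, and which topology field the operator should read is an assumption left open here.

```python
def webhook_replicas(topology):
    """Replica count for the pod-identity-webhook (hypothetical helper).

    One replica on single-node clusters, two otherwise.
    """
    return 1 if topology == "SingleReplica" else 2
```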

 

sippy-link=[variants=single-node]

This is a clone of issue OCPBUGS-13372. The following is the description of the original issue:

Description of problem:

The test for updating the sysctl whitelist fails to check the error returned when the pod running state is verified.

As a result, the test always passes, and we failed to detect a bug in the cluster network operator for the allowlist controller.

Description of problem:

Not all monitoring components configure Prometheus to use mTLS for accessing their /metrics endpoint. Some continue using bearer token authentication (for instance openshift-state-metrics). 

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Check the Prometheus configuration
2.
3.

Actual results:

Scrape configurations for the following monitoring components don't configure mTLS for scraping metrics:
* openshift-state-metrics
* thanos-ruler (when UWM is enabled)

Scrape configuration looks like (note that there's no cert_file & key_file):

  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
    server_name: cluster-monitoring-operator.openshift-monitoring.svc
    insecure_skip_verify: false

Expected results:

Scrape configurations use mTLS for authentication, like this:

  tls_config:
    ca_file: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
    cert_file: /etc/prometheus/secrets/metrics-client-certs/tls.crt
    key_file: /etc/prometheus/secrets/metrics-client-certs/tls.key
    server_name: alertmanager-main.openshift-monitoring.svc
    insecure_skip_verify: false

Additional info:

cluster-monitoring-operator still uses bearer token for authentication because it's managed by CVO and we have no easy way to inject the client CA into the cluster-monitoring-operator deployment.
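
Step 1 ("Check the Prometheus configuration") could be automated along these lines. This is a sketch in Python that assumes the scrape configurations have already been parsed into dictionaries; the helper names are hypothetical, while job_name, tls_config, cert_file and key_file are the Prometheus configuration keys shown above.

```python
def uses_mtls(scrape_config):
    """True when the scrape config presents a client certificate and key."""
    tls = scrape_config.get("tls_config", {})
    return bool(tls.get("cert_file")) and bool(tls.get("key_file"))


def jobs_without_mtls(scrape_configs):
    """Return the job names whose scrape configs lack a client cert or key."""
    return [
        config.get("job_name", "<unnamed>")
        for config in scrape_configs
        if not uses_mtls(config)
    ]
```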

 

 

 

 

This is a clone of issue OCPBUGS-9956. The following is the description of the original issue:

Description of problem:

The PipelineRun default template name has been updated in the backend in Pipeline operator 1.10, so we need to update the name in the UI code as well.

 

https://github.com/openshift/console/blob/master/frontend/packages/pipelines-plugin/src/components/pac/const.ts#L9

 

This is a clone of issue OCPBUGS-11280. The following is the description of the original issue:

Description of problem:

There is a forcedns dispatcher script, added by the assisted-installer installation process, that creates /etc/resolv.conf.

This script has no shebang, which caused installation to fail because no resolv.conf was generated.

In order to fix upgrades in already installed clusters, we need to work around this issue.

 

Version-Release number of selected component (if applicable):

4.13.0

How reproducible:

Happens every time

Description of problem:

On a 4.12 cluster with no PV for Prometheus, the documentation link in the Degraded condition message still points to 4.8:

# oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded"))'
[
  {
    "lastTransitionTime": "2022-10-09T02:36:16Z",
    "message": "Prometheus is running without persistent storage which can lead to data loss during upgrades and cluster disruptions. Please refer to the official documentation to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html",
    "reason": "PrometheusDataPersistenceNotConfigured",
    "status": "False",
    "type": "Degraded"
  }
]

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

always

Steps to Reproduce:

1. no PVs for prometheus, check the monitoring operator status
2.
3.

Actual results:

The doc link still points to 4.8

Expected results:

links to the latest doc
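
One way to keep the link current is to derive it from the cluster version instead of hardcoding it. The sketch below is in Python and only illustrates the idea; doc_link is a hypothetical helper, the URL pattern is taken from the message above, and falling back to "latest" for an unparsable version is an assumption.

```python
import re

DOC_URL = (
    "https://docs.openshift.com/container-platform/{version}"
    "/monitoring/configuring-the-monitoring-stack.html"
)


def doc_link(cluster_version):
    """Build the doc URL for a version such as '4.12.0-0.nightly-2022-10-05-053337'."""
    match = re.match(r"(\d+)\.(\d+)", cluster_version)
    # Assumption: use the 'latest' docs when the version cannot be parsed.
    minor = f"{match.group(1)}.{match.group(2)}" if match else "latest"
    return DOC_URL.format(version=minor)
```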

Additional info:

slack thread: 
https://coreos.slack.com/archives/G79AW9Q7R/p1665283462123389

Description of problem:
Current prometheus rules for MCO do not define namespace:
https://github.com/openshift/machine-config-operator/blob/master/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L98

As per style guide namespace should be included: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

Version-Release number of selected component (if applicable):

4.11.x but most likely older versions impacted too.

How reproducible:

always

Steps to Reproduce:

1. Trigger any alert from MCO (e.g. allocate a lot of memory on the control plane)

Actual results:

Alert has no namespace indicating source

Expected results:

The alerts define a namespace label, most likely "openshift-machine-config-operator".
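
A check for alerts that lack the label can be sketched as below, in Python for illustration. It assumes the rule groups of a PrometheusRule have been loaded into dictionaries and only looks at static labels, so a namespace label that comes from the alert expression itself is not detected; the function name is hypothetical.

```python
def alerts_missing_namespace(rule_groups):
    """Return names of alerting rules with no static 'namespace' label."""
    names = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            # Recording rules have a 'record' key instead of 'alert'; skip them.
            if "alert" in rule and "namespace" not in rule.get("labels", {}):
                names.append(rule["alert"])
    return names
```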

I suppose this is a 4.12+ change if anything.

Steps to reproduce:
Release: 4.13.0-0.nightly-2022-11-30-183109 (latest 4.12 nightly as well)
Create a HyperShift cluster on AWS, wait until it has completed rolling out
Upgrade the HostedCluster by updating its release image to a newer one
Observe the 'network' clusteroperator resource in the guest cluster as well as the 'version' clusterversion resource in the guest cluster.
When the clusteroperator resource reports the upgraded release and the clusterversion resource reports the new release as applied, take a look at the ovnkube-master statefulset in the control plane namespace of the management cluster. It is still not finished rolling out.

Expected: that the network clusteroperator reports the new version only when all components have finished rolling out.
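
The expected gating can be sketched as follows. This is a Python illustration, not the cluster-network-operator logic: the helper names are hypothetical, and the rollout test uses the standard StatefulSet fields (metadata.generation, spec.replicas, status.observedGeneration, status.updatedReplicas, status.readyReplicas) loaded as dictionaries.

```python
def statefulset_rolled_out(sts):
    """True when every replica is updated and ready for the current spec."""
    spec_replicas = sts.get("spec", {}).get("replicas", 1)
    status = sts.get("status", {})
    generation = sts.get("metadata", {}).get("generation", 0)
    return (
        status.get("observedGeneration", 0) >= generation
        and status.get("updatedReplicas", 0) == spec_replicas
        and status.get("readyReplicas", 0) == spec_replicas
    )


def can_report_new_version(workloads):
    """Report the new version only when all workloads finished rolling out."""
    return all(statefulset_rolled_out(w) for w in workloads)
```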

Description of problem:
This is a follow-up on OCPBUGS-2579, where Prabhu fixed a similar issue for catalog items "Helm Charts" and "Samples". The same issue also happens for Serverless actions "Event Sink", "Event Source", "Channel" and "Broker".

Version-Release number of selected component (if applicable):
4.13, earlier versions have the same issue

How reproducible:
Always

Steps to Reproduce:
1. Install Serverless operator and create Eventing and Serving resources
2. Import an application (Developer perspective > add > container image)
3. Open customization (path /cluster-configuration/) and disable all add actions
4. Wait some seconds and check that the Developer perspective > Add page shows no items
5. Navigate to topology perspective and right click outside of the app / workload

Actual results:
Add to project menu is shown with the options "Event Sink", "Event Source", "Channel" and "Broker".

Expected results:
The options "Event Sink", "Event Source", "Channel" and "Broker" should not be shown when they are disabled.

Additional info:
Follow up on OCPBUGS-2579

Currently, when the controller has an issue with DNS pointing to the wrong IP, we don't know why.

We should add resolv.conf to the controller logs to be able to see that it was managed successfully.

Description of problem:

On the Make Serverless page, the values of the minpod, maxpod and concurrency input fields can only be changed by clicking '+' or '-'; they can't be changed by typing.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

always

Steps to Reproduce:

1. Create a deployment workload from import from git
2. Right click on workload and select Make Serverless option
3. Check functioning of inputs minpod, maxpod etc.

Actual results:

The values of the minpod, maxpod and concurrency input fields can only be changed by clicking '+' or '-'; they can't be changed by typing.

Expected results:

The values of the minpod, maxpod and concurrency input fields can be changed both by clicking '+' or '-' and by typing.

Additional info:

Works fine in v4.11

Description of problem:
This is a follow up on OCPBUGSM-47202 (https://bugzilla.redhat.com/show_bug.cgi?id=2110570)

While OCPBUGSM-47202 fixes the issue specifically for Set Pod Count, many other actions aren't fixed. When the user updates a Deployment with one of these options and selects the action again, the old values are still shown.

Version-Release number of selected component (if applicable):
4.8-4.12 as well as master with the changes of OCPBUGSM-47202

How reproducible:
Always

Steps to Reproduce:

  1. Import a deployment
  2. Select the deployment to open the topology sidebar
  3. Click on actions and one of the 4 options to update the deployment with a modal
    1. Edit labels
    2. Edit annotations
    3. Edit update strategy
    4. Edit resource limits
  4. Click on the action again and check if the data in the modal reflects the changes from step 3

Actual results:
Old data (labels, annotations, etc.) was shown.

Expected results:
Latest data should be shown

Additional info:

Description of problem:

While running scale tests of OpenShift on OpenStack, we're seeing it perform significantly worse than on the AWS platform for the same number of nodes. More specifically, we're seeing high traffic to the API server and high load on the haproxy pod.

Version-Release number of selected component (if applicable):

All supported versions

Additional info:

Slack thread at https://coreos.slack.com/archives/CBZHF4DHC/p1669910986729359 provides more info.

Description of problem:

While running ./openshift-install agent wait-for install-complete --dir billi --log-level debug on a real bare-metal dual-stack compact cluster installation, the command errors out with the following error even though the installation is still progressing:

ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused

DEBUG Uploaded logs for host openshift-master-1 cluster d8b0979d-3d69-4e65-874a-d1f7da79e19e 
DEBUG Host: openshift-master-1, reached installation stage Rebooting 
DEBUG Host: openshift-master-1, reached installation stage Configuring 
DEBUG Host: openshift-master-2, reached installation stage Configuring 
DEBUG Host: openshift-master-2, reached installation stage Joined 
DEBUG Host: openshift-master-1, reached installation stage Joined 
DEBUG Host: openshift-master-0, reached installation stage Waiting for bootkube 
DEBUG Host openshift-master-1: updated status from installing-in-progress to installed (Done) 
DEBUG Host: openshift-master-1, reached installation stage Done 
DEBUG Host openshift-master-2: updated status from installing-in-progress to installed (Done) 
DEBUG Host: openshift-master-2, reached installation stage Done 
DEBUG Host: openshift-master-0, reached installation stage Waiting for controller: waiting for controller pod ready event 
ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused 
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR 				The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR 				https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 

Version-Release number of selected component (if applicable):

4.12.0-rc.0

How reproducible:

100%

Steps to Reproduce:

1. ./openshift-install agent create image --dir billi --log-level debug 
2. mount resulting iso image and reboot nodes via iLO
3. ./openshift-install agent wait-for install-complete --dir billi --log-level debug 

Actual results:

 ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused 

The cluster installation is not complete and needs more time to finish.

Expected results:

The command waits until the cluster installation completes.

Additional info:

The cluster installation eventually completes fine if waiting after the error.

Attaching install-config.yaml and agent-config.yaml
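
The expected behavior amounts to treating a refused connection as "not ready yet" rather than as a failure. The sketch below shows that idea in Python; it is not the installer's implementation (which is written in Go), and the function name, the check callable and the attempt limit are all hypothetical.

```python
import time


def wait_for_install_complete(check, attempts, delay_seconds=0.0, sleep=time.sleep):
    """Poll `check` until it returns True, tolerating connection errors.

    Returns True when the install completed, False when attempts run out.
    """
    for attempt in range(attempts):
        try:
            if check():
                return True
        except ConnectionError:
            # The API is not reachable yet (e.g. connection refused): keep waiting.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

ConnectionRefusedError is a subclass of ConnectionError in Python, so a refused connection during the wait is retried instead of aborting it.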

Description of problem:

Ingress Controller is missing a required AWS resource permission for SC2S region us-isob-east-1

During the OpenShift 4 installation in SC2S region us-isob-east-1, the ingress operator degrades due to missing "route53:ListTagsForResources" permission from the "openshift-ingress" CredentialsRequest for which customer proactively raised a PR.
--> https://github.com/openshift/cluster-ingress-operator/pull/868

The code disables part of the logic for C2S isolated regions here: https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L167-L168
By not setting tagConfig, it results in the m.tags field to be set nil: https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L212-L222
This then drives the logic in the getZoneID method to use either lookupZoneID or lookupZoneIDWithoutResourceTagging: https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L280-L284
BLAB: the lookupZoneIDWithoutResourceTagging method is only ever called for endpoints.AwsIsoPartitionID, endpoints.AwsIsoBPartitionID regions.

Version-Release number of selected component (if applicable):

 

How reproducible:

Every time

Steps to Reproduce:

1. Create an IPI cluster in  SC2S region us-isob-east-1.

Actual results:

Ingress operator degrades due to missing "route53:ListTagsForResources" permission with following error.
~~~
The DNS provider failed to ensure the record: failed to find hosted zone for record: failed to get tagged resources: AccessDenied: User ....... rye... is not authorized to perform: route53:ListTagsForResources on resource.... hostedzone/.. because no identify based policy allows the route53:ListTagsForResources
~~~

Expected results:

Ingress operator should be in available state for new installation.

Additional info:

 

Description of problem:

Deploy IPI cluster on azure cloud, set region as westeurope, vm size as EC96iads_v5 or EC96ias_v5. Installation fails with below error:

12-15 11:47:03.429  level=error msg=Error: creating Linux Virtual Machine: (Name "jima-15a-m6fzd-bootstrap" / Resource Group "jima-15a-m6fzd-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The VM size 'Standard_EC96iads_v5' is not supported for creation of VMs and Virtual Machine Scale Set with '<NULL>' security type."

This is similar to https://bugzilla.redhat.com/show_bug.cgi?id=2055247.

In the Azure portal, we can see that both VM sizes, EC96iads_v5 and EC96ias_v5, are of the confidential compute type.

We might need to handle them the same way as was done in bug 2055247.

 

Version-Release number of selected component (if applicable):

4.12 nightly build

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml file, set region as westeurope, vm size as EC96iads_v5 or EC96ias_v5
2. Deploy IPI azure cluster
3.

Actual results:

Install failed with error in description

Expected results:

The installer should exit during validation and show the expected error message. 

Additional info:

 

 

We do not have tooling to update past release data. We merged a known disruption regression after 4.13 was forked, and now we have PRs blocked on P99 disruption tests because the data can't be adapted. The first step is to temporarily disable all disruption testing.

Aggregated job testing will continue.

This is a clone of issue OCPBUGS-6770. The following is the description of the original issue:

When displaying my pipeline it is not rendered correctly with overlapping segments between parallel branches. However if I edit the pipeline then it appears fine. I have attached screenshots showing the issue.

This is a regression from 4.11 where it rendered fine.

Description of problem:

IPI on BareMetal Dual stack deployment failed and Bootstrap timed out before completion

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

Always

Steps to Reproduce:

1. Deploy IPI on BM using Dual stack 
2.
3.

Actual results:

Deployment failed

Expected results:

Should pass

Additional info:

Same deployment works fine on 4.11

Description of problem:

When there is a problem while rebooting a node, an MCDRebootError alarm is raised. This alarm disappears after 15 minutes, even if the machine was not rebooted.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         26m     Cluster version is 4.13.0-0.nightly-2022-12-22-120609

How reproducible:

Always

Steps to Reproduce:

1. Execute these commands in a worker node in order to break the reboot process.

$ mount -o remount,rw /usr
$ mv /usr/bin/systemd-run /usr/bin/systemd-run2

2. Create any MC. For example, this one:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
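
For reference, the file content carried by the `source` data URL in the MachineConfig above can be inspected by decoding its base64 payload. The following is a minimal sketch, not part of any OpenShift tooling; the helper name `decode_data_url` is hypothetical.

```python
import base64

# The data URL copied verbatim from the MachineConfig above.
SOURCE = (
    "data:text/plain;charset=utf;base64,"
    "c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK"
)


def decode_data_url(url: str) -> str:
    """Return the decoded text payload of a base64 data URL (hypothetical helper)."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    return base64.b64decode(payload).decode("utf-8")


if __name__ == "__main__":
    # Prints the three "server ... offline" lines that end up in /etc/test.
    print(decode_data_url(SOURCE), end="")
```

The payload is three chrony-style `server` lines, which makes it easy to confirm on the node whether the MachineConfig was actually applied.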


Actual results:

A MCDRebootError alarm is triggered. But after 15 minutes this alarm disappears.

Expected results:

The alarm should not disappear after 15 minutes. It should remain there until the node is rebooted.


Additional info:

This is the PR that seems to introduce this behavior
https://github.com/openshift/machine-config-operator/pull/3406#discussion_r1030481908


 

Description of problem:

When trying to deploy a Hypershift spoke cluster with zero workers, the HypershiftAgentServiceConfig (HASC) operator fails to apply properly.

Steps to Reproduce:

  1. Deploy the hub cluster normally.
  2. Deploy the MCE/ACM operator and the hypershift operator normally.
  3. Create the hypershift spoke cluster agent.
  4. Create and apply the HASC operator yaml.

Actual results:

$ oc describe hypershiftagentserviceconfigs.agent-install.openshift.io -A
....
Status:
  Conditions:
    Last Transition Time:  2022-12-04T14:01:01Z
    Message:               Failed to sync agent-install CRDs on spoke cluster: agent-install CRDs are not available
    Reason:                SpokeClusterCRDsSyncFailure
    Status:                False
    Type:                  ReconcileCompleted

The agent-install CRDs seem to be missing the relevant label 'operators.coreos.com/assisted-service-operator.assisted-installer':

 

$ oc describe crd agents.agent-install.openshift.io -A 

Name:         agents.agent-install.openshift.io
Namespace:    
Labels:       <none>

Expected results:

HASC should apply properly, and installation of spoke cluster should continue.

Additional info:

hasc.yaml example:

apiVersion: agent-install.openshift.io/v1beta1
kind: HypershiftAgentServiceConfig
metadata:
 name: hypershift-agent
 namespace: spoke-0
spec:
 kubeconfigSecretRef:
   name: spoke-0-kubeconfig
 databaseStorage:
  accessModes:
  - ReadWriteOnce
  resources:
   requests:
    storage: 8Gi
 filesystemStorage:
  accessModes:
  - ReadWriteOnce
  resources:
   requests:
    storage: 8Gi
 imageStorage:
  accessModes:
  - ReadWriteOnce
  resources:
   requests:
    storage: 10Gi

 

Description of problem:

When running an overnight run in dev-scripts (COMPACT_IPV4) with repeated installs I saw this panic in WaitForBootstrapComplete occur once.

level=debug msg=Agent Rest API Initialized
E1101 05:19:09.733309 1802865 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x4086520?, 0x1d875810})
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00056fb00?})
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x4086520, 0x1d875810})
    /usr/local/go/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/agent.(*NodeZeroRestClient).getClusterID(0xc0001341e0)
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/rest.go:121 +0x53
github.com/openshift/installer/pkg/agent.(*Cluster).IsBootstrapComplete(0xc000134190)
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/cluster.go:183 +0x4fc
github.com/openshift/installer/pkg/agent.WaitForBootstrapComplete.func1()
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:31 +0x77
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x1d8fa901?)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0001958c0?, {0x1a53c7a0, 0xc0011d4a50}, 0x1, 0xc0001958c0)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0009ab860?, 0x77359400, 0x0, 0xa?, 0x8?)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
github.com/openshift/installer/pkg/agent.WaitForBootstrapComplete({0x7ffd7fccb4e3?, 0x40d7e7?})
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:30 +0x1bc
github.com/openshift/installer/pkg/agent.WaitForInstallComplete({0x7ffd7fccb4e3?, 0x5?})
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:73 +0x56
github.com/openshift/installer/cmd/openshift-install/agent.newWaitForInstallCompleteCmd.func1(0xc0003b6c80?, {0xc0004d67c0?, 0x2?, 0x2?})
    /home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/agent/waitfor.go:73 +0x126
github.com/spf13/cobra.(*Command).execute(0xc0003b6c80, {0xc0004d6780, 0x2, 0x2})
    /home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc0013b0a00)
    /home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
    /home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918
main.installerMain()
    /home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
    /home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x33d3cd3]

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

Occurred on the 12th run; all previous installs were successful

Steps to Reproduce:

1. Set up dev-scripts for AGENT_E2E_TEST_SCENARIO=COMPACT_IPV4, no mirroring
2. Run 'make clean; make agent' in a loop
3. After repeated installs, the failure occurs

Actual results:

Panic in WaitForBootstrapComplete

Expected results:

No failure

Additional info:

It looks like clusterResult is used here even on failure, which causes the dereference - https://github.com/openshift/installer/blob/master/pkg/agent/rest.go#L121
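
The pattern described above, reading a result even though the call that produced it failed, can be illustrated independently of the installer code. The following Python sketch is only an analogy and not the installer's Go implementation; `FakeClient`, `RestUnavailable`, `ClusterResult` and `get_cluster_id` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusterResult:
    cluster_id: str


class RestUnavailable(Exception):
    """Raised by the hypothetical client when the REST API cannot be reached."""


class FakeClient:
    """Stand-in for a REST client; not the installer's NodeZeroRestClient."""

    def __init__(self, healthy: bool):
        self.healthy = healthy

    def list_clusters(self) -> List[ClusterResult]:
        if not self.healthy:
            raise RestUnavailable("agent REST API not reachable")
        return [ClusterResult(cluster_id="abc-123")]


def get_cluster_id(client: FakeClient) -> Optional[str]:
    """Return the cluster ID, or None when the call fails or returns nothing."""
    try:
        clusters = client.list_clusters()
    except RestUnavailable:
        # Stop here: there is no result to read when the call failed.
        return None
    if not clusters:
        return None
    return clusters[0].cluster_id
```

Returning early on failure lets the caller treat "not available yet" as a retryable state instead of crashing the wait loop.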

 

Description of problem:

The current implementation of registries.conf support is not working as expected. This bug report will outline the expectations of how we believe this should work.

Background

The containers/image project defines a configuration file called registries.conf, which controls how image pulls can be redirected to another registry. Effectively the pull request for a given registry is redirected to another registry which can satisfy the image pull request instead. The specification for the registries.conf file is located here. For tools such as podman and skopeo, this configuration file allows those tools to indicate where images should be pulled from, and the containers/image project rewrites the image reference on the fly and tries to get the image from the first location it can, preferring these "alternate locations" and then falling back to the original location if one of the alternate locations can't satisfy the image request.

An important aspect of this redirection mechanism is that it allows the "host:port" and "namespace" portions of the image reference to be redirected. To be clear on the nomenclature used in the registries.conf specification, a namespace refers to zero or more slash-separated sections leading up to the image name (which is called repo in the specification and has the tag or digest after it; see repo(:_tag|@digest) below), and the host[:port] refers to the domain where the image registry is hosted.

Example:

host[:port]/namespace[/namespace…]/repo(:_tag|@digest)

For example, if we have an image called myimage@sha:1234 and the image normally resides in quay.io/foo/myimage@sha:1234, you could redirect the image pull request to myregistry.com/bar/baz/myimage@sha:1234. Note that in this example the alternate registry location is on a different host, and the namespace "path" is different too.
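
The redirection in this example can be sketched as a prefix rewrite that lists the mirror locations first and the original location last. This is a simplified illustration, not the containers/image implementation: the function name `pull_candidates` and the dictionary layout are assumptions, the mirror host name is taken to be myregistry.com, and matching is done on whole path segments only.

```python
from typing import Dict, List


def pull_candidates(image: str, registries: List[Dict]) -> List[str]:
    """Return pull locations for `image`: mirrors first, original last.

    Each entry in `registries` mimics a [[registry]] table with a "location"
    string and a list of mirror locations under "mirrors" (assumed layout).
    When several locations match, the longest one wins.
    """
    best = None
    for reg in registries:
        loc = reg["location"]
        # Match whole path segments: "quay.io/foo" must not match "quay.io/foobar".
        if image == loc or image.startswith(loc + "/"):
            if best is None or len(loc) > len(best["location"]):
                best = reg
    if best is None:
        return [image]
    suffix = image[len(best["location"]):]
    candidates = [mirror + suffix for mirror in best.get("mirrors", [])]
    candidates.append(image)  # fall back to the original location
    return candidates


if __name__ == "__main__":
    registries = [{"location": "quay.io/foo", "mirrors": ["myregistry.com/bar/baz"]}]
    for ref in pull_candidates("quay.io/foo/myimage@sha:1234", registries):
        print(ref)
```

Keeping the original reference as the last candidate mirrors the fallback behaviour described in the background section.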

Use Case

In a typical development scenario, image references within an OLM catalog should always point to a production location where the image is intended to be pulled from when a catalog is published publicly. Doing this prevents publishing a catalog which contains image references to internal repositories, which would never be accessible by a customer. By using the registries.conf redirection mechanism, we can perform testing even before the images are officially published to public locations, and we can redirect the image reference from a production location to an internal repository for testing purposes. Below is a simple example of a registries.conf file that redirects image pull requests away from prodlocation.io to preprodlocation.com:

[[registry]]
 location = "prodlocation.io/xx"
 insecure = false
 blocked = false
 mirror-by-digest-only = true
 prefix = ""
 [[registry.mirror]]
  location = "preprodlocation.com/xx"
  insecure = false

Other Considerations

  • We only care about redirection of images during image pull. Image redirection on push is out of scope.
  • We would like to see as much support for the fields and TOML tables defined in the spec as possible. That being said, there are some items we don't really care about.
    • supported:
      • support multiple [[registry]] TOML tables
      • support multiple [[registry.mirror]] TOML tables for a given [[registry]] TOML table
      • if all entries of [[registry.mirror]] for a given [[registry]] TOML table do not resolve an image, the original [[registry]] TOML location should be used as the final fallback (this is consistent with how the specification is written, but we want to make this point clear; see the specification example, which describes how things should work)
      • prefix and location
        • These fields work together, so refer to the specification for how this works. If necessary, we could simplify this to only use location since we are unlikely to use the prefix option.
      • insecure
        • this should be supported for the [[registry]] and [[registry.mirror]] TOML tables so you know how to access registries. If this is not needed by oc mirror then we can forgo this field.
    • fields that require discussion:
      • we assume that digests and tags can be supplied for an image reference, but in the end digests are required for oc mirror to keep track of the image in the workspace. It's not clear if we need to support these configuration options or not:
        • mirror-by-digest-only
          • we assume this is always false since we don't need to prevent an image from being pulled if it is using a tag
        • pull-from-mirror
          • we assume this is always all since we don't need to prevent an image from being pulled if it is using a tag
    • does not need to be supported:
      • unqualified-search-registries
      • credential-helpers
      • blocked
      • aliases
  • we are not interested in supporting version 1 of registries.conf since it is deprecated

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

oc mirror -c ImageSetConfiguration.yaml --use-oci-feature --oci-feature-action mirror --oci-insecure-signature-policy --oci-registries-config registries.conf --dest-skip-tls docker://localhost:5000/example/test

Example registries.conf

[[registry]]
  prefix = ""
  insecure = false
  blocked = false
  location = "prod.com/abc"
  mirror-by-digest-only = true
  [[registry.mirror]]
    location = "internal.exmaple.io/cp"
    insecure = false
[[registry]]
  prefix = ""
  insecure = false
  blocked = false
  location = "quay.io"
  mirror-by-digest-only = true
  [[registry.mirror]]
    location = "internal.exmaple.io/abcd"
    insecure = false

 

Actual results:

images are not pulled from "internal" registry

Expected results:

images should be pulled from "internal" registry

Additional info:

The current implementation in oc mirror creates its own structs to approximate the ones provided by the containers/image project, but it might not be necessary to do that. Since the oc mirror project already uses containers/image as a dependency, it could leverage the FindRegistry function, which takes an image reference, loads the registries.conf information and returns the most appropriate [[registry]] reference (in the form of a Registry struct) or nil if no match was found. Obviously, custom processing will be necessary to do something useful with the Registry instance. Using this code is not a requirement, just a suggestion of another possible path to load the configuration.

Description of problem: there is unnecessary padding around the alert at the top of the debug pod terminal.

Steps to Reproduce:


1. Create a crash-looping pod by creating the default `example` pod using the `Create pod` button at the top of the Pods list page
2. Once the pod has a `Status` of `CrashLoopBackOff`, click `CrashLoopBackOff` so the popover appears and then click the `Debug container httpd` link within the popover
3. Note the alert at the top of the resulting `Debug httpd` page has unnecessary padding of 10px on all sides.

Description of problem:

https://github.com/openshift/operator-framework-olm/blob/7ec6b948a148171bd336750fed98818890136429/staging/operator-lifecycle-manager/pkg/controller/operators/olm/plugins/downstream_csv_namespace_labeler_plugin_test.go#L309

has a dependency on creation of a next-version release branch.

 

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. clone operator-framework/operator-framework-olm
2. make unit/olm
3. deal with a really bumpy first-time kubebuilder/envtest install experience
4. profit

 

 

Actual results:

error

Expected results:

pass

Additional info:

 

 

Description of problem:

I found OLM version is 0.17.0 for OCP 4.6, 4.7, 4.8, see:

https://github.com/openshift/operator-framework-olm/blob/release-4.6/staging/operator-lifecycle-manager/OLM_VERSION 

https://github.com/openshift/operator-framework-olm/blob/release-4.7/staging/operator-lifecycle-manager/OLM_VERSION

https://github.com/openshift/operator-framework-olm/blob/release-4.8/staging/operator-lifecycle-manager/OLM_VERSION

OLM version is 0.18.3 for OCP 4.9, https://github.com/openshift/operator-framework-olm/blob/release-4.9/staging/operator-lifecycle-manager/OLM_VERSION 

OLM version is 0.19.0 for OCP 4.10, 4.11, 4.12

https://github.com/openshift/operator-framework-olm/blob/release-4.10/staging/operator-lifecycle-manager/OLM_VERSION 

https://github.com/openshift/operator-framework-olm/blob/release-4.11/staging/operator-lifecycle-manager/OLM_VERSION 

https://github.com/openshift/operator-framework-olm/blob/release-4.12/staging/operator-lifecycle-manager/OLM_VERSION 

It's unclear to the user. What's the version naming rule we should follow? Thanks!

 

Version-Release number of selected component (if applicable):

4.6 -> 4.12

How reproducible:

always

Steps to Reproduce:

1. Check the OLM version
MacBook-Pro:operator-framework-olm jianzhang$ oc exec catalog-operator-7f4f564c97-fvzl4  -- olm --version
OLM version: 0.19.0
git commit: 11644a5433442c33698d2eee8d3f865b0d9386c0
MacBook-Pro:operator-framework-olm jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-31-101631   True        False         9h      Error while reconciling 4.12.0-0.nightly-2022-08-31-101631: authentication, kube-controller-manager, machine-config, openshift-apiserver has an unknown error: ClusterOperatorsDegraded 

Actual results:

See the description above.

Expected results:

OLM version should follow a clear version naming rule to align with OCP version.

Additional info:

 

This is a clone of issue OCPBUGS-8232. The following is the description of the original issue:

Description of problem:

The oc patch project command fails to annotate the project.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Run the patch command below to update the annotations on an existing project
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "This is a new project"}}}'
~~~
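
For context, `--type merge` requests a JSON merge patch (RFC 7386): objects are merged key by key, and a JSON null value removes a key. The command above passes the string "null" rather than JSON null, so under those semantics the display name would be set to the literal text. The sketch below illustrates the semantics on plain dictionaries; it is not how the API server is implemented, and the function name `merge_patch` is hypothetical.

```python
import copy
from typing import Any


def merge_patch(target: Any, patch: Any) -> Any:
    """Apply a JSON merge patch to `target` and return the result."""
    if not isinstance(patch, dict):
        # Scalars and lists replace the target value wholesale.
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # JSON null deletes the key
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


if __name__ == "__main__":
    project = {"metadata": {"name": "demo", "annotations": {"openshift.io/display-name": "Demo"}}}
    patch = {"metadata": {"annotations": {
        "openshift.io/display-name": "null",
        "openshift.io/description": "This is a new project",
    }}}
    print(merge_patch(project, patch))
```

Nothing in such a patch touches metadata.namespace, which is why the validation error reported below is unexpected.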


Actual results:

It produces the error output below:
~~~
The Project "<PROJECT_NAME>" is invalid: * metadata.namespace: Invalid value: "<PROJECT_NAME>": field is immutable * metadata.namespace: Forbidden: not allowed on this type 
~~~ 

Expected results:

The `oc patch project` command should patch the project with specified annotation.

Additional info:

Tried to patch the project with OCP 4.11.26 version, and it worked as expected.
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "New project"}}}'

project.project.openshift.io/<PROJECT_NAME> patched
~~~

The issue is with OCP 4.12, where it is not working. 

 

This is a clone of issue OCPBUGS-11921. The following is the description of the original issue:

Description of problem:

IPI installation to a shared VPC with 'credentialsMode: Manual' failed, due to no IAM service accounts for control-plane machines and compute machines

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-18-005127

How reproducible:

Always

Steps to Reproduce:

1. "create install-config", and then insert the settings of interest in install-config.yaml
2. "create manifests"
3. run "ccoctl" to create the required credentials
4. grant the above IAM service accounts the required permissions in the host project (see https://github.com/openshift/openshift-docs/pull/58474)
5. "create cluster" 

Actual results:

The installer doesn't create the two IAM service accounts, one for control-plane machines and another for compute machines, so no compute machines are created, which leads to installation failure.

Expected results:

The installation should succeed.

Additional info:

FYI https://issues.redhat.com/browse/OCPBUGS-11605
$ gcloud compute instances list --filter='name~jiwei-0418'
NAME                        ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP  STATUS
jiwei-0418a-9kvlr-master-0  us-central1-a  n2-standard-4               10.0.0.62                 RUNNING
jiwei-0418a-9kvlr-master-1  us-central1-b  n2-standard-4               10.0.0.58                 RUNNING
jiwei-0418a-9kvlr-master-2  us-central1-c  n2-standard-4               10.0.0.29                 RUNNING
$ gcloud iam service-accounts list --filter='email~jiwei-0418'
DISPLAY NAME                                                     EMAIL                                                                DISABLED
jiwei-0418a-14589-openshift-image-registry-gcs                   jiwei-0418a--openshift-i-zmwwh@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-machine-api-gcp                      jiwei-0418a--openshift-m-5cc5l@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-cloud-credential-operator-gcp-ro-creds         jiwei-0418a--cloud-crede-p8lpc@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-gcp-ccm                              jiwei-0418a--openshift-g-bljz6@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-ingress-gcp                          jiwei-0418a--openshift-i-rm4vz@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-cloud-network-config-controller-gcp  jiwei-0418a--openshift-c-6dk7g@openshift-qe.iam.gserviceaccount.com  False
jiwei-0418a-14589-openshift-gcp-pd-csi-driver-operator           jiwei-0418a--openshift-g-pjn24@openshift-qe.iam.gserviceaccount.com  False
$

 

This is a clone of issue OCPBUGS-8692. The following is the description of the original issue:

Description of problem:

In hypershift context:
Operands managed by Operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265

These operands, which run on the management side, should honour the same affinity, tolerations, node selector and priority rules as the operator.
This could be done by looking at the operator deployment itself or at the HCP resource.

multus-admission-controller
cloud-network-config-controller
ovnkube-master
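
One way to picture the first option, deriving the operand's scheduling rules from the operator deployment itself, is to copy the relevant pod-spec fields across. The sketch below works on plain dictionaries shaped like Deployment manifests. The function name `inherit_scheduling` and the example label are hypothetical; the field names are the standard Kubernetes pod-spec fields.

```python
import copy
from typing import Dict

# Standard pod-spec fields that control where a pod is scheduled.
SCHEDULING_FIELDS = ("affinity", "tolerations", "nodeSelector", "priorityClassName")


def inherit_scheduling(operator: Dict, operand: Dict) -> Dict:
    """Return a copy of `operand` whose pod spec carries the operator's scheduling fields."""
    result = copy.deepcopy(operand)
    src = operator.get("spec", {}).get("template", {}).get("spec", {})
    dst = result.setdefault("spec", {}).setdefault("template", {}).setdefault("spec", {})
    for field in SCHEDULING_FIELDS:
        if field in src:
            dst[field] = copy.deepcopy(src[field])
    return result


if __name__ == "__main__":
    operator = {"spec": {"template": {"spec": {
        "nodeSelector": {"example.com/pool": "control-plane"},  # hypothetical label
        "priorityClassName": "example-priority",  # hypothetical class name
    }}}}
    operand = {"spec": {"template": {"spec": {"containers": [{"name": "operand"}]}}}}
    print(inherit_scheduling(operator, operand)["spec"]["template"]["spec"])
```

Reading the same values from the HCP resource instead would follow the same shape, with a different source dictionary.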

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
3.

Actual results:

Operands are missing affinity rules and node selector

Expected results:

Operands have the same affinity rules and node selector as the operator

Additional info:

 

Description of problem:

When the IBMCOS storage type is used but the config itself is empty, it causes the cluster-image-registry-operator to fall into a crashloop due to NPE.

Version-Release number of selected component (if applicable):

 

How reproducible:

Can be reproduced easily.

Steps to Reproduce:

1. Create an OpenShift cluster on 4.9+
2. Alter the image registry storage config to use "spec.storage.ibmcos: {}"
3. Watch the cluster-image-registry-operator crashloop

Actual results:

The cluster-image-registry-operator pod is constantly in a crashloop.

Expected results:

The cluster-image-registry-operator pod should run without issues.

Additional info:

https://github.com/openshift/cluster-image-registry-operator/issues/835

Description of the problem:

In staging, BE 2.13.5 - 1 host in disconnected state makes it impossible to make changes to the cluster (Change host's role, generate new iso, etc.)  without deleting the host first. Error message shown is not clear.

How reproducible:

100%

Steps to reproduce:

1. Discover 5 hosts

2. Get 1 host into the disconnected state by:
a. virsh domblklist master-0-0

b. virsh change-media master-0-0 --eject hdd

3. Try to change the cluster: generate a new ISO, change a host role

Actual results:

 

Expected results:

The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. We'll need the components setting manifests for these deprecated APIs to move to modern APIs. And then we should drop our ability to reconcile the deprecated APIs, to avoid having other components leak back in to using them.

Specifically cluster-monitoring-operator touches:

Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times

Full output of the test at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27560/pull-ci-openshift-origin-master-e2e-gcp-ovn/1593697975584952320/artifacts/e2e-gcp-ovn/openshift-e2e-test/build-log.txt:

[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:159
flake: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4
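
The per-client lines in the output above follow a regular shape, so the clients still using removed APIs can be tallied with a short script. This is a sketch that assumes each relevant line contains `user/<name> accessed <api> <n> times`; the function name `removed_api_usage` is hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches e.g. "user/system:admin accessed flowschemas.v1beta1... 14 times".
LINE = re.compile(r"user/(?P<user>\S+) accessed (?P<api>\S+) (?P<count>\d+) times")


def removed_api_usage(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Map (user, api) to the access count reported in the test output.

    The output repeats the same summary several times, so the largest count
    seen for a pair is kept instead of summing the duplicates.
    """
    usage: Dict[Tuple[str, str], int] = {}
    for line in lines:
        match = LINE.search(line)
        if match:
            key = (match["user"], match["api"])
            usage[key] = max(usage.get(key, 0), int(match["count"]))
    return usage


if __name__ == "__main__":
    sample = [
        "Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times",
        "user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times",
    ]
    for (user, api), count in sorted(removed_api_usage(sample).items()):
        print(f"{user}\t{api}\t{count}")
```

Lines that start with `api ...` carry only per-API totals and are skipped, since they do not identify the client.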

This is required to unblock https://github.com/openshift/origin/pull/27561

Description of problem:

When running a hypershift HostedCluster with a publicAndPrivate / private setup behind a proxy, Nodes never go ready.

ovn-kubernetes pods fail to run because the init container fails.

[root@ip-10-0-129-223 core]# crictl logs cf142bb9f427d
+ [[ -f /env/ ]]
++ date -Iseconds
2023-01-25T12:18:46+00:00 - checking sbdb
+ echo '2023-01-25T12:18:46+00:00 - checking sbdb'
+ echo 'hosts: dns files'
+ proxypid=15343
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ sbdb_ip=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645
+ retries=0
+ ovn-sbctl --no-leader-only --timeout=5 --db=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645 -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt get-connection
+ exec socat TCP-LISTEN:9645,reuseaddr,fork PROXY:10.0.140.167:ovnkube-sbdb.apps.agl-proxy.hypershift.local:443,proxyport=3128
ovn-sbctl: ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645: database connection failed ()
+ ((  retries += 1  ))


Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always.

Steps to Reproduce:

1. Create a publicAndPrivate hypershift HostedCluster behind a proxy. E.g.:
./bin/hypershift create cluster \
aws --pull-secret ~/www/pull-secret-ci.txt \
--ssh-key ~/.ssh/id_ed25519.pub \
--name agl-proxy \
--aws-creds ~/www/config/aws-osd-hypershift-creds \
--node-pool-replicas=3 \
--region=us-east-1 \
--base-domain=agl.hypershift.devcluster.openshift.com \
--zones=us-east-1a \
--endpoint-access=PublicAndPrivate \
--external-dns-domain=agl-services.hypershift.devcluster.openshift.com --enable-proxy=true

2. Get the kubeconfig for the guest cluster. E.g.:
kubectl get secret -nclusters agl-proxy-admin-kubeconfig  -oyaml

3. Get pods in the guest cluster.
The ovnkube-node pods' init container fails with the same crictl output shown in the description of the problem above.

To create a bastion and ssh into the nodes, see https://hypershift-docs.netlify.app/how-to/debug-nodes/

Actual results:

Nodes unready

Expected results:

Nodes go ready

Additional info:

 

Persistent build failures have been detected for following components:

  • openshift-enterprise-builder-container
[2023-02-13 16:07:03,317 cachito.workers.tasks.utils DEBUG utils.get_request_state] Getting the state of request 639952
[2023-02-13 16:07:03,461 cachito.workers.tasks.general INFO general.fetch_app_source] Fetching the source from "https://github.com/openshift-priv/builder" at reference "9caad619934afe7d04f14e68b3d2cb55b205bf6d"
[2023-02-13 16:07:03,462 cachito.workers.tasks.utils INFO utils.set_request_state] Setting the state of request 639952 to "in_progress" with the reason "Fetching the application source"
[2023-02-13 16:07:03,618 cachito.workers.scm DEBUG scm.repo_name] Parsed the repository name "openshift-priv/builder" from https://github.com/openshift-priv/builder
[2023-02-13 16:07:03,618 cachito.workers.paths DEBUG paths.__new__] Ensure directory /var/lib/cachito/sources/openshift-priv/builder exists.
[2023-02-13 16:07:03,621 cachito.workers.scm DEBUG scm.fetch_source] The archive already exists at "/var/lib/cachito/sources/openshift-priv/builder/9caad619934afe7d04f14e68b3d2cb55b205bf6d.tar.gz"
[2023-02-13 16:07:03,621 cachito.workers.scm DEBUG scm._verify_archive] Verifying the archive at /var/lib/cachito/sources/openshift-priv/builder/9caad619934afe7d04f14e68b3d2cb55b205bf6d.tar.gz
[2023-02-13 16:07:19,356 cachito.workers.paths DEBUG paths.__new__] Ensure directory /var/lib/cachito/bundles/temp/639952 exists.
[2023-02-13 16:07:19,357 cachito.workers.paths DEBUG paths.__new__] Ensure directory /var/lib/cachito/bundles/temp/639952/deps exists.
[2023-02-13 16:07:19,358 cachito.workers.tasks.general DEBUG general.fetch_app_source] Extracting /var/lib/cachito/sources/openshift-priv/builder/9caad619934afe7d04f14e68b3d2cb55b205bf6d.tar.gz to /var/lib/cachito/bundles/temp/639952
[2023-02-13 16:07:37,696 celery.app.trace INFO trace.info] Task cachito.workers.tasks.general.fetch_app_source[3c6d1b1c-cadf-4722-842b-ac0dd1929f26] succeeded in 34.38259743398521s: None
[2023-02-13 16:07:37,700 cachito.workers.tasks.utils DEBUG utils.get_request_state] Getting the state of request 639952
[2023-02-13 16:07:37,843 cachito.workers.tasks.gomod INFO gomod.fetch_gomod_source] Go version: go version go1.18.10 linux/amd64
[2023-02-13 16:07:37,844 cachito.workers.paths DEBUG paths.__new__] Ensure directory /var/lib/cachito/bundles/temp/639952 exists.
[2023-02-13 16:07:37,844 cachito.workers.paths DEBUG paths.__new__] Ensure directory /var/lib/cachito/bundles/temp/639952/deps exists.
[2023-02-13 16:07:37,846 cachito.workers.tasks.gomod DEBUG gomod._find_missing_gomod_files] Testing for go mod file in go.mod
[2023-02-13 16:07:38,055 cachito.workers.tasks.gomod INFO gomod.fetch_gomod_source] Fetching the gomod dependencies for request 639952 in subpath .
[2023-02-13 16:07:38,055 cachito.workers.tasks.utils INFO utils.set_request_state] Setting the state of request 639952 to "in_progress" with the reason "Fetching the gomod dependencies at the "." directory"
[2023-02-13 16:07:38,184 cachito.workers.tasks.utils DEBUG utils.get_request] Getting request 639952
[2023-02-13 16:07:38,277 cachito.workers.pkg_managers.gomod INFO gomod.resolve_gomod] Downloading the gomod dependencies
[2023-02-13 16:07:38,277 cachito.workers.pkg_managers.gomod DEBUG gomod.run_go] Running ('go', 'mod', 'download')
[2023-02-13 16:08:21,502 cachito.workers ERROR __init__.run_cmd] The command "go list -mod readonly -m -f {{ if not .Main }}{{ .String }}{{ end }} all" failed with: go: k8s.io/dynamic-resource-allocation@v0.0.0: reading https://cachito-athens.cachito-prod.svc/k8s.io/dynamic-resource-allocation/@v/v0.0.0.info: 404 Not Found
[2023-02-13 16:08:23,586 cachito.workers.tasks.utils INFO utils.set_request_state] Setting the state of request 639952 to "failed" with the reason "Processing gomod dependencies failed"
[2023-02-13 16:08:29,021 celery.app.trace ERROR trace._log_error] Task cachito.workers.tasks.gomod.fetch_gomod_source[e1183761-a3ac-4473-b4a8-7a7ba8e7e466] raised unexpected: CachitoError('Processing gomod dependencies failed')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/src/cachito/workers/tasks/utils.py", line 145, in task_with_state_check
    return task_fn(*args, **kwargs)
  File "/src/cachito/workers/tasks/gomod.py", line 187, in fetch_gomod_source
    gomod = resolve_gomod(
  File "/src/cachito/workers/pkg_managers/gomod.py", line 221, in resolve_gomod
    go_list_output = run_gomod_cmd(
  File "/src/cachito/workers/__init__.py", line 43, in run_cmd
    raise CachitoCalledProcessError(
cachito.errors.CachitoError: Processing gomod dependencies failed

They all seem related to the same issue with go dependencies:

atomic_reactor.utils.cachito - ERROR - Request <x> is in "failed" state: Processing gomod dependencies failed 

Description of problem:

4.9 and 4.10 oc calls to oc adm upgrade channel ... for 4.11+ clusters would clear spec.capabilities. Not all that many clusters try to restrict capabilities, but folks will need to bump their channel for at least every other minor (if they're using EUS channels), and while we recommend folks use an oc from the 4.y they're heading towards, we don't have anything in place to enforce that.

Version-Release number of selected component (if applicable):

4.9 and 4.10 oc are exposed vs. the new-in-4.11 spec.capabilities. Newer oc could theoretically be exposed vs. any new ClusterVersion spec capabilities.

How reproducible:

100%

Steps to Reproduce:

1. Install a 4.11+ cluster with None capabilities.
2. Set the channel with a 4.10.51 oc, like oc adm upgrade channel fast-4.11.
3. Check the capabilities with oc get -o json clusterversion version | jq -c .spec.capabilities.

Actual results:

null

Expected results:

{"baselineCapabilitySet":"None"}
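
The report does not spell out the mechanism, so the following is only a sketch of one plausible way such a field can be lost: a client built against an older type decodes the object, silently drops the fields it does not know, and writes the whole object back. The class and function names here are hypothetical and are not taken from oc.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class OldClusterVersionSpec:
    """Hypothetical pre-4.11 client view of the spec: no 'capabilities' field."""
    clusterID: str
    channel: Optional[str] = None


def full_update_with_old_client(server_spec: dict, new_channel: str) -> dict:
    """Decode into the old type, change the channel, write everything back."""
    known = {f.name for f in fields(OldClusterVersionSpec)}
    # Fields unknown to the old type are dropped during decoding.
    spec = OldClusterVersionSpec(**{k: v for k, v in server_spec.items() if k in known})
    spec.channel = new_channel
    return asdict(spec)


def patch_channel(server_spec: dict, new_channel: str) -> dict:
    """Change only the channel and leave every other field untouched."""
    updated = dict(server_spec)
    updated["channel"] = new_channel
    return updated


server = {
    "clusterID": "example",
    "channel": "stable-4.11",
    "capabilities": {"baselineCapabilitySet": "None"},
}
print(full_update_with_old_client(server, "fast-4.11").get("capabilities"))
print(patch_channel(server, "fast-4.11").get("capabilities"))
```

Under this assumption, a write that touches only the channel field would not be affected by fields the client has never heard of.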

When the user specifies the 'vendor' hint, it actually checks for the value of the 'model' hint in the vendor field.
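
A minimal sketch of the kind of mix-up described above, with hypothetical function names and dictionary-based hints and devices (the real hint-matching code is not shown in this report and may compare values differently):

```python
def vendor_matches_buggy(device: dict, hints: dict) -> bool:
    """Illustrates the reported bug: the 'model' hint is compared to the vendor field."""
    if "vendor" not in hints:
        return True
    model_hint = hints.get("model")
    return model_hint is not None and model_hint.lower() == device["vendor"].lower()


def vendor_matches_fixed(device: dict, hints: dict) -> bool:
    """Compares the 'vendor' hint to the vendor field, as the user intended."""
    if "vendor" not in hints:
        return True
    return hints["vendor"].lower() == device["vendor"].lower()


device = {"vendor": "ExampleVendor", "model": "ExampleModel"}
hints = {"vendor": "examplevendor"}
print(vendor_matches_buggy(device, hints), vendor_matches_fixed(device, hints))
```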

Tracker issue for bootimage bump in 4.13. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-2996.

Please review the following PR: https://github.com/openshift/alibaba-cloud-csi-driver/pull/20

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

During first bootstrap boot we need crio and kubelet on the disk, so we start the release-image-pivot systemd task. However, it's not blocking bootkube, so these two run in parallel.

release-image-pivot restarts the node to apply the new OS image, which may leave bootkube in an inconsistent state. This task should run before bootkube.

Juniper has a problem with their deployment when we apply node labels before the node is ready. To fix it, we should apply them only when the node is ready.
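
As a rough sketch of that gating, assuming the node is available as a dictionary shaped like a Kubernetes Node object (the helper names are hypothetical):

```python
def node_is_ready(node: dict) -> bool:
    """True when the node reports a condition of type Ready with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions)


def labels_to_apply(node: dict, desired_labels: dict) -> dict:
    """Return the labels to set now; nothing is applied until the node is ready."""
    return dict(desired_labels) if node_is_ready(node) else {}


not_ready = {"status": {"conditions": [{"type": "Ready", "status": "False"}]}}
ready = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
print(labels_to_apply(not_ready, {"example.com/role": "edge"}))
print(labels_to_apply(ready, {"example.com/role": "edge"}))
```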

This is a clone of issue OCPBUGS-10794. The following is the description of the original issue:

Description of problem:

Our telemetry contains only the vCenter version ("7.0.3") and not the exact build number. We need the build number to know which exact vCenter build the user has and which bugs are fixed there (e.g. https://issues.redhat.com/browse/OCPBUGS-5817).

 

Description of problem:

When installing a 3 master + 2 worker BM IPv6 cluster with proxy, worker BMHs are failing inspection with the message: "Could not contact ironic-inspector for version discovery: Unable to find a version discovery document". This causes the installation to fail due to nodes with worker role never joining the cluster. However, when installing with no workers, the issue does not reproduce and the cluster installs successfully.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-04-203333

How reproducible:

100%

Steps to Reproduce:

1. Attempt to install an IPv6 cluster with 3 masters + 2 workers and proxy with baremetal installer

Actual results:

Installation never completes because a number of pods are in Pending status

Expected results:

Workers join the cluster and installation succeeds 

Additional info:

$ oc get events
LAST SEEN   TYPE     REASON              OBJECT                               MESSAGE
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-1   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionError     baremetalhost/openshift-worker-0-0   Failed to inspect hardware. Reason: unable to start inspection: Could not contact ironic-inspector for version discovery: Unable to find a version discovery document at https://[fd2e:6f44:5dd8::37]:5050, the service is unavailable or misconfigured. Required version range (any - any), version hack disabled.
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-0   Hardware inspection started
174m        Normal   InspectionStarted   baremetalhost/openshift-worker-0-1   Hardware inspection started

In OPNET-133, this PR merged and now if the API or Ingress VIPs are on a subnet that is not part of any interface on a worker node, the worker will be labelled as a remote-worker.

 

This is problematic in the context of an external load balancer, where we don't use Keepalived anyway and VIP management is relocated to an external system. The VIP can therefore be part of any subnet; this is not OpenShift's business.

 

The problematic code is here: https://github.com/openshift/baremetal-runtimecfg/pull/207/files#diff-2b9ef0949e77d903141e49d824606ea166ffe2e95ad18a302212a49a17149191R116-R122
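
The behaviour described above can be sketched as follows. This is not the baremetal-runtimecfg code: the function names and the external_lb flag are assumptions, and skipping the check for an external load balancer is only one possible way to address the problem.

```python
import ipaddress


def vip_on_local_subnet(vip: str, interface_cidrs: list) -> bool:
    """True if the VIP falls inside the subnet of at least one local interface."""
    addr = ipaddress.ip_address(vip)
    return any(addr in ipaddress.ip_interface(cidr).network for cidr in interface_cidrs)


def is_remote_worker(vips: list, interface_cidrs: list, external_lb: bool = False) -> bool:
    """Label as remote-worker when any VIP is outside every local subnet.

    With an external load balancer (hypothetical flag) the check is skipped,
    because the VIPs may legitimately live on any subnet.
    """
    if external_lb:
        return False
    return any(not vip_on_local_subnet(vip, interface_cidrs) for vip in vips)


interfaces = ["10.0.0.12/24"]
print(is_remote_worker(["192.168.111.5"], interfaces))
print(is_remote_worker(["192.168.111.5"], interfaces, external_lb=True))
```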

Current behavior
kube-api assisted service should approve day-2 host CSRs when "user-managed-networking"

Expected behavior
kube-api assisted service should approve day-2 host CSRs when "user-managed-networking" or when <not "user-managed-networking" and the day-2 hosts don't have BMHs>

Why
Typically, we don't bother approving host CSRs when not "user-managed-networking" because the baremetal platform takes care of that for us.

But the baremetal platform only takes care of that for us when we create Machine resources for the day-2 host.

But we only create said Machine resources when the day-2 host has a BMH.

So to summarize, when the day-2 host doesn't have BMHs, and the cluster-to-be-joined is not a "user-managed-networking" cluster, the service currently doesn't approve CSRs, but it should, because the baremetal platform won't do it, because we don't create a Machine resource for the day-2 host
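
The current and expected rules above reduce to two small predicates. The function names are hypothetical; the logic follows the description directly.

```python
def approves_csr_current(user_managed_networking: bool, host_has_bmh: bool) -> bool:
    """Current behavior: approve only for user-managed-networking clusters."""
    return user_managed_networking


def approves_csr_expected(user_managed_networking: bool, host_has_bmh: bool) -> bool:
    """Expected behavior: also approve when the day-2 host has no BMH, since no
    Machine resource is created and the baremetal platform will not approve it."""
    return user_managed_networking or not host_has_bmh


for umn in (True, False):
    for bmh in (True, False):
        print(umn, bmh, approves_csr_current(umn, bmh), approves_csr_expected(umn, bmh))
```

The two predicates differ in exactly one case: no user-managed networking and no BMH, which is the gap this issue describes.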

In situations where users are struggling to determine why their network configuration is not taking effect it would be helpful to log a bit more verbosely from the service that sets up the network config files.

Specifically, if a host does not match any of the provided mappings we currently write "None of host directories are a match for the current host", which is not particularly helpful. It should at least say something about possible MAC address mismatches. We could probably also find a few more places to log something useful.

This case [1] in particular ended up being a MAC mismatch that was a bit difficult to find.

[1] https://coreos.slack.com/archives/CUPJTHQ5P/p1665763066456429
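
As an illustration of what a more helpful message could contain, here is a sketch that lists the MAC addresses found on the host next to the ones each mapping expects. The function names, the message wording, and the data layout are all assumptions, not the service's actual code.

```python
def normalize_mac(mac: str) -> str:
    """Lowercase the address and use ':' separators so comparisons are consistent."""
    return mac.strip().lower().replace("-", ":")


def explain_no_match(host_macs: list, mappings: dict) -> str:
    """Build a log message showing host MACs and the MACs each host directory expects."""
    host = sorted(normalize_mac(m) for m in host_macs)
    lines = [
        "None of the host directories match the current host; "
        "check for MAC address mismatches.",
        "  host interfaces: " + ", ".join(host),
    ]
    for directory, macs in sorted(mappings.items()):
        expected = sorted(normalize_mac(m) for m in macs)
        lines.append("  " + directory + " expects: " + ", ".join(expected))
    return "\n".join(lines)


message = explain_no_match(["52:54:00:AA:BB:01"], {"host0": ["52:54:00:aa:bb:02"]})
print(message)
```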

This is a clone of issue OCPBUGS-10376. The following is the description of the original issue:

Description of problem:

New MicroShift commands were added to oc gen-docs by https://github.com/openshift/oc/pull/1357. However, there is no Makefile target to trigger these commands.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

make generate-docs-microshift and make generate-docs-admin-microshift commands work

Additional info:

 

Description of problem:

With public/private DNS zones in the service project specified, the related record-sets are not deleted after destroying the cluster.

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install 4.12.0-0.nightly-2022-10-25-210451
built from commit 14d496fdaec571fa97604a487f5df6a0433c0c68
release image registry.ci.openshift.org/ocp/release@sha256:d6cc07402fee12197ca1a8592b5b781f9f9a84b55883f126d60a3896a36a9b74
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. try IPI installation to a shared VPC, with public/private DNS zones in the service project
2. try destroying the cluster

Actual results:

After destroying the cluster, the dns record-sets created by installer are not deleted.

Expected results:

The dns record-sets created by the installer should be deleted when destroying the cluster.

Additional info:

1. the pre-configured DNS zones in the service project
$ gcloud dns managed-zones list --filter='name=qe1'
NAME  DNS_NAME                           DESCRIPTION  VISIBILITY
qe1   qe1.gcp.devcluster.openshift.com.               public
$ gcloud dns managed-zones list --filter='name=ipi-xpn-private-zone'
NAME                  DNS_NAME                                       DESCRIPTION                         VISIBILITY
ipi-xpn-private-zone  jiwei-1026a.qe1.gcp.devcluster.openshift.com.  Preserved private zone for IPI XPN  private
$ 

2. the install-config snippet
$ yq-3.3.0 r test4/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  computeSubnet: installer-shared-vpc-subnet-2
  controlPlaneSubnet: installer-shared-vpc-subnet-1
  createFirewallRules: Disabled
  publicDNSZone:
    id: qe1
  privateDNSZone:
    id: ipi-xpn-private-zone
  network: installer-shared-vpc
  networkProjectID: openshift-qe-shared-vpc
$ yq-3.3.0 r test4/install-config.yaml baseDomain
qe1.gcp.devcluster.openshift.com
$ 

3. manually create the required credentials and then try creating cluster, which failed finally (see https://issues.redhat.com/browse/OCPBUGS-2877)
4. destroy the cluster and then make sure everything created by the installer would be deleted
$ openshift-install destroy cluster --dir test4
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Stopped instance jiwei-1026a-sx4ph-worker-a-9xhnn
INFO Stopped instance jiwei-1026a-sx4ph-worker-b-ctfw9
INFO Stopped instance jiwei-1026a-sx4ph-master-1
INFO Stopped instance jiwei-1026a-sx4ph-master-2
INFO Stopped instance jiwei-1026a-sx4ph-master-0
INFO Deleted IAM project role bindings
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a-sx4ph-w@openshift-qe.iam.gserviceaccount.com
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a--openshift-g-16867@openshift-qe.iam.gserviceaccount.com
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a-sx4ph-m@openshift-qe.iam.gserviceaccount.com
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a--openshift-g-2385@openshift-qe.iam.gserviceaccount.com 
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a--cloud-crede-22053@openshift-qe.iam.gserviceaccount.com 
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a--openshift-i-6003@openshift-qe.iam.gserviceaccount.com 
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a--openshift-i-18195@openshift-qe.iam.gserviceaccount.com 
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a--openshift-c-23280@openshift-qe.iam.gserviceaccount.com 
INFO Deleted service account projects/openshift-qe/serviceAccounts/jiwei-1026a--openshift-m-17306@openshift-qe.iam.gserviceaccount.com 
INFO Deleted bucket jiwei-1026a-sx4ph-image-registry-us-central1-osbvfoiraqweywmet 
INFO Deleted instance jiwei-1026a-sx4ph-master-0  
INFO Deleted instance jiwei-1026a-sx4ph-worker-a-9xhnn 
INFO Deleted instance jiwei-1026a-sx4ph-master-1  
INFO Deleted instance jiwei-1026a-sx4ph-worker-b-ctfw9 
INFO Deleted instance jiwei-1026a-sx4ph-master-2  
INFO Deleted disk jiwei-1026a-sx4ph-master-1      
INFO Deleted disk jiwei-1026a-sx4ph-worker-b-ctfw9 
INFO Deleted disk jiwei-1026a-sx4ph-master-2
INFO Deleted disk jiwei-1026a-sx4ph-master-0
INFO Deleted disk jiwei-1026a-sx4ph-worker-a-9xhnn
INFO Deleted address jiwei-1026a-sx4ph-cluster-ip
INFO Deleted address jiwei-1026a-sx4ph-cluster-public-ip
INFO Deleted forwarding rule jiwei-1026a-sx4ph-api
INFO Deleted forwarding rule jiwei-1026a-sx4ph-api-internal
INFO Deleted target pool jiwei-1026a-sx4ph-api
INFO Deleted backend service jiwei-1026a-sx4ph-api-internal
INFO Deleted instance group jiwei-1026a-sx4ph-master-us-central1-c
INFO Deleted instance group jiwei-1026a-sx4ph-master-us-central1-b
INFO Deleted instance group jiwei-1026a-sx4ph-master-us-central1-a
INFO Deleted health check jiwei-1026a-sx4ph-api-internal
INFO Deleted HTTP health check jiwei-1026a-sx4ph-api
INFO Time elapsed: 4m13s   
$ 
$ gcloud dns record-sets list --zone qe1 --format="table(type,name,rrdatas)" --filter="name~jiwei-1026a"
TYPE  NAME                                               RRDATAS
A     api.jiwei-1026a.qe1.gcp.devcluster.openshift.com.  ['34.71.50.187']
$ 
$ gcloud dns record-sets list --zone ipi-xpn-private-zone --format="table(type,name,rrdatas)" --filter="name~jiwei-1026a AND type=A"
TYPE  NAME                                                   RRDATAS
A     api.jiwei-1026a.qe1.gcp.devcluster.openshift.com.      ['10.0.0.10']
A     api-int.jiwei-1026a.qe1.gcp.devcluster.openshift.com.  ['10.0.0.10']
$

 

 

Description of problem:

4.13 OCP install fails at machine-config.

$ oc get co:
..
machine-config                                                             True        True          True       68m     Unable to apply 4.13.0-0.ci-2023-01-15-132450: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]

# error in machine-config pod log:
I0116 16:30:56.832225   56878 daemon.go:1257] In bootstrap mode
E0116 16:30:56.832288   56878 writer.go:200] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-82c786ab9a48b1ac17baba2e07bcc19d" not found

Version-Release number of selected component (if applicable):

4.13.0-0.ci-2023-01-15-132450 
4.13.0-0.nightly-2023-01-11-061758

How reproducible:

100%  

Steps to Reproduce:

1. Install SNO with 4.13 nightly or ci build
2. monitor the install via "oc get clusterversion" and "oc get co"
3.

Actual results:

OCP install failed because the MCP is degraded

[kni@registry.ran-vcl01 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         66m     Error while reconciling 4.13.0-0.ci-2023-01-15-132450: the cluster operator machine-config is degraded

Expected results:

OCP install succeeded

Additional info:

Reproducible on SNO with 4.13 nightly or ci builds.

$ oc get co:
..
machine-config                                                             True        True          True       68m     Unable to apply 4.13.0-0.ci-2023-01-15-132450: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)] 

# error in machine-config pod log:
I0116 16:30:56.832225   56878 daemon.go:1257] In bootstrap mode
E0116 16:30:56.832288   56878 writer.go:200] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-82c786ab9a48b1ac17baba2e07bcc19d" not found

 

 1. Analyze the current gc flow that ends with host deletion and find caveats
 2. provide an additional mechanism for cleaning up orphan hosts 

Looking at production (SaaS), several types of hosts still hang around in the hosts table although their cluster is already gone:
1) Disabled hosts (a deprecated status)
2) Disconnected hosts
3) Some hosts that are in a valid state but have no corresponding cluster/infraenv
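
A sketch of how such orphans could be identified, assuming a host counts as orphaned when neither its cluster nor its infraenv still exists. The Host type and its field names are hypothetical and do not reflect the service's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    id: str
    status: str
    cluster_id: Optional[str]
    infra_env_id: Optional[str]


def find_orphan_hosts(hosts: list, cluster_ids: set, infra_env_ids: set) -> list:
    """Return hosts whose cluster and infraenv are both missing (assumed definition)."""
    return [
        h for h in hosts
        if h.cluster_id not in cluster_ids and h.infra_env_id not in infra_env_ids
    ]


hosts = [
    Host("h1", "disabled", "gone-cluster", "gone-env"),
    Host("h2", "disconnected", None, "env-1"),
    Host("h3", "known", "cluster-1", "env-1"),
]
orphans = find_orphan_hosts(hosts, {"cluster-1"}, {"env-1"})
print([h.id for h in orphans])
```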

Description of problem:

[AWS-EBS-CSI-Driver] A volume provisioned with a customer-managed KMS key cannot restore its snapshot successfully

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.12.0-ec.3
Kustomize Version: v4.5.4
Server Version: 4.13.0-0.nightly-2023-01-01-223309
Kubernetes Version: v1.25.2+0003605

I tested 4.11.z and a 4.12 nightly; both have the same issue.

How reproducible:

Always

Steps to Reproduce:

1. Create aws ebs csi storageClass with customer managed kms key, volumeBindingMode: Immediate;
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: my-kms-csi
provisioner: ebs.csi.aws.com
parameters:
  kmsKeyId: 'arn:aws:kms:us-east-2:301721915996:key/17e63c2f-0c10-4680-97a2-4664f974e2e4'
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

2. Create a PVC with the CSI storageClass and, after the volume is provisioned successfully, create a snapshot of the volume with the preset VolumeSnapshotClass/csi-aws-vsc;
# Origin pvc
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-ori
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: my-kms-csi
  volumeMode: Filesystem
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  annotations:
    snapshot.storage.kubernetes.io/pvc-access-modes: ReadWriteOnce
    snapshot.storage.kubernetes.io/pvc-volume-mode: Filesystem
  name: pvc-ori-snapshot
spec:
  source:
    persistentVolumeClaimName: pvc-ori
  volumeSnapshotClassName: csi-aws-vsc

3. Wait for volumesnapshot/pvc-ori-snapshot to become ReadyToUse, then create a PVC that restores the snapshot with storageClass/my-kms-csi
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-ori-restore
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: my-kms-csi
  volumeMode: Filesystem
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: pvc-ori-snapshot

4. Wait for the restored volume to be provisioned successfully.

Actual results:

In step 4: the volume couldn't be provisioned successfully; the PVC is stuck at 'Pending'
failed to provision volume with StorageClass "my-kms-csi": rpc error: code = Internal desc = Could not create volume "pvc-a1dd6aa6-1339-4cf1-9e10-16580e00ef0a": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-002e6f75fc9d2e868' does not exist. status code: 400, request id: 2361646d-a9af-4bb2-a2e1-7268bf032292

Expected results:

In Step4 : The volume should be provisioned successfully

Additional info:

$ oc logs -l app=aws-ebs-csi-driver-controller -c csi-provisioner --tail=-1 | grep 'pvc-ori-restore'
I0105 07:00:26.428554       1 controller.go:1337] provision "default/pvc-ori-restore" class "my-kms-csi": started
I0105 07:00:26.428831       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"pvc-ori-restore", UID:"a1dd6aa6-1339-4cf1-9e10-16580e00ef0a", APIVersion:"v1", ResourceVersion:"170970", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/pvc-ori-restore"
I0105 07:00:26.436091       1 connection.go:184] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2c"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2a"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2b"}}],"requisite":[{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2c"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2a"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2b"}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-a1dd6aa6-1339-4cf1-9e10-16580e00ef0a","parameters":{"csi.storage.k8s.io/pv/name":"pvc-a1dd6aa6-1339-4cf1-9e10-16580e00ef0a","csi.storage.k8s.io/pvc/name":"pvc-ori-restore","csi.storage.k8s.io/pvc/namespace":"default","kmsKeyId":"arn:aws:kms:us-east-2:301721915996:key/17e63c2f-0c10-4680-97a2-4664f974e2e4"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}],"volume_content_source":{"Type":{"Snapshot":{"snapshot_id":"snap-0c3b1cb7358296c1f"}}}}
I0105 07:00:29.892138       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"pvc-ori-restore", UID:"a1dd6aa6-1339-4cf1-9e10-16580e00ef0a", APIVersion:"v1", ResourceVersion:"170970", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "my-kms-csi": rpc error: code = Internal desc = Could not create volume "pvc-a1dd6aa6-1339-4cf1-9e10-16580e00ef0a": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-002e6f75fc9d2e868' does not exist.
I0105 07:00:30.893007       1 controller.go:1337] provision "default/pvc-ori-restore" class "my-kms-csi": started
I0105 07:00:30.893113       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"pvc-ori-restore", UID:"a1dd6aa6-1339-4cf1-9e10-16580e00ef0a", APIVersion:"v1", ResourceVersion:"170970", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/pvc-ori-restore"
I0105 07:00:30.899636       1 connection.go:184] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2c"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2a"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2b"}}],"requisite":[{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2a"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2b"}},{"segments":{"topology.ebs.csi.aws.com/zone":"us-east-2c"}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-a1dd6aa6-1339-4cf1-9e10-16580e00ef0a","parameters":{"csi.storage.k8s.io/pv/name":"pvc-a1dd6aa6-1339-4cf1-9e10-16580e00ef0a","csi.storage.k8s.io/pvc/name":"pvc-ori-restore","csi.storage.k8s.io/pvc/namespace":"default","kmsKeyId":"arn:aws:kms:us-east-2:301721915996:key/17e63c2f-0c10-4680-97a2-4664f974e2e4"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}],"volume_content_source":{"Type":{"Snapshot":{"snapshot_id":"snap-0c3b1cb7358296c1f"}}}}
I0105 07:00:30.902068       1 round_trippers.go:553] PATCH https://172.30.0.1:443/api/v1/namespaces/default/events/pvc-ori-restore.17375787954e8b58 200 OK in 8 milliseconds
I0105 07:00:31.207107       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"pvc-ori-restore", UID:"a1dd6aa6-1339-4cf1-9e10-16580e00ef0a", APIVersion:"v1", ResourceVersion:"170970", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "my-kms-csi": rpc error: code = AlreadyExists desc = Could not create volume "pvc-a1dd6aa6-1339-4cf1-9e10-16580e00ef0a": Parameters on this idempotent request are inconsistent with parameters used in previous request(s)

Please review the following PR: https://github.com/openshift/cluster-authentication-operator/pull/595

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

As mentioned in this comment on the PR used to separate out tests for the "NodeHas*" events for TRT-596, it looks like the events with lower numbers need to be counted as passing.

We need to follow similar logic (like this) when the tests were still part of the "[sig-arch] events should not repeat pathologically" test.

Description of the problem:

In staging, BE 2.13.1: while a cluster is installing, booting other hosts with the cluster's ISO causes those hosts to try to register to the cluster. Because the cluster is in installing/installed mode, it cannot accept more hosts. The agent then sends a message to the service every minute, which piles up into a huge number of messages.
While the cluster is installing, the messages are:

 Host can register only in one of the following states: [insufficient ready pending-for-input adding-hosts] 

After the cluster is installed:

  Cannot add hosts to an existing cluster using the original Discovery ISO. Try to add new hosts by using the Discovery ISO that can be found in console.redhat.com under your cluster “Add hosts” tab.

How reproducible:

100%

Steps to reproduce:

1. Discover 5 nodes 

2. After discovery, delete the hosts from the cluster and discover 5 other new hosts

3. Start installation with the newly discovered hosts

4. Turn on the previous 5 nodes that were deleted and shut down

Actual results:

 

Expected results:

Description of problem:

Two tests are perma failing in metal-ipi upgrade tests
[sig-imageregistry] Image registry remains available using new connections    39m27s
[sig-imageregistry] Image registry remains available using reused connections    39m27s

Version-Release number of selected component (if applicable):

4.12 / 4.13

How reproducible:

all ci runs

Steps to Reproduce:

1.
2.
3.

Actual results:

Nov 24 02:58:26.998: INFO: "[sig-imageregistry] Image registry remains available using reused connections": panic: runtime error: invalid memory address or nil pointer dereference

Expected results:

pass

Additional info:

 

Description of problem:
The "Default Git type to other" info alert appears when the Git type cannot be detected. The alert is misleading once the user changes the Git type.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. In the Import from Git flow, enter the Git URL https://stash.danskenet.net/scm/~bc0508/github.com_sclorg_django-ex.git
2. Notice the info alert below the Git type field
3. Change the Git type

Actual results:

The "Default Git type to other" info alert always appears when the Git type cannot be detected, even after the user changes it

Expected results:

The "Default Git type to other" info alert should be removed if the user changes the Git type

Additional info:

 

Description of problem:

We have an ODF bug for it here: https://bugzilla.redhat.com/show_bug.cgi?id=2169779

Discussed in forum-storage with Hemant here:
https://redhat-internal.slack.com/archives/CBQHQFU0N/p1677085216391669

We were asked to open a bug for it.

This is currently blocking ODF 4.13 deployment on vSphere.

Version-Release number of selected component (if applicable):

 

How reproducible:

YES

Steps to Reproduce:

1. Deploy ODF 4.13 on vSphere with `thin-csi` SC
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of the problem:

The infraenv and cluster PATCH APIs silently ignore misspelled field names: the request has no effect and no error is returned.

How reproducible:

100%

Steps to reproduce:

1. Call the PATCH API and make a typo in one of the field names (e.g. cluster PATCH with "base_dns_doman" instead of "base_dns_domain")

Actual results:

Call succeeds

Expected results:
Call should fail
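
A minimal sketch of the expected behavior, assuming the service can compare the keys of a PATCH body against the set of fields it accepts. Only `base_dns_domain` comes from the report above; `KNOWN_CLUSTER_FIELDS`, its other entries, and `validate_patch` are hypothetical names used for illustration, not the actual service code.

```python
# Hypothetical subset of fields accepted by a cluster PATCH request.
# Only "base_dns_domain" is taken from the report; the rest are placeholders.
KNOWN_CLUSTER_FIELDS = {"base_dns_domain", "name", "pull_secret"}


def validate_patch(body: dict, known_fields: set) -> None:
    """Raise ValueError if the PATCH body contains any unrecognized key."""
    unknown = sorted(set(body) - known_fields)
    if unknown:
        raise ValueError("unknown field(s) in PATCH body: " + ", ".join(unknown))


if __name__ == "__main__":
    # A correctly spelled field passes validation.
    validate_patch({"base_dns_domain": "example.com"}, KNOWN_CLUSTER_FIELDS)
    # The typo from the reproduction step is rejected instead of ignored.
    try:
        validate_patch({"base_dns_doman": "example.com"}, KNOWN_CLUSTER_FIELDS)
    except ValueError as err:
        print(err)
```

Rejecting the whole request, rather than dropping the unknown key, is what makes the typo visible to the caller.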

Exposed via the fact that the periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4 job has been at 0% for at least the past two weeks, over approximately 65 runs.

Testgrid shows that this job started failing in a very consistent way on Oct 25th at about 8am UTC: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4

6 disruption tests fail with alarming consistency, virtually always reporting exactly 8s of disruption against a maximum allowed 1s.

And then openshift-tests.[sig-arch] events should not repeat pathologically fails with an odd signature:

{  6 events happened too frequently

event happened 35 times, something is wrong: node/master-2 - reason/NodeHasNoDiskPressure roles/control-plane,master Node master-2 status is now: NodeHasNoDiskPressure
event happened 35 times, something is wrong: node/master-2 - reason/NodeHasSufficientMemory roles/control-plane,master Node master-2 status is now: NodeHasSufficientMemory
event happened 35 times, something is wrong: node/master-2 - reason/NodeHasSufficientPID roles/control-plane,master Node master-2 status is now: NodeHasSufficientPID
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasNoDiskPressure roles/control-plane,master Node master-1 status is now: NodeHasNoDiskPressure
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasSufficientMemory roles/control-plane,master Node master-1 status is now: NodeHasSufficientMemory
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasSufficientPID roles/control-plane,master Node master-1 status is now: NodeHasSufficientPID}

The two types of tests started failing at exactly the same time, and the disruption measurements are bizarrely consistent: every single time we see precisely 8s for kube-api, cache-kube-api, openshift-api, cache-openshift-api, oauth-api, cache-oauth-api. It's always these 6, and it seems to be always exactly 8 seconds. I cannot state enough how strange this is. It almost implies that something is happening on a very consistent schedule.

Occasionally these are accompanied by 1-2s of disruption for those backends with new connections, but not always.

It looks like all of the disruption consistently happens within two very long tests:

4s within: [sig-network] services when running openshift ipv4 cluster ensures external ip policy is configured correctly on the cluster [Serial] [Suite:openshift/conformance/serial]

4s within: [sig-network] services when running openshift ipv4 cluster on bare metal [apigroup:config.openshift.io] ensures external auto assign cidr is configured correctly on the cluster [Serial] [Suite:openshift/conformance/serial]

Both tests appear to have run prior to Oct 25, so I don't think it's a matter of new tests breaking something or getting unskipped. Both tests also always pass, but appear to be impacting the cluster?

The masters going NotReady also appears to fall within the above two tests, though it does not seem to line up directly with when we measure disruption; bear in mind there's a 40s delay before the node goes NotReady.

Focusing on https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4/1590640492373086208 where the above are from:

Two of the three master nodes appear to be going NodeNotReady a couple of times throughout the run, as visible in the spyglass chart under the node state row on the left. master-0 does not appear here, but it does exist. (I suspect it is the leader and thus is the node reporting the others going not ready.)

From the master-0 kubelet log in must-gather we can see one of these examples where it reports that master-2 has not checked in:

2022-11-10T10:38:35.874090961Z I1110 10:38:35.873975       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.00700561s. Last Ready is: &NodeCondition{Type:Ready,Status:True,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletReady,Message:kubelet is posting ready status,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874056       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007097549s. Last MemoryPressure is: &NodeCondition{Type:MemoryPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasSufficientMemory,Message:kubelet has sufficient memory available,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874067       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007110285s. Last DiskPressure is: &NodeCondition{Type:DiskPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasNoDiskPressure,Message:kubelet has no disk pressure,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874076       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007119541s. Last PIDPressure is: &NodeCondition{Type:PIDPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasSufficientPID,Message:kubelet has sufficient PID available,}
2022-11-10T10:38:35.881749410Z I1110 10:38:35.881705       1 controller_utils.go:181] "Recording status change event message for node" status="NodeNotReady" node="master-2"
2022-11-10T10:38:35.881749410Z I1110 10:38:35.881733       1 controller_utils.go:120] "Update ready status of pods on node" node="master-2"
2022-11-10T10:38:35.881820988Z I1110 10:38:35.881799       1 controller_utils.go:138] "Updating ready status of pod to false" pod="metal3-b7b69fdbb-rfbdj"
2022-11-10T10:38:35.881893234Z I1110 10:38:35.881858       1 topologycache.go:179] Ignoring node master-2 because it has an excluded label
2022-11-10T10:38:35.881893234Z W1110 10:38:35.881886       1 topologycache.go:199] Can't get CPU or zone information for worker-0 node
2022-11-10T10:38:35.881903023Z I1110 10:38:35.881892       1 topologycache.go:215] Insufficient node info for topology hints (0 zones, %!s(int64=0) CPU, false)
2022-11-10T10:38:35.881932172Z I1110 10:38:35.881917       1 controller.go:271] Node changes detected, triggering a full node sync on all loadbalancer services
2022-11-10T10:38:35.882290428Z I1110 10:38:35.882270       1 event.go:294] "Event occurred" object="master-2" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node master-2 status is now: NodeNotReady"

Now from master-2's kubelet log around that time, 40 seconds earlier puts us at 10:37:55, so we'd be looking for something odd around there.
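
As a quick check of that arithmetic, here is a small sketch that subtracts the 40s delay from the time the NodeNotReady event was logged. The function name `window_start` is made up for illustration; the timestamp and the 40s value are taken from the log lines above.

```python
from datetime import datetime, timedelta


def window_start(not_ready_at: str, delay_seconds: int = 40) -> str:
    """Return the wall-clock time delay_seconds before the NotReady report."""
    reported = datetime.strptime(not_ready_at, "%Y-%m-%d %H:%M:%S")
    return (reported - timedelta(seconds=delay_seconds)).strftime("%H:%M:%S")


if __name__ == "__main__":
    # NodeNotReady for master-2 was recorded at 10:38:35.
    print(window_start("2022-11-10 10:38:35"))  # 10:37:55
```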

A few potential lines:

Nov 10 10:37:55.232537 master-2 kubenswrapper[1930]: I1110 10:37:55.232495    1930 patch_prober.go:29] interesting pod/kube-controller-manager-guard-master-2 container/guard namespace/openshift-kube-controller-manager: Readiness probe status=failure output="Get \"https://192.168.111.22:10257/healthz\": dial tcp 192.168.111.22:10257: connect: connection refused" start-of-body=

Nov 10 10:37:55.232537 master-2 kubenswrapper[1930]: I1110 10:37:55.232549    1930 prober.go:114] "Probe failed" probeType="Readiness" pod="openshift-kube-controller-manager/kube-controller-manager-guard-master-2" podUID=8be2c6c1-f8f6-4bf0-b26d-53ce487354bd containerName="guard" probeResult=failure output="Get \"https://192.168.111.22:10257/healthz\": dial tcp 192.168.111.22:10257: connect: connection refused"

Nov 10 10:38:12.238273 master-2 kubenswrapper[1930]: E1110 10:38:12.238229    1930 controller.go:187] failed to update lease, error: Put "https://api-int.ostest.test.metalkube.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master-2?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Nov 10 10:38:13.034109 master-2 kubenswrapper[1930]: E1110 10:38:13.034077    1930 kubelet_node_status.go:487] "Error updating node status, will retry" err="error getting node \"master-2\": Get \"https://api-int.ostest.test.metalkube.org:6443/api/v1/nodes/master-2?resourceVersion=0&timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

At 10:38:40 all kinds of master-2 watches time out with messages like:

Nov 10 10:38:40.244399 master-2 kubenswrapper[1930]: W1110 10:38:40.244272    1930 reflector.go:347] object-"openshift-oauth-apiserver"/"kube-root-ca.crt": watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

And then suddenly we're back online:

Nov 10 10:38:40.252149 master-2 kubenswrapper[1930]: I1110 10:38:40.252131    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasSufficientMemory"
Nov 10 10:38:40.252149 master-2 kubenswrapper[1930]: I1110 10:38:40.252156    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasNoDiskPressure"
Nov 10 10:38:40.252268 master-2 kubenswrapper[1930]: I1110 10:38:40.252165    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasSufficientPID"
Nov 10 10:38:40.252268 master-2 kubenswrapper[1930]: I1110 10:38:40.252177    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeReady"
Nov 10 10:38:47.904430 master-2 kubenswrapper[1930]: I1110 10:38:47.904373    1930 kubelet.go:2229] "SyncLoop (probe)" probe="readiness" status="" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:47.904842 master-2 kubenswrapper[1930]: I1110 10:38:47.904662    1930 kubelet.go:2229] "SyncLoop (probe)" probe="startup" status="unhealthy" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:47.907900 master-2 kubenswrapper[1930]: I1110 10:38:47.907872    1930 kubelet.go:2229] "SyncLoop (probe)" probe="startup" status="started" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:48.431448 master-2 kubenswrapper[1930]: I1110 10:38:48.431414    1930 kubelet.go:2229] "SyncLoop (probe)" probe="readiness" status="ready" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764029    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube-scheduler-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764059    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/keepalived-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764077    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/coredns-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764086    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/haproxy-master-2" status=Running
Nov 10 10:38:54.764492 master-2 kubenswrapper[1930]: I1110 10:38:54.764106    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-etcd/etcd-master-2" status=Running
Nov 10 10:38:54.764492 master-2 kubenswrapper[1930]: I1110 10:38:54.764113    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-controller-manager/kube-controller-manager-master-2" status=Running

Also curious:

Nov 10 10:37:50.318237 master-2 ovs-vswitchd[1324]: ovs|00251|connmgr|INFO|br0<->unix#468: 2 flow_mods in the last 0 s (2 deletes)
Nov 10 10:37:50.342965 master-2 ovs-vswitchd[1324]: ovs|00252|connmgr|INFO|br0<->unix#471: 4 flow_mods in the last 0 s (4 deletes)
Nov 10 10:37:50.364271 master-2 ovs-vswitchd[1324]: ovs|00253|bridge|INFO|bridge br0: deleted interface vethcb8d36e6 on port 41

Nov 10 10:37:53.579562 master-2 NetworkManager[1336]: <info>  [1668076673.5795] dhcp4 (enp2s0): state changed new lease, address=192.168.111.22

These look like they could be related to the tests these problems appear to coincide with?

Description of problem:

OCP cluster installation (SNO) using assisted installer running on ACM hub cluster. 
Hub cluster is OCP 4.10.33
ACM is 2.5.4

When a cluster fails to install we remove the installation CRs and cluster namespace from the hub cluster (to eventually redeploy). The termination of the namespace hangs indefinitely (14+ hours) with finalizers remaining. 

To resolve the hang we can remove the finalizers by editing both the secret pointed to by BareMetalHost .spec.bmc.credentialsName and BareMetalHost CR. When these finalizers are removed the namespace termination completes within a few seconds.
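
The workaround can be sketched as a pair of `oc patch` invocations that clear `metadata.finalizers` with a JSON merge patch. This is an assumption about how one might script it, not the procedure used in the report (which edited the objects by hand); the helper name and the object names `example-bmh` and `example-bmc-secret` are hypothetical, and only the namespace `cnfocto1` appears in this report.

```python
import json
import shlex


def finalizer_removal_command(kind: str, name: str, namespace: str) -> str:
    """Build an `oc patch` command line that clears metadata.finalizers."""
    # In a JSON merge patch, setting a key to null removes it from the object.
    patch = json.dumps({"metadata": {"finalizers": None}})
    parts = [
        "oc", "patch", kind, shlex.quote(name),
        "-n", shlex.quote(namespace),
        "--type=merge", "-p", shlex.quote(patch),
    ]
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical object names; substitute the BareMetalHost and the secret
    # referenced by its .spec.bmc.credentialsName.
    print(finalizer_removal_command("baremetalhost", "example-bmh", "cnfocto1"))
    print(finalizer_removal_command("secret", "example-bmc-secret", "cnfocto1"))
```

Clearing finalizers bypasses whatever cleanup the owning controller intended to run, so this is a recovery step rather than a fix for the hang itself.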

Version-Release number of selected component (if applicable):

OCP 4.10.33
ACM 2.5.4

How reproducible:

Always

Steps to Reproduce:

1. Generate installation CRs (AgentClusterInstall, BMH, ClusterDeployment, InfraEnv, NMStateConfig, ...) with an invalid configuration parameter. Two scenarios validated to hit this issue:
  a. Invalid rootDeviceHint in BareMetalHost CR
  b. Invalid credentials in the secret referenced by BareMetalHost.spec.bmc.credentialsName
2. Apply installation CRs to hub cluster
3. Wait for cluster installation to fail
4. Remove cluster installation CRs and namespace

Actual results:

Cluster namespace remains in terminating state indefinitely:
$ oc get ns cnfocto1
NAME       STATUS        AGE    
cnfocto1   Terminating   17h

Expected results:

Cluster namespace (and all installation CRs in it) are successfully removed.

Additional info:

The installation CRs are applied to and removed from the hub cluster using Argo CD. The CRs have the following waves applied to them, which affect the creation order (lowest to highest) and removal order (highest to lowest):
Namespace: 0
AgentClusterInstall: 1
ClusterDeployment: 1
NMStateConfig: 1
InfraEnv: 1
BareMetalHost: 1
HostFirmwareSettings: 1
ConfigMap: 1 (extra manifests)
ManagedCluster: 2
KlusterletAddonConfig: 2
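
To make the resulting ordering concrete, the following sketch sorts the kinds listed above by wave. It only models the lowest-to-highest and highest-to-lowest rule stated here; the order of kinds within the same wave is not specified in this report, so the sketch simply keeps the listed order for ties.

```python
# Wave numbers exactly as listed in the report.
WAVES = {
    "Namespace": 0,
    "AgentClusterInstall": 1,
    "ClusterDeployment": 1,
    "NMStateConfig": 1,
    "InfraEnv": 1,
    "BareMetalHost": 1,
    "HostFirmwareSettings": 1,
    "ConfigMap": 1,
    "ManagedCluster": 2,
    "KlusterletAddonConfig": 2,
}


def creation_order(waves: dict) -> list:
    """Kinds sorted from lowest to highest wave (ties keep listed order)."""
    return sorted(waves, key=lambda kind: waves[kind])


def removal_order(waves: dict) -> list:
    """Kinds sorted from highest to lowest wave (ties keep listed order)."""
    return sorted(waves, key=lambda kind: waves[kind], reverse=True)


if __name__ == "__main__":
    print("create:", creation_order(WAVES))
    print("remove:", removal_order(WAVES))
```

Under this rule the Namespace is removed last, after the BareMetalHost and the other wave 1 resources.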

 

Description of the problem:

When adding day2 workers to an SNO cluster, they are not having their CSRs auto-approved so they do not automatically join the cluster.

Release version:
4.12.0-ec.3

Operator snapshot version:
2.2.0-DOWNANDBACK-2022-09-27-22-21-40

OCP version:
4.12

Steps to reproduce:
1. Create SNO cluster using the operator
2. Add day2 workers to the cluster

Actual results:
Node resources never appear because they have pending CSRs

Expected results:
CSRs are auto-approved and nodes join cluster automatically

Additional info:

I have the suspicion that this affects all user-managed-networking clusters, but will update when I have more info.

Description of problem:

CVO hotloops on multiple different ImageStream resources, and while hotlooping, it logs the information incorrectly.

Hotlooping:

While checking the log file of the CVO in version 4.13.0-0.ci-2023-01-19-175836 we can see a new hotlooping on the ImageStream resources. Please note the whole lines. Even though the CVO logs "Updating Namespace...", we can see the additional information in the line "...due to diff:   &v1.ImageStream".

The output of checking for any hotlooping:

$ grep -io 'updating.*due to diff.*' cluster-version-operator-c98796f6b-cll9c-cluster-version-operator.log | sort | uniq -c
      2 Updating CRD alertmanagerconfigs.monitoring.coreos.com due to diff:   &v1.CustomResourceDefinition{
      2 Updating CRD consoleplugins.console.openshift.io due to diff:   &v1.CustomResourceDefinition{
      2 Updating CRD performanceprofiles.performance.openshift.io due to diff:   &v1.CustomResourceDefinition{
      2 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
      2 Updating Namespace cli-artifacts due to diff:   &v1.ImageStream{
      2 Updating Namespace cli due to diff:   &v1.ImageStream{
      2 Updating Namespace driver-toolkit due to diff:   &v1.ImageStream{
      2 Updating Namespace installer-artifacts due to diff:   &v1.ImageStream{
      2 Updating Namespace installer due to diff:   &v1.ImageStream{
      2 Updating Namespace must-gather due to diff:   &v1.ImageStream{
      2 Updating Namespace oauth-proxy due to diff:   &v1.ImageStream{
      2 Updating Namespace tests due to diff:   &v1.ImageStream{
      2 Updating Namespace tools due to diff:   &v1.ImageStream{
      2 Updating ValidatingWebhookConfiguration /controlplanemachineset.machine.openshift.io due to diff:   &unstructured.Unstructured{
      2 Updating ValidatingWebhookConfiguration /performance-addon-operator due to diff:   &unstructured.Unstructured{
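
For reference, the same count can be produced without the shell pipeline. This is a rough Python equivalent of the `grep -io ... | sort | uniq -c` command above; the function name is made up, and the sample lines in the usage example are abbreviated stand-ins rather than real CVO log lines.

```python
import re
from collections import Counter

# Same expression as the grep above, matched case-insensitively.
PATTERN = re.compile(r"updating.*due to diff.*", re.IGNORECASE)


def count_update_messages(lines) -> Counter:
    """Count each distinct 'Updating ... due to diff' match, like uniq -c."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # Like grep -o, keep only the matched part of the line.
            counts[match.group(0)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "I0119 sync.go] Updating Namespace cli due to diff:   &v1.ImageStream{\n",
        "I0119 sync.go] Running sync for imagestream\n",
        "I0119 sync.go] Updating Namespace cli due to diff:   &v1.ImageStream{\n",
    ]
    for message, count in sorted(count_update_messages(sample).items()):
        print(f"{count:7d} {message}")
```

Any message with a count above 1 within a single log is a candidate for hotlooping.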

Incorrect logging:

CVO logs "Updating Namespace" instead of "Updating ImageStream" as noted above.

It also doesn't log the reason why the update happened. CVO logs the diff but the diff doesn't include the exact lines that caused the update. This issue was already fixed for some other resources (for example, the commit that fixed a similar issue for deployments: "resourceapply: improve diff logging for deployments" [1]).


[1] https://github.com/openshift/cluster-version-operator/pull/855/commits/6065c601ae69ca63f16d12d34c1b2657a1f0d23d


Version-Release number of selected component (if applicable):

 

How reproducible:

1/1

Steps to Reproduce:

1. Install the cluster
2.
3.

Actual results:

CVO hotloops on the ImageStream resources and logs the information incorrectly.

Expected results:

CVO doesn't hotloop on the ImageStream resources. And in the case of updating the ImageStream resources, it logs the necessary information correctly.

Additional info:

 

When a HostedCluster is configured as `Private`, annotate the necessary hosted CP components (API and OAuth) so that External DNS can still create public DNS records (pointing to private IP resources).

The External DNS record should be pointing to the resource for the PrivateLink VPC Endpoint. "We need to specify the IP of the A record. We can do that with a cluster IP service."

Context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1675432805760719

DoD:

Automated way to generate release notes by component.

Formalised procedure to roll out the HO in ROSA

This is a clone of issue OCPBUGS-8215. The following is the description of the original issue:

Description of problem:

With no node-exporter configuration set in the CMO config, the two arguments collector.netclass.ignored-devices and collector.netdev.device-exclude do not appear in the node-exporter DaemonSet. For the full output, see: http://pastebin.test.redhat.com/1093428

and checked in 4.13.0-0.nightly-2023-02-27-101545, no configuration for node-exporter, there is collector.netclass.ignored-devices setting
see from: http://pastebin.test.redhat.com/1093429

After disabling netdev/netclass on both clusters, the collector.netclass.ignored-devices and collector.netdev.device-exclude settings do appear in node-exporter. Since OCPBUGS-7282 was filed against 4.12, where disabling netdev/netclass is not supported, I don't think we should disable netdev/netclass.

$ oc -n openshift-monitoring get ds node-exporter -oyaml | grep collector
        - --no-collector.wifi
        - --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*|cali[a-f0-9]*)$
        - --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*|cali[a-f0-9]*)$
        - --collector.cpu.info
        - --collector.textfile.directory=/var/node_exporter/textfile
        - --no-collector.cpufreq
        - --no-collector.tcpstat
        - --no-collector.netdev
        - --no-collector.netclass
        - --no-collector.buddyinfo
        - '[[ ! -d /node_exporter/collectors/init ]] || find /node_exporter/collectors/init

Version-Release number of selected component (if applicable):

4.13

How reproducible:


Steps to Reproduce:

The 2 arguments are missing when booting up OCP with default configurations for CMO.

Actual results:

The 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude are missing in node-exporter DaemonSet.

Expected results:

The 2 arguments collector.netclass.ignored-devices and collector.netdev.device-exclude are present in node-exporter DaemonSet.
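
A small sketch of how the expected result could be checked, assuming the container argument list has already been extracted from the DaemonSet (for example from the `oc ... -oyaml` output shown earlier). The function name `missing_flags` is hypothetical; the two flag names are the ones from this report.

```python
# The two arguments this bug is about.
REQUIRED_FLAGS = (
    "--collector.netclass.ignored-devices",
    "--collector.netdev.device-exclude",
)


def missing_flags(args) -> list:
    """Return the required flags that do not appear in the argument list."""
    # Compare on the flag name only, ignoring any "=value" suffix.
    present = {arg.split("=", 1)[0] for arg in args}
    return [flag for flag in REQUIRED_FLAGS if flag not in present]


if __name__ == "__main__":
    default_args = ["--no-collector.wifi", "--collector.cpu.info"]
    print(missing_flags(default_args))
```

An empty list means both arguments are present.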

Additional info:


This is a clone of issue OCPBUGS-10846. The following is the description of the original issue:

Description of problem

CI is flaky because the TestClientTLS test fails.

Version-Release number of selected component (if applicable)

I have seen these failures in 4.13 and 4.14 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 16.07% of runs (20.93% of failures) across 56 total runs and 13 jobs (76.79% failed) in 185ms
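
The percentages can be turned back into run counts. The sketch below assumes that all three percentages are relative to the 56 total runs, except the "of failures" figure, which is assumed to be matched runs divided by failed runs; these interpretations are inferred from the numbers, not documented in the report.

```python
def implied_counts(total_runs: int, pct_matched: float, pct_failed: float):
    """Recover whole-run counts from the rounded percentages."""
    matched = round(total_runs * pct_matched / 100)
    failed = round(total_runs * pct_failed / 100)
    return matched, failed


if __name__ == "__main__":
    matched, failed = implied_counts(56, 16.07, 76.79)
    # Share of failed runs in which the TestClientTLS failure was found.
    share_of_failures = round(100 * matched / failed, 2)
    print(matched, failed, share_of_failures)  # 9 43 20.93
```

Under these assumptions, 9 of the 56 runs hit the TestClientTLS failure, out of 43 failed runs, which matches the reported 20.93% of failures.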

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestClientTLS&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails:

=== RUN   TestAll/parallel/TestClientTLS
=== PAUSE TestAll/parallel/TestClientTLS
=== CONT  TestAll/parallel/TestClientTLS
=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [8 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [313 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [313 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:24 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:24 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [802 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:25 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=104beed63d6a19782a5559400bd972b6; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown CA (560):
        { [2 bytes data]
        * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [8 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown (628):
        { [2 bytes data]
        * OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:57:00 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [802 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:57:00 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown CA (560):
        { [2 bytes data]
        * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

=== CONT  TestAll/parallel/TestClientTLS
--- FAIL: TestAll (1538.53s)
    --- FAIL: TestAll/parallel (0.00s)
        --- FAIL: TestAll/parallel/TestClientTLS (123.10s)

Expected results

CI passes, or it fails on a different test.

Additional info

I saw that TestClientTLS failed on the test case with no client certificate and ClientCertificatePolicy set to "Required". My best guess is that the test is racy and is hitting a terminating router pod. The test uses waitForDeploymentComplete to wait until all new pods are available, but perhaps waitForDeploymentComplete should also wait until all old pods are terminated.
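
The stricter wait condition suggested above can be sketched as a plain predicate. This is only an illustration in Python, not the test's actual Go helper: the function name `rollout_settled` and its inputs are hypothetical, and it assumes pod objects shaped like Kubernetes pod JSON (`metadata.deletionTimestamp`, the `pod-template-hash` label, and the `Ready` condition).

```python
from typing import Any, Dict, List


def rollout_settled(pods: List[Dict[str, Any]], current_hash: str) -> bool:
    """Hypothetical check: True only when every listed pod belongs to the
    current ReplicaSet, is Ready, and no pod is still terminating."""
    if not pods:
        return False
    for pod in pods:
        meta = pod.get("metadata", {})
        # A pod with a deletionTimestamp is terminating; it may still accept
        # connections, which is the suspected source of the race.
        if meta.get("deletionTimestamp"):
            return False
        # Pods from an older ReplicaSet carry a different pod-template-hash.
        if meta.get("labels", {}).get("pod-template-hash") != current_hash:
            return False
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            return False
    return True
```

A polling loop would call this predicate on the router pod list until it returns true, so that a terminating pod from the previous rollout blocks the test from proceeding.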

This bug is a backport clone of [Bugzilla Bug 2073220](https://bugzilla.redhat.com/show_bug.cgi?id=2073220). The following is the description of the original bug:

Description of problem:

https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Version-Release number of selected component (if applicable): 4.*

How reproducible: always

Steps to Reproduce:
1. Set audit profile to WriteRequestBodies
2. Wait for the API server rollout to complete
3. tail -f /var/log/kube-apiserver/audit.log | grep routes/status

Actual results:

Write events to routes/status are recorded at the RequestResponse level, which often includes keys and certificates.

Expected results:

Events involving routes should always be recorded at the Metadata level, per the documentation at https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config
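
The grep in the reproduction steps can be tightened into a small filter that reports only the offending events. This is a minimal sketch, assuming the audit log holds one JSON event per line with the standard Kubernetes audit fields `level` and `objectRef.resource`; the function name `overlogged_route_events` is hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List


def overlogged_route_events(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Return audit events that touch routes but were recorded at a level
    other than Metadata (for example Request or RequestResponse)."""
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Skip partial or non-JSON lines rather than aborting the scan.
            continue
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") == "routes" and event.get("level") != "Metadata":
            found.append(event)
    return found
```

Passing the open file handle for /var/log/kube-apiserver/audit.log to this function should return an empty list once route events are recorded at the Metadata level.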

Additional info:

This is a clone of issue OCPBUGS-8293. The following is the description of the original issue:

Bump to kube 1.26.2 to pick up fixes.

Noticed that openshift-apiserver's verify job sometimes fails because hack/update-generated-openapi.sh sometimes produces this diff:

diff --git a/pkg/openapi/zz_generated.openapi.go b/pkg/openapi/zz_generated.openapi.go
index e6ba4c015..dbe226362 100644
--- a/pkg/openapi/zz_generated.openapi.go
+++ b/pkg/openapi/zz_generated.openapi.go
@@ -61886,11 +61886,11 @@ func schema_k8sio_api_core_v1_LimitRangeItem(ref common.ReferenceCallback) commo
                                Properties: map[string]spec.Schema{
                                        "type": {
                                                SchemaProps: spec.SchemaProps{
-                                                       Description: "Type of resource that this limit applies to.\n\nPossible enum values:\n - `\"Container\"` Limit that applies to all containers in a namespace\n - `\"PersistentVolumeClaim\"` Limit that applies to all persistent volume claims in a namespace\n - `\"Pod\"` Limit that applies to all pods in a namespace",
+                                                       Description: "Type of resource that this limit applies to.\n\nPossible enum values:\n - `\"Container\"` Limit that applies to all containers in a namespace\n - `\"PersistentVolumeClaim\"` Limit that applies to all persistent volume claims in a namespace\n - `\"Pod\"` Limit that applies to all pods in a namespace\n - `\"openshift.io/Image\"` Limit that applies to images. Used with a max[\"storage\"] LimitRangeItem to set the maximum size of an image.\n - `\"openshift.io/ImageStream\"` Limit that applies to image streams. Used with a max[resource] LimitRangeItem to set the maximum number of resource. Where the resource is one of \"openshift.io/images\" and \"openshift.io/image-tags\".",
                                                        Default:     "",
                                                        Type:        []string{"string"},
                                                        Format:      "",
-                                                       Enum:        []interface{}{"Container", "PersistentVolumeClaim", "Pod"}},
+                                                       Enum:        []interface{}{"Container", "PersistentVolumeClaim", "Pod", "openshift.io/Image", "openshift.io/ImageStream"}},
                                        },
                                        "max": {
                                                SchemaProps: spec.SchemaProps{ 

This is a clone of issue OCPBUGS-11773. The following is the description of the original issue:

Description of problem:

With a new S3 bucket, the hc (hosted cluster) failed with the following condition:
- lastTransitionTime: "2023-04-13T14:17:11Z"
  message: 'failed to upload /.well-known/openid-configuration to the heli-hypershift-demo-oidc-2
    s3 bucket: aws returned an error: AccessControlListNotSupported'
  observedGeneration: 3
  reason: OIDCConfigurationInvalid
  status: "False"
  type: ValidOIDCConfiguration

Version-Release number of selected component (if applicable):

 

How reproducible:

1. Create an S3 bucket:
$ aws s3api create-bucket --create-bucket-configuration LocationConstraint=us-east-2 --region=us-east-2 --bucket heli-hypershift-demo-oidc-2
{
  "Location": "http://heli-hypershift-demo-oidc-2.s3.amazonaws.com/"
}
$ aws s3api delete-public-access-block --bucket heli-hypershift-demo-oidc-2

2. Install the HO (HyperShift Operator) and create a hc on AWS us-west-2.
3. The hc fails with the condition:
- lastTransitionTime: "2023-04-13T14:17:11Z"
  message: 'failed to upload /.well-known/openid-configuration to the heli-hypershift-demo-oidc-2
    s3 bucket: aws returned an error: AccessControlListNotSupported'
  observedGeneration: 3
  reason: OIDCConfigurationInvalid
  status: "False"
  type: ValidOIDCConfiguration

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

The hc is created successfully.

Additional info:

 

Description of problem:

After enabling FIPS on s390x, the ingress controller repeatedly goes into the degraded state. The ingress controller does reach the running state after a few failures, but it keeps recreating the pod and the operator status shows as degraded.

Version-Release number of selected component (if applicable):

OCP Version: 4.11.0-rc.2

How reproducible:

Enable FIPS: True in image-config file 

Steps to Reproduce:
1. Enable FIPS: True in image-config file before the installation.
2.
3. oc get co

Actual results:

 oc get co

NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

authentication                             4.11.0-rc.2   True        False         False      7h29m   

baremetal                                  4.11.0-rc.2   True        False         False      4d12h   

cloud-controller-manager                   4.11.0-rc.2   True        False         False      4d12h   

cloud-credential                           4.11.0-rc.2   True        False         False      4d12h   

cluster-autoscaler                         4.11.0-rc.2   True        False         False      4d12h   

config-operator                            4.11.0-rc.2   True        False         False      4d12h   

console                                    4.11.0-rc.2   True        False         False      4d11h   

csi-snapshot-controller                    4.11.0-rc.2   True        False         False      4d12h   

dns                                        4.11.0-rc.2   True        False         False      4d12h   

etcd                                       4.11.0-rc.2   True        False         False      4d11h   

image-registry                             4.11.0-rc.2   True        False         False      4d11h   

ingress                                    4.11.0-rc.2   True        False         True       4d11h   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-84689cdc5f-r87hs" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-r87hs": pod router-default-84689cdc5f-r87hs is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe" Pod "router-default-84689cdc5f-8z2fh" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-8z2fh": pod router-default-84689cdc5f-8z2fh is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe" Pod "router-default-84689cdc5f-s7z96" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-s7z96": pod router-default-84689cdc5f-s7z96 is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe" Pod "router-default-84689cdc5f-hslhn" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-hslhn": pod router-default-84689cdc5f-hslhn is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe" Pod "router-default-84689cdc5f-nf9vt" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-nf9vt": pod router-default-84689cdc5f-nf9vt is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe" Pod "router-default-84689cdc5f-mslzf" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": 
Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-mslzf": pod router-default-84689cdc5f-mslzf is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe" Pod "router-default-84689cdc5f-mc8th" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-mc8th": pod router-default-84689cdc5f-mc8th is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe")

insights                                   4.11.0-rc.2   True        False         False      4d12h   

kube-apiserver                             4.11.0-rc.2   True        False         False      4d11h   

kube-controller-manager                    4.11.0-rc.2   True        False         False      4d12h   

kube-scheduler                             4.11.0-rc.2   True        False         False      4d12h   

kube-storage-version-migrator              4.11.0-rc.2   True        False         False      4d11h   

machine-api                                4.11.0-rc.2   True        False         False      4d12h   

machine-approver                           4.11.0-rc.2   True        False         False      4d12h   

machine-config                             4.11.0-rc.2   True        False         False      4d12h   

marketplace                                4.11.0-rc.2   True        False         False      4d12h   

monitoring                                 4.11.0-rc.2   True        False         False      4d11h   

network                                    4.11.0-rc.2   True        False         False      4d12h   

node-tuning                                4.11.0-rc.2   True        False         False      4d11h   

openshift-apiserver                        4.11.0-rc.2   True        False         False      4d11h   

openshift-controller-manager               4.11.0-rc.2   True        False         False      4d12h   

openshift-samples                          4.11.0-rc.2   True        False         False      4d11h   

operator-lifecycle-manager                 4.11.0-rc.2   True        False         False      4d12h   

operator-lifecycle-manager-catalog         4.11.0-rc.2   True        False         False      4d12h   

operator-lifecycle-manager-packageserver   4.11.0-rc.2   True        False         False      4d11h   

service-ca                                 4.11.0-rc.2   True        False         False      4d12h   

storage                                    4.11.0-rc.2   True        False         False      4d12h   

 

Expected results:

oc get co

NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE

authentication                             4.11.0-rc.2   True        False         False      9d      

baremetal                                  4.11.0-rc.2   True        False         False      13d     

cloud-controller-manager                   4.11.0-rc.2   True        False         False      13d     

cloud-credential                           4.11.0-rc.2   True        False         False      13d     

cluster-autoscaler                         4.11.0-rc.2   True        False         False      13d     

config-operator                            4.11.0-rc.2   True        False         False      13d     

console                                    4.11.0-rc.2   True        False         False      13d     

csi-snapshot-controller                    4.11.0-rc.2   True        False         False      13d     

dns                                        4.11.0-rc.2   True        False         False      13d     

etcd                                       4.11.0-rc.2   True        False         False      13d     

image-registry                             4.11.0-rc.2   True        False         False      13d     

ingress                                    4.11.0-rc.2   True        False         False      13d     

insights                                   4.11.0-rc.2   True        False         False      13d     

kube-apiserver                             4.11.0-rc.2   True        False         False      13d     

kube-controller-manager                    4.11.0-rc.2   True        False         False      13d     

kube-scheduler                             4.11.0-rc.2   True        False         False      13d     

kube-storage-version-migrator              4.11.0-rc.2   True        False         False      13d     

machine-api                                4.11.0-rc.2   True        False         False      13d     

machine-approver                           4.11.0-rc.2   True        False         False      13d     

machine-config                             4.11.0-rc.2   True        False         False      13d     

marketplace                                4.11.0-rc.2   True        False         False      13d     

monitoring                                 4.11.0-rc.2   True        False         False      13d     

network                                    4.11.0-rc.2   True        False         False      13d     

node-tuning                                4.11.0-rc.2   True        False         False      13d     

openshift-apiserver                        4.11.0-rc.2   True        False         False      13d     

openshift-controller-manager               4.11.0-rc.2   True        False         False      13d     

openshift-samples                          4.11.0-rc.2   True        False         False      13d     

operator-lifecycle-manager                 4.11.0-rc.2   True        False         False      13d     

operator-lifecycle-manager-catalog         4.11.0-rc.2   True        False         False      13d     

operator-lifecycle-manager-packageserver   4.11.0-rc.2   True        False         False      13d     

service-ca                                 4.11.0-rc.2   True        False         False      13d     

storage                                    4.11.0-rc.2   True        False         False      13d     

 

Additional info:

The logs of the running ingress controller are attached.

Failed ingress controller pods are repeatedly created in the openshift-ingress namespace.

It looks like two ingress controller pods are in the Running state, but the other failed pods were not cleaned up. Manually deleting the failed pods fixed the issue.

 

$ oc get pods -n openshift-ingress | wc -l
451

$ oc get pods -n openshift-ingress | grep Running
router-default-84689cdc5f-9j44t   1/1     Running     4 (4d12h ago)   4d12h
router-default-84689cdc5f-qn4gh   1/1     Running     3 (4d12h ago)   4d12h

$ oc get pods -n openshift-ingress | grep -v Running | wc -l
449

Description of problem:

When performing the `Uninstall Operator` (admins) or `Delete ClusterServiceVersion` (regular users) actions, the instruction text in the `Uninstall Operator?` modal references a checkbox (e.g., `Select the checkbox below to also remove all Operands associated with this Operator.`) that may not be present in the modal if no operands exist *or* the CSV has already been deleted. The latter occurs when performing `Remove Subscription` after a regular user has performed `Delete ClusterServiceVersion` and a cluster admin needs to clean up the orphaned subscription.

Version-Release number of selected component (if applicable):

4.13.0

Steps to Reproduce:

1.  Install the Argo CD operator via OperatorHub in all namespaces
2.  Uninstall the Argo CD operator via the `Uninstall Operator` action.
3.  Note the modal text `Select the checkbox below to also remove all Operands associated with this Operator.` refers to a checkbox that doesn't exist

or

1. Login as a regular user and create a project `test`
2. Login as an administrator and install the Argo CD operator via OperatorHub in namespace `test`
3. Login as a regular user and remove the operator by using the `Delete ClusterServiceVersion` action
4. Login as an administrator and delete the orphaned Argo CD Subscription.
5. Note the modal text `Select the checkbox below to also remove all Operands associated with this Operator.` refers to a checkbox that doesn't exist

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

This bug is a result of the analysis of Jira TRT-735. In all the cases analyzed, the failures were transient, but the MCDPivotError alert was latched for 15m and resulted in test failures.

This search lists all the jobs that have this alert firing: https://search.ci.openshift.org/?search=MCDPivotError.*firing&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Here is a link to the Slack discussion: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1672774860494109

Typically, a pivot error is caused by networking issues that rpm-ostree encounters when performing a txn rebase. Connections to the registry can fail for different reasons, but the end result is that the mcd_pivot_errors_total metric is incremented whenever such an error occurs. Based on the alert definition here: https://github.com/openshift/machine-config-operator/blob/bab235d09cc3b9e6cf7a9b9149817fdb1c5e3649/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L76, we fire the alert whenever such an error occurs, and it lasts 15m. Yet in most of the cases we analyzed, these errors were transient and a retry (within seconds) corrected the problem.

 

Here are a few questions:

  1. If we expect transient errors like this, and a follow-up retry corrects the issue within a minute, should we wait for some time (a minute?) before firing this alert?
  2. Depending on the retry logic, we might need to revise the alert definition. For example, if we expect a constant retry interval (within a minute), we can keep the same definition and just lower the latch from 15m to something much smaller. Since we retry at least once per minute, the value is guaranteed to keep incrementing in a real error condition.
  3. If we are using an exponentially increasing retry interval, we will need something else to trigger the alert. @wking has some suggestions in the Slack thread that might work in this case, but that means adding more metrics to achieve the goal.
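
The first question amounts to a hold-off period before the alert fires. The following is a minimal sketch of that idea, not the actual alert rule: it assumes hypothetical (timestamp, counter) samples taken every 10 seconds, and the names `increased`, `alert_fires`, `lookback` and `hold` are illustrative only. It approximates, rather than reproduces, how Prometheus evaluates rules.

```python
def increased(samples, i, lookback):
    """True if the counter at sample i is higher than at the oldest
    sample that still falls within `lookback` seconds of it."""
    now, value = samples[i]
    for ts, old in samples[: i + 1]:
        if ts >= now - lookback:
            return value > old
    return False


def alert_fires(samples, lookback, hold):
    """Fire only if the counter kept increasing for at least `hold` seconds."""
    pending_since = None
    for i, (ts, _) in enumerate(samples):
        if increased(samples, i, lookback):
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold:
                return True
        else:
            # The condition cleared, so the hold-off timer restarts.
            pending_since = None
    return False


# Hypothetical data: one sample every 10 seconds for 5 minutes.
times = range(0, 301, 10)
# A single error at t=50 that a retry then corrects.
transient = [(t, 0 if t < 50 else 1) for t in times]
# An error that repeats on every retry from t=50 onwards.
persistent = [(t, 0 if t < 50 else (t - 50) // 10 + 1) for t in times]

print(alert_fires(transient, lookback=60, hold=0))
print(alert_fires(transient, lookback=60, hold=60))
print(alert_fires(persistent, lookback=60, hold=60))
```

With no hold-off, the single transient error is enough to fire. With a one-minute hold-off, only the persistent case fires, which matches the constant-retry scenario in the second question.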

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
The project list orders projects by name and is smart enough to keep a "numerical order" like:

  1. test-1
  2. test-2
  3. test-11

The more prominent project dropdown is not so smart and shows just a simple "ascii ordered" list:

  1. test-1
  2. test-11
  3. test-2

Version-Release number of selected component (if applicable):
4.8-4.13 (master)

How reproducible:
Always

Steps to Reproduce:
1. Create some new projects called test-1, test-11, test-2
2. Check the project list page (in admin perspective)
3. Check the project dropdown (in dev perspective)

Actual results:
Order is

  1. test-1
  2. test-11
  3. test-2

Expected results:
Order should be

  1. test-1
  2. test-2
  3. test-11

Additional info:
none
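
The difference between the two orderings can be sketched as follows. This is an illustration only, not the console's implementation; the helper name `natural_key` is hypothetical.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that digit runs
    compare as numbers rather than character by character."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


projects = ["test-1", "test-11", "test-2"]

# Plain string comparison: "test-11" sorts before "test-2".
ascii_order = sorted(projects)

# Numeric-aware comparison: 2 sorts before 11.
natural_order = sorted(projects, key=natural_key)

print(ascii_order)
print(natural_order)
```

Because `re.split` with a capturing group always alternates text and digit chunks, starting with a (possibly empty) text chunk, the keys of any two names compare text with text and numbers with numbers.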

Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/338

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-12165. The following is the description of the original issue:

Description of problem:

While updating a cluster to 4.12.11, which contains the bug fix for [OCPBUGS-7999|https://issues.redhat.com/browse/OCPBUGS-7999] (which is the 4.12.z backport of [OCPBUGS-2783|https://issues.redhat.com/browse/OCPBUGS-2783]), it seems that the older {{{Custom|Default}RouteSync{Degraded|Progressing}}} conditions are not cleaned up as they should be, as per the [OCPBUGS-2783|https://issues.redhat.com/browse/OCPBUGS-2783] resolution, while the newer ones are added.

Due to this, on an upgrade to 4.12.11 (or higher, until this bug is fixed), it is possible to hit a problem very similar to the one that led to [OCPBUGS-2783|https://issues.redhat.com/browse/OCPBUGS-2783] in the first place, but while upgrading to 4.12.11.

So, we need to do a proper cleanup of the older conditions.

Version-Release number of selected component (if applicable):

4.12.11 and higher

How reproducible:

Always, as far as the stale conditions are concerned. It only leads to issues if one of the stale conditions was in an unhealthy state.

Steps to Reproduce:

1. Upgrade
2.
3.

Actual results:

Both the new (and correct) conditions and the older (and wrong) conditions are present.

Expected results:

Only the new (and correct) conditions are present.

Additional info:

The problem seems to be that the stale conditions controller is created[1] with a list containing {{CustomRouteSync}} and {{DefaultRouteSync}}, while that list should contain {{CustomRouteSyncDegraded}}, {{CustomRouteSyncProgressing}}, {{DefaultRouteSyncDegraded}} and {{DefaultRouteSyncProgressing}}. From reading the controller source code, it does not accept prefixes but performs a literal comparison.

[1] - https://github.com/openshift/console-operator/blob/0b54727/pkg/console/starter/starter.go#L403-L404
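
The effect of a literal comparison can be sketched as follows. This is an illustration only: the real controller is Go code, and the function name `remove_stale` and the dictionary layout are assumptions made for the example.

```python
def remove_stale(conditions, stale_types):
    """Drop every condition whose type literally equals a listed name.
    No prefix matching is performed."""
    stale = set(stale_types)
    return [c for c in conditions if c["type"] not in stale]


conditions = [
    {"type": "CustomRouteSyncDegraded", "status": "True"},
    {"type": "CustomRouteSyncProgressing", "status": "False"},
    {"type": "DefaultRouteSyncDegraded", "status": "False"},
    {"type": "DefaultRouteSyncProgressing", "status": "False"},
    # An unrelated condition that must survive the cleanup.
    {"type": "Available", "status": "True"},
]

# The list as described in the report: base names only.
after_prefixes = remove_stale(conditions, ["CustomRouteSync", "DefaultRouteSync"])

# The list with the full condition names.
after_full = remove_stale(conditions, [
    "CustomRouteSyncDegraded",
    "CustomRouteSyncProgressing",
    "DefaultRouteSyncDegraded",
    "DefaultRouteSyncProgressing",
])

print(len(after_prefixes))
print([c["type"] for c in after_full])
```

With the base names nothing matches, so all stale conditions stay in place. With the full names only the unrelated condition remains.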

Description of problem:

I1102 14:25:27.816713       1 job_controller.go:1507] Failed creation, decrementing expectations for job "assisted-installer"/"assisted-installer-controller"
E1102 14:25:27.816729       1 job_controller.go:1512] pods "assisted-installer-controller-vmmw7" is forbidden: violates PodSecurity "restricted:v1.24": host namespaces (hostNetwork=true), allowPrivilegeEscalation != false (container "assisted-installer-controller" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "assisted-installer-controller" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "service-ca-cert-config" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "assisted-installer-controller" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "assisted-installer-controller" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
E1102 14:25:27.816750       1 job_controller.go:545] syncing job: pods "assisted-installer-controller-vmmw7" is forbidden: violates PodSecurity "restricted:v1.24": host namespaces (hostNetwork=true), allowPrivilegeEscalation != false (container "assisted-installer-controller" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "assisted-installer-controller" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "service-ca-cert-config" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "assisted-installer-controller" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "assisted-installer-controller" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
I1102 14:25:27.816806       1 event.go:294] "Event occurred" object="assisted-installer/assisted-installer-controller" fieldPath="" kind="Job" apiVersion="batch/v1" type="Warning" reason="FailedCreate" message="Error creating: pods \"assisted-installer-controller-vmmw7\" is forbidden: violates PodSecurity \"restricted:v1.24\": host namespaces (hostNetwork=true), allowPrivilegeEscalation != false (container \"assisted-installer-controller\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"assisted-installer-controller\" must set securityContext.capabilities.drop=[\"ALL\"]), restricted volume types (volume \"service-ca-cert-config\" uses restricted volume type \"hostPath\"), runAsNonRoot != true (pod or container \"assisted-installer-controller\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"assisted-installer-controller\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")"

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Run the assisted installer (contact [~jacding@redhat.com] for a more detailed description)

Actual results:

assisted-installer-controller job pod fails to be created due to PodSecurity violations

Expected results:

assisted-installer-controller job pod is created

Additional info:

Forked from https://issues.redhat.com/browse/OCPBUGS-2311

Either set the proper securityContext in the job manifest or label the `assisted-installer` ns as privileged.

This is a clone of issue OCPBUGS-7989. The following is the description of the original issue:

Description of problem:

ControlPlaneMachineSet Machines are considered Ready once the underlying MAPI machine is Running.
This should not be a sufficient condition, as the Node linked to that Machine should also be Ready for the overall CPMS Machine to be considered Ready.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always
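
The requested readiness rule can be sketched as follows. This is a minimal illustration, not the ControlPlaneMachineSet operator's code; the `Machine` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    phase: str                  # e.g. "Provisioning", "Running"
    node_ready: Optional[bool]  # None when no Node is linked yet


def is_ready_old(machine):
    """Behaviour described in the report: Running alone is sufficient."""
    return machine.phase == "Running"


def is_ready_new(machine):
    """Requested behaviour: the linked Node must also be Ready."""
    return machine.phase == "Running" and machine.node_ready is True


running_without_node = Machine(phase="Running", node_ready=None)
running_with_ready_node = Machine(phase="Running", node_ready=True)

print(is_ready_old(running_without_node), is_ready_new(running_without_node))
print(is_ready_old(running_with_ready_node), is_ready_new(running_with_ready_node))
```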

Description of problem:

When booting a host with iPXE using a baremetalhost and preprovisioning image, existing DNS settings are not available to the host during the boot process.

Specifically attempting to set the coreos.live.rootfs_url parameter to a URL using a hostname of a route in the hub cluster fails the boot even though that hostname is resolvable in the rest of the network.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-26-111919

How reproducible:

100%

Steps to Reproduce:

1. Create a BMH and preprovisioningimage to boot RHCOS using iPXE
2. Set coreos.live.rootfs_url kernel parameter to a URL on an otherwise resolvable host (in my case a route to an application on the hub cluster)

Actual results:

Host fails to boot with DNS failure messages in the console

Expected results:

Host boots successfully with no additional user configuration.
Specifically I don't want the user to have to re-specify any DNS configuration that is available in the network already.

Additional info:

My particular use case is ZTP with assisted installer, and I ran into this using dev-scripts.
In this situation a controller within the assisted-service is setting up the preprovisioning image parameters and we're serving the rootfs over a route in the hub cluster.

I can provide more information about this specific case if needed, but the issue feels like it applies generally.

Description of problem:

Resource type drop-down menu item 'Last used' is in English

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Navigate to kube:admin -> User Preferences -> Applications
2. Click on the Resource type drop-down

Actual results:

Content is in English

Expected results:

Content should be in target language

Additional info:

Screenshot reference provided

Description of problem:

When running ZTP to install SNO, there is a script called /usr/local/bin/agent-fix-bz1964591 that forces the removal of an assisted service container image if it is available locally.

This causes the precache tool to always pull the container image from a registry even if the image is already available locally.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Set up the precache ZTP flow by downloading the proper artifacts
2. Configure the siteconfig with the proper ZTP configuration for precaching. That includes a couple of ignitionOverrides to modify the discovery ISO.
2.
3.

Actual results:

The assisted installer agent image is always removed and then pulled from network

Expected results:

Since the agent container image is already available locally, there is no need to pull it. The removal of this container image should be optional.

Additional info:

There is a workaround, which is basically to override the script that removes the container image. This is done via ignitionConfigOverrides. Example:

https://gitlab.cee.redhat.com/sysdeseng/5g-ericsson/-/blob/cnf-integration/demos/ztp-policygen/site-configs/clus3a-t5g-lab.yaml#L31

The script ends up being modified and looks like:

data:,#!/usr/bin/sh
# This script is a workaround for bugzilla 1964591 where symlinks inside /var/lib/containers/ get
# corrupted under some circumstances.
#
# In order to let agent.service start correctly we are checking here whether the requested
# container image exists and in case "podman images" returns an error we try removing the faulty
# image.
#
# In such a scenario agent.service will detect the image is not present and pull it again. In case
# the image is present and can be detected correctly, no action is required.
IMAGE=$(echo $1 | sed 's/:.*//')
podman image exists $1 && echo "already loaded" || echo "need to be pulled"
#podman images | grep $IMAGE || podman rmi --force $1 || true

While doing a PerfScale test we noticed that the ovnkube pods are not being spread out evenly among the available workers. Instead they all stack on a few workers until they fill up the available allocatable EBS volumes (25 in the case of the m5 instances that we see here).

An example from partway through our 80 hosted cluster test when there were ~30 hosted clusters created/in progress

There are 24 workers available:

```

$ for i in `oc get nodes -l node-role.kubernetes.io/worker=,node-role.kubernetes.io/infra!=,node-role.kubernetes.io/workload!= | egrep -v "NAME" | awk '{ print $1 }'`;    do  echo $i `oc describe node $i | grep -v openshift | grep ovnkube -c`; done
ip-10-0-129-227.us-west-2.compute.internal 0
ip-10-0-136-22.us-west-2.compute.internal 25
ip-10-0-136-29.us-west-2.compute.internal 0
ip-10-0-147-248.us-west-2.compute.internal 0
ip-10-0-150-147.us-west-2.compute.internal 0
ip-10-0-154-207.us-west-2.compute.internal 0
ip-10-0-156-0.us-west-2.compute.internal 0
ip-10-0-157-1.us-west-2.compute.internal 4
ip-10-0-160-253.us-west-2.compute.internal 0
ip-10-0-161-30.us-west-2.compute.internal 0
ip-10-0-164-98.us-west-2.compute.internal 0
ip-10-0-168-245.us-west-2.compute.internal 0
ip-10-0-170-103.us-west-2.compute.internal 0
ip-10-0-188-169.us-west-2.compute.internal 25
ip-10-0-188-194.us-west-2.compute.internal 0
ip-10-0-191-51.us-west-2.compute.internal 5
ip-10-0-192-10.us-west-2.compute.internal 0
ip-10-0-193-200.us-west-2.compute.internal 0
ip-10-0-193-27.us-west-2.compute.internal 7
ip-10-0-199-1.us-west-2.compute.internal 0
ip-10-0-203-161.us-west-2.compute.internal 0
ip-10-0-204-40.us-west-2.compute.internal 23
ip-10-0-220-164.us-west-2.compute.internal 0
ip-10-0-222-59.us-west-2.compute.internal 0

```

This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster

Description of problem:

We show the UpdateInProgress component (the progress bars) when the cluster update status is Failing, UpdatingAndFailing, or Updating.  The inclusion of the Failing case results in a bug where the progress bars can display when an update is not occurring (see attached screenshot).  

Steps to Reproduce:

1.  Add the following overrides to ClusterVersion config (/k8s/cluster/config.openshift.io~v1~ClusterVersion/version)

spec:
  overrides:
    - group: apps
      kind: Deployment
      name: console-operator
      namespace: openshift-console-operator
      unmanaged: true    
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      name: console-operator
      namespace: ''
      unmanaged: true
2.  Wait for ClusterVersion changes to roll out.
3.  Visit /settings/cluster and note the progress bars are present and displaying 100% but the cluster is not updating

Actual results:

Progress bars are displaying when not updating.

Expected results:

Progress bars should not display when the cluster is not updating.
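
The display condition can be sketched as follows. This is an illustration of the logic only, not the console's code. The status names `Failing`, `UpdatingAndFailing` and `Updating` come from the description above; `UpToDate` and the function names are assumptions made for the example, as is the choice to simply drop `Failing` from the set.

```python
# Statuses for which the progress bars are shown today, per the description.
CURRENT_STATUSES = {"Failing", "UpdatingAndFailing", "Updating"}

# Assumed fix: drop "Failing", which can occur while no update is running.
PROPOSED_STATUSES = {"UpdatingAndFailing", "Updating"}


def shows_progress_current(status):
    return status in CURRENT_STATUSES


def shows_progress_proposed(status):
    return status in PROPOSED_STATUSES


for status in ["Updating", "UpdatingAndFailing", "Failing", "UpToDate"]:
    print(status, shows_progress_current(status), shows_progress_proposed(status))
```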

Currently our message says

"VIP does not belong to the Machine CIDR or is already in use"

and customers don't understand what is going on, as they think the VIP only has to match the CIDR.

The fix returns a different error for each case.
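
Splitting the combined message into one error per case can be sketched as follows. This is a minimal illustration using Python's standard `ipaddress` module, not the assisted-service implementation. The function name, the message wording and the `in_use` set are assumptions; a real check would have to probe the network to learn which addresses are taken.

```python
import ipaddress


def validate_vip(vip, machine_cidr, in_use):
    """Return a specific error message, or None when the VIP is acceptable."""
    address = ipaddress.ip_address(vip)
    network = ipaddress.ip_network(machine_cidr)
    if address not in network:
        return f"VIP {vip} does not belong to the Machine CIDR {machine_cidr}"
    if vip in in_use:
        return f"VIP {vip} is already in use"
    return None


# Hypothetical addresses, for illustration only.
taken = {"192.168.0.10"}
print(validate_vip("192.168.1.5", "192.168.0.0/24", taken))
print(validate_vip("192.168.0.10", "192.168.0.0/24", taken))
print(validate_vip("192.168.0.20", "192.168.0.0/24", taken))
```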

To give users sub-component granularity about why they're getting a critical alert.

We should continue to avoid the cardinality hit of including the full message in the metric, because we don't want to load Prometheus down with that many time-series. For message-level granularity, users still have to follow the oc ... or web-console links from the alert description.

A downside of this approach is that it's possible to have operators with rapidly changing ClusterOperator Available=False reason. But that seems unlikely (it only has to be stable for ~10 minutes before ClusterOperatorDown fires), and we can revisit this approach if it crops up in practice.

Description of problem:

On GCP, a Machine created in a nonexistent zone is stuck with no phase, and is then stuck in Deleting when it is deleted.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-12-05-155739
This can also be reproduced on older versions (checked on 4.9 and 4.11).

How reproducible:

Always

Steps to Reproduce:

1. Create a machineset in a nonexistent zone
Copy a default machineset, change name, and change zone to a nonexistent zone, for example, us-central1-d

liuhuali@Lius-MacBook-Pro huali-test % oc get machineset huliu-gcp413v2-r7dbx-worker-a -o yaml > ms4.yaml 
liuhuali@Lius-MacBook-Pro huali-test % vim ms4.yaml 
liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms4.yaml 
machineset.machine.openshift.io/huliu-gcp413v2-r7dbx-worker-d created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                  PHASE      TYPE            REGION        ZONE            AGE
huliu-gcp413v2-r7dbx-master-1         Running    n2-standard-4   us-central1   us-central1-b   96m
huliu-gcp413v2-r7dbx-master-2         Running    n2-standard-4   us-central1   us-central1-c   96m
huliu-gcp413v2-r7dbx-master-65hbs-0   Running    n2-standard-4   us-central1   us-central1-f   42m
huliu-gcp413v2-r7dbx-master-n468m-1   Deleting                                                 16m
huliu-gcp413v2-r7dbx-worker-a-5hdx8   Running    n2-standard-4   us-central1   us-central1-a   93m
huliu-gcp413v2-r7dbx-worker-b-l6fz7   Running    n2-standard-4   us-central1   us-central1-b   93m
huliu-gcp413v2-r7dbx-worker-c-g5m4k   Running    n2-standard-4   us-central1   us-central1-c   93m
huliu-gcp413v2-r7dbx-worker-d-kx2t4                                                            3s
liuhuali@Lius-MacBook-Pro huali-test % oc get machine                                                    
NAME                                  PHASE      TYPE            REGION        ZONE            AGE
huliu-gcp413v2-r7dbx-master-1         Running    n2-standard-4   us-central1   us-central1-b   105m
huliu-gcp413v2-r7dbx-master-2         Running    n2-standard-4   us-central1   us-central1-c   105m
huliu-gcp413v2-r7dbx-master-65hbs-0   Running    n2-standard-4   us-central1   us-central1-f   51m
huliu-gcp413v2-r7dbx-master-n468m-1   Deleting                                                 25m
huliu-gcp413v2-r7dbx-worker-a-5hdx8   Running    n2-standard-4   us-central1   us-central1-a   102m
huliu-gcp413v2-r7dbx-worker-b-l6fz7   Running    n2-standard-4   us-central1   us-central1-b   102m
huliu-gcp413v2-r7dbx-worker-c-g5m4k   Running    n2-standard-4   us-central1   us-central1-c   102m
huliu-gcp413v2-r7dbx-worker-d-kx2t4                                                            9m5s

2. Delete the machineset
liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset huliu-gcp413v2-r7dbx-worker-d
machineset.machine.openshift.io "huliu-gcp413v2-r7dbx-worker-d" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                  PHASE      TYPE            REGION        ZONE            AGE
huliu-gcp413v2-r7dbx-master-1         Running    n2-standard-4   us-central1   us-central1-b   105m
huliu-gcp413v2-r7dbx-master-2         Running    n2-standard-4   us-central1   us-central1-c   105m
huliu-gcp413v2-r7dbx-master-65hbs-0   Running    n2-standard-4   us-central1   us-central1-f   51m
huliu-gcp413v2-r7dbx-master-n468m-1   Deleting                                                 26m
huliu-gcp413v2-r7dbx-worker-a-5hdx8   Running    n2-standard-4   us-central1   us-central1-a   102m
huliu-gcp413v2-r7dbx-worker-b-l6fz7   Running    n2-standard-4   us-central1   us-central1-b   102m
huliu-gcp413v2-r7dbx-worker-c-g5m4k   Running    n2-standard-4   us-central1   us-central1-c   102m
huliu-gcp413v2-r7dbx-worker-d-kx2t4   Deleting                                                 9m21s
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                  PHASE      TYPE            REGION        ZONE            AGE
huliu-gcp413v2-r7dbx-master-1         Running    n2-standard-4   us-central1   us-central1-b   3h4m
huliu-gcp413v2-r7dbx-master-2         Running    n2-standard-4   us-central1   us-central1-c   3h4m
huliu-gcp413v2-r7dbx-master-65hbs-0   Running    n2-standard-4   us-central1   us-central1-f   130m
huliu-gcp413v2-r7dbx-master-n468m-1   Deleting                                                 105m
huliu-gcp413v2-r7dbx-worker-a-5hdx8   Running    n2-standard-4   us-central1   us-central1-a   3h1m
huliu-gcp413v2-r7dbx-worker-b-l6fz7   Running    n2-standard-4   us-central1   us-central1-b   3h1m
huliu-gcp413v2-r7dbx-worker-c-g5m4k   Running    n2-standard-4   us-central1   us-central1-c   3h1m
huliu-gcp413v2-r7dbx-worker-d-kx2t4   Deleting                                                 88m

Some machine-controller logs:
I1207 07:59:05.395164       1 actuator.go:138] huliu-gcp413v2-r7dbx-worker-d-kx2t4: Deleting machine
E1207 07:59:05.521660       1 actuator.go:53] huliu-gcp413v2-r7dbx-worker-d-kx2t4 error: huliu-gcp413v2-r7dbx-worker-d-kx2t4: reconciler failed to Delete machine: huliu-gcp413v2-r7dbx-worker-d-kx2t4: Machine does not exist
I1207 07:59:05.521708       1 controller.go:422] Actuator returned invalid configuration error: huliu-gcp413v2-r7dbx-worker-d-kx2t4: Machine does not exist
I1207 07:59:05.521714       1 actuator.go:84] huliu-gcp413v2-r7dbx-worker-d-kx2t4: Checking if machine exists
I1207 07:59:05.521849       1 recorder.go:103] events "msg"="huliu-gcp413v2-r7dbx-worker-d-kx2t4: reconciler failed to Delete machine: huliu-gcp413v2-r7dbx-worker-d-kx2t4: Machine does not exist" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-gcp413v2-r7dbx-worker-d-kx2t4","uid":"88a9f385-3350-4ddf-a451-e3603928f5d1","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"66351"} "reason"="FailedDelete" "type"="Warning"
E1207 07:59:05.620961       1 controller.go:262] huliu-gcp413v2-r7dbx-worker-d-kx2t4: failed to check if machine exists: huliu-gcp413v2-r7dbx-worker-d-kx2t4: Machine does not exist
E1207 07:59:05.621040       1 controller.go:326]  "msg"="Reconciler error" "error"="huliu-gcp413v2-r7dbx-worker-d-kx2t4: Machine does not exist" "controller"="machine-controller" "name"="huliu-gcp413v2-r7dbx-worker-d-kx2t4" "namespace"="openshift-machine-api" "object"={"name":"huliu-gcp413v2-r7dbx-worker-d-kx2t4","namespace":"openshift-machine-api"} "reconcileID"="8f8cb8e9-3757-4646-b579-8aa7f0974949"

Actual results:

The Machine is stuck with no phase when created in a nonexistent zone, and stuck in Deleting when deleted.

Expected results:

The Machine should go into the Failed phase when created in a nonexistent zone, and should be deleted successfully when deleted.

Additional info:

Must-gather https://drive.google.com/file/d/1cUarMzvLPQToatAv4OgsjOvuo1udpUHs/view?usp=sharing

This case works as expected on AWS and Azure.

While upgrading OpenShift from 4.11 to 4.12, a temporary error condition is visible in the CLI as well as in the CVO pod log.

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.12.0-ec.3: 703 of 830 done (84% complete), waiting up to 40 minutes on machine-config

Upgradeable=False

  Reason: UnableToGetClusterVersion
  Message: Unable to get ClusterVersion, err=clusterversion.config.openshift.io "version" not found.
 

cvo log:

E0915 21:43:12.739847       1 upgradeable.go:312] Unable to get ClusterVersion, err=clusterversion.config.openshift.io "version" not found.

How reproducible:

3 of 3

Steps to Reproduce:

1. Upgrade from 4.11 to 4.12
2. Observe the oc adm upgrade output, as well as the CVO log

Description of problem:

This PR fails HyperShift CI with:

=== RUN TestAutoscaling/EnsureNoPodsWithTooHighPriority
util.go:411: pod csi-snapshot-controller-7bb4b877b4-q5457 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000
util.go:411: pod csi-snapshot-webhook-644b6dbfb-v4lj7 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000

How reproducible:

always

Steps to Reproduce:

  1. Install HyperShift + create a guest cluster with CSI Snapshot Controller and/or Cluster Storage Operator / AWS EBS CSI driver operator running in the HyperShift managed cluster
  2. Check priorityClass of the guest control plane pods in the hosted cluster.

Alternatively, ci/prow/e2e-aws in https://github.com/openshift/hypershift/pull/1698 and https://github.com/openshift/hypershift/pull/1748 must pass.
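
The failing check can be illustrated with a minimal Python sketch. It only mirrors the comparison shown in the test output above (a pod priority of 2000000000 against a maximum of 100002000); the pod dictionaries and the function name are hypothetical and are not taken from the HyperShift test code.

```python
# Limit quoted in the CI output above; the pod dictionaries are a
# hypothetical, simplified stand-in for real Pod objects.
MAX_ALLOWED_PRIORITY = 100002000


def pods_with_too_high_priority(pods, max_priority=MAX_ALLOWED_PRIORITY):
    """Return the names of pods whose priority exceeds max_priority."""
    return [pod["name"] for pod in pods if pod["priority"] > max_priority]


pods = [
    {"name": "csi-snapshot-controller-7bb4b877b4-q5457", "priority": 2000000000},
    {"name": "example-pod-within-limit", "priority": 100002000},
]
print(pods_with_too_high_priority(pods))
```

A pod whose priority equals the limit passes; only strictly higher values are reported.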

Description of problem:

The subscription URL for any operator redirects to the operator's details page.

Version-Release number of selected component (if applicable):

 

How reproducible:

N/A

Steps to Reproduce:

1. Log in to the console
2. Open the subscription URL of any operator, for example: https://console-openshift-console.apps.hawking-upgrade-closeout.cp.fyre.ibm.com/k8s/ns/cp4i/operators.coreos.com~v1alpha1~ClusterServiceVersion/ibm-apiconnect.v2.5.1/subscription
3. The console opens the details tab of the operator instead

Actual results:

The URL takes a long time to load or shows the details of the operator

Expected results:

Subscription details of operator should be shown

Additional info:

 

Description of problem:

Set the new feature for enabling/disabling firewall rules in GCP as tech preview

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

This is a clone of issue OCPBUGS-2130. The following is the description of the original issue:

Description of problem:

Create a new host and cluster folder, qe-cluster, under the datacenter, and move the cluster workloads into that folder.

$ govc find -type r
/OCP-DC/host/qe-cluster/workloads

Use the install-config.yaml file below to create a single-zone cluster.

apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: 
    vsphere:
      cpus: 4
      memoryMB: 8192
      osDisk:
        diskSizeGB: 60
      zones:
        - us-east-1
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere: 
      cpus: 4
      memoryMB: 16384 
      osDisk:
        diskSizeGB: 60
      zones:
        - us-east-1
  replicas: 3
metadata:
  name: jima-permission
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.19.46.0/24
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    apiVIP: 10.19.46.99
    cluster: qe-cluster/workloads
    datacenter: OCP-DC
    defaultDatastore: my-nfs
    ingressVIP: 10.19.46.98
    network: "VM Network"
    username: administrator@vsphere.local
    password: xxx
    vCenter: xxx
    vcenters:
    - server: xxx
      user: administrator@vsphere.local
      password: xxx
      datacenters:
      - OCP-DC
    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      topology:
        datacenter: OCP-DC
        computeCluster: /OCP-DC/host/qe-cluster/workloads
        networks:
        - "VM Network"
        datastore: my-nfs
      server: xxx
pullSecret: xxx 

The installer fails with the following error:

$ ./openshift-install create cluster --dir ipi5 --log-level debug
DEBUG   Generating Platform Provisioning Check...  
DEBUG   Fetching Common Manifests...               
DEBUG   Reusing previously-fetched Common Manifests 
DEBUG Generating Terraform Variables...            
FATAL failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get vSphere network ID: could not find vSphere cluster at /OCP-DC/host//OCP-DC/host/qe-cluster/workloads: cluster '/OCP-DC/host//OCP-DC/host/qe-cluster/workloads' not found 
 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

always

Steps to Reproduce:

1. Create a new host/cluster folder under the datacenter, and move the vSphere cluster into that folder
2. Prepare an install-config with a zone configuration
3. Deploy the cluster

Actual results:

Cluster creation fails

Expected results:

Cluster creation succeeds
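
The doubled prefix in the error above (`/OCP-DC/host//OCP-DC/host/qe-cluster/workloads`) suggests that the datacenter host prefix is prepended even when `computeCluster` is already an absolute inventory path. The following Python sketch shows one way such a path could be normalized. It is an illustration only: the function name `resolve_cluster_path` is hypothetical, and the assumption that relative names resolve under `/<datacenter>/host/` is inferred from the error message, not from the installer source.

```python
def resolve_cluster_path(datacenter: str, compute_cluster: str) -> str:
    """Return an absolute vSphere inventory path for a compute cluster.

    Hypothetical helper: absolute paths are returned unchanged, relative
    names are resolved under /<datacenter>/host/.
    """
    if compute_cluster.startswith("/"):
        # Already absolute, e.g. /OCP-DC/host/qe-cluster/workloads
        return compute_cluster
    return f"/{datacenter}/host/{compute_cluster}"


# Both spellings used in the install-config above resolve to the same path.
print(resolve_cluster_path("OCP-DC", "qe-cluster/workloads"))
print(resolve_cluster_path("OCP-DC", "/OCP-DC/host/qe-cluster/workloads"))
```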

Additional info:

 

 

 

 

 

This is a clone of issue OCPBUGS-7015. The following is the description of the original issue:

Description of problem:

Creating a vSphere 4.12.2 IPI cluster fails because apiVIP and ingressVIP are not in the machine networks.

# ./openshift-install create cluster --dir=/tmp
? SSH Public Key /root/.ssh/id_rsa.pub
? Platform vsphere
? vCenter vcenter.vmware.gsslab.pnq2.redhat.com
? Username administrator@gsslab.pnq
? Password [? for help] ************
INFO Connecting to vCenter vcenter.vmware.gsslab.pnq2.redhat.com
INFO Defaulting to only available datacenter: OpenShift-DC
INFO Defaulting to only available cluster: OCP
? Default Datastore OCP-PNQ-Datastore
? Network PNQ2-25G-PUBLIC-PG
? Virtual IP Address for API [? for help] 192.168.1.10
X Sorry, your reply was invalid: IP expected to be in one of the machine networks: 10.0.0.0/16
? Virtual IP Address for API [? for help]


Because the user cannot define a CIDR for machineNetwork when creating the cluster or the install-config file interactively, the installer uses the default value 10.0.0.0/16. As a result, creating the cluster or the install-config fails when the apiVIP and ingressVIP entered are outside the default machineNetwork.

The error is thrown from https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L655-L666, which appears to be a new function introduced by PR https://github.com/openshift/installer/pull/5798

The issue should also impact Nutanix platform.

I don't understand why the installer expects and validates VIPs against the 10.0.0.0/16 machine network by default when it does not even ask for the machine networks during the survey. This validation was not mandatory in previous OCP installers.
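
The rejected input can be reproduced with a small Python sketch that uses the standard `ipaddress` module. It only illustrates the membership check implied by the error message ("IP expected to be in one of the machine networks"); the function name is hypothetical and the code is not the installer's validation logic.

```python
import ipaddress
from typing import Iterable

# Default machine network reported in the error message above.
DEFAULT_MACHINE_NETWORK = "10.0.0.0/16"


def vip_in_machine_networks(vip: str, machine_cidrs: Iterable[str]) -> bool:
    """Return True if vip falls inside at least one of the given CIDRs."""
    addr = ipaddress.ip_address(vip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in machine_cidrs)


# The API VIP from the survey is outside the default machine network ...
print(vip_in_machine_networks("192.168.1.10", [DEFAULT_MACHINE_NETWORK]))
# ... but would pass if a matching machine network could be supplied.
print(vip_in_machine_networks("192.168.1.10", ["192.168.1.0/24"]))
```

The second call uses `192.168.1.0/24` purely as an example of a machine network that contains the VIP; the interactive survey offers no way to enter it, which is the problem described above.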


 

Version-Release number of selected component (if applicable):

# ./openshift-install version
./openshift-install 4.12.2
built from commit 7fea1c4fc00312fdf91df361b4ec1a1a12288a97
release image quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. create install-config.yaml file by running command "./openshift-install create install-config --dir ipi"
2. The command fails with the error above

Actual results:

Creating the install-config.yaml file fails

Expected results:

Creating the install-config.yaml file succeeds

Additional info:

The current workaround is to use dummy VIPs from the 10.0.0.0/16 machine network to create the install-config first, and then modify the machine network and VIPs as required. This adds overhead and creates a negative experience.


There was already a bug reported which seems to have only fixed the VIP validation: https://issues.redhat.com/browse/OCPBUGS-881
 

Description of problem:

When used in heads-only mode, oc-mirror does not record the operator bundle's minimum version if a target name is set.

The recorded values ensure that bundles that still exist in the catalog are included as part of the generated catalog and that the associated images are not pruned. Because of this behavior, bundles are pruned when no minimum version is set in the imageset configuration, even though the bundles still exist in the source catalog.

Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202212011938.p0.g8bf1402.assembly.stream-8bf1402", GitCommit:"8bf14023aa018e12425e29993e6f53f0ab07e6ab", GitTreeState:"clean", BuildDate:"2022-12-01T19:56:31Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

Using the advanced cluster management package as an example.

1. Find the latest bundle for acm in the release-2.6 channel with oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.10-1663021232 --package advanced-cluster-management
2. Create an image set configuration to mirror an operator from an older catalog version

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: test
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10-1663021232
      targetName: test
      targetTag: test
      packages:
        - name: advanced-cluster-management
          channels:
            - name: release-2.6


3. Run oc-mirror --config config-with-operators.yaml file://
4. Check the bundle minimum version on the metadata using oc-mirror describe mirror_seq1_000000.tar under the field operators, the advanced-cluster-management should show version found in Step 1.
5. Create another ImageSetConfiguration for a later version of the catalog

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: test
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
      targetName: test
      targetTag: test
      packages:
        - name: advanced-cluster-management
          channels:
            - name: release-2.6

6. Check the bundle minimum version on the metadata using oc-mirror describe mirror_seq2_000000.tar under the operators field.

Actual results:

The catalog entry in the metadata shows packages as null.

Expected results:

It should have the advanced-cluster-management package with the original minimum version or an updated minimum version if the original bundle was pruned.
 

 

Description of problem:

An inconsistency was observed between Project and Namespace. A label is added to the project when its value is an empty string. However, labeling or annotating a Namespace or Project with a non-empty string value does not work.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a project : `$ oc create namespace testproject`
2. Assign admin access to user : `$ oc adm policy add-role-to-user admin testuser -n testproject`
3. Login with the user : `$ oc login -u testuser`
4. Label ns : `$ oc label ns testproject test1=label1`
5. Annotate ns : `$ oc annotate ns testproject openshift.io/node-selector=test2=test2`
6. Label project : `$ oc label project testproject test1=label1` 
7. Annotate project : `$ oc annotate project testproject openshift.io/node-selector=test2=test2`
8. Label project (this specific label worked): `$ oc label project testproject policy-group.network.openshift.io/ingress=""`

Actual results:

When a new label with a non-empty string value is added to a project, a validation error occurs. When a new label with an empty string value is added to a project, the project is labeled:
~~~
$ oc label project testproject policy-group.network.openshift.io/ingress=""
project.project.openshift.io/testproject labeled
~~~

Expected results:

 

Additional info:

The project Admin does not have access to modify the project/namespace resource itself.

Label ns : `$ oc label ns testproject test1=label1`
~~~
Error from server (Forbidden): namespaces "testproject" is forbidden: User "testuser" cannot patch resource "namespaces" in API group "" in the namespace "testproject"
~~~
Annotate ns : `$ oc annotate ns testproject openshift.io/node-selector=test2=test2`
~~~
Error from server (Forbidden): namespaces "testproject" is forbidden: User "testuser" cannot patch resource "namespaces" in API group "" in the namespace "testproject"
~~~

Label project : `$ oc label project testproject test1=label1`
~~~
The Project "testproject" is invalid: metadata.labels[test1]: Invalid value: "label1": field is immutable, , try updating the namespace
~~~ 
Annotate project : `$ oc annotate project testproject openshift.io/node-selector=test2=test2`
~~~   
The Project "testproject" is invalid: metadata.annotations[openshift.io/node-selector]: Invalid value: "test2=test2": field is immutable, try updating the namespace 
~~~

However, when tried with an empty string, it worked:
~~~
$ oc label project testproject policy-group.network.openshift.io/ingress=""
project.project.openshift.io/testproject labeled
~~~

 

This is a clone of issue OCPBUGS-11550. The following is the description of the original issue:

Description of problem:

`cluster-reader` ClusterRole should have ["get", "list", "watch"] permissions for a number of privileged CRs, but lacks them for the API Group "k8s.ovn.org", which includes CRs such as EgressFirewalls, EgressIPs, etc.

Version-Release number of selected component (if applicable):

OCP 4.10 - 4.12 OVN

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster with OVN components, e.g. EgressFirewall
2. Check permissions of ClusterRole `cluster-reader`

Actual results:

No permissions for OVN resources 

Expected results:

Get, list, and watch verb permissions for OVN resources

Additional info:

Looks like a similar bug was opened for "network-attachment-definitions" in OCPBUGS-6959 (whose closure is being contested).

Description of problem:

The Azure provider sets the InstanceCreated condition in ProviderStatus to True instead of False when machine provisioning fails.

How reproducible:

Always

Steps to Reproduce:

1. Run cluster on Azure
2. Create a machine that will fail to provision (e.g. use invalid VMSize)

Actual results:

Machine has Machine.Status.ProviderStatus.Conditions InstanceCreated condition status set to True

Expected results:

Machine has Machine.Status.ProviderStatus.Conditions InstanceCreated condition status set to False

Additional info:

 

An upstream partial fix to logging means that the BMO log now contains a mixture of structured and unstructured logs, making it impossible to read with the structured log parsing tool (bmo-log-parse) we use for debugging customer issues.
This is fixed upstream by https://github.com/metal3-io/baremetal-operator/pull/1249, which will get picked up automatically in 4.14 but which needs to be backported to 4.13.

Description of problem:

Default values for the Scaling fields are not set in the Create Serverless function form

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Install Red Hat Serverless operator and create knative Eventing and Serving CRs

Steps to Reproduce:

1. Navigate to Create Serverless function form
2. Enter Git URL https://github.com/vikram-raj/hello-func-node
3. Open the Advanced option Scaling

Actual results:

All field values are set to 0

Expected results:

The field values should not all be set to 0

Additional info:

 

Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390

When creating a Knative Service and deleting it again with the option "Delete other resources created by console" enabled (only available on 4.13+ with the PR above), the secret "$name-github-webhook-secret" is not deleted.

When the user tries to create the same Knative Service again this fails with an error:

An error occurred
secrets "nodeinfo-github-webhook-secret" already exists

Version-Release number of selected component (if applicable):
4.13

(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5548)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Serverless operator (tested with 1.26.0)
  2. Create a new project
  3. Navigate to Add > Import from git and create an application
  4. In the topology select the Knative Service > "Delete Service" (not Delete App)

Actual results:
Deleted resources:

  1. Knative Service (tries it twice!) $name
  2. ImageStream $name
  3. BuildConfig $name
  4. Secret $name-generic-webhook-secret

Expected results:
Should also remove this resource

  1. Delete Knative Service should be called just once
  2. Secret $name-github-webhook-secret

Additional info:
When deleting the whole application, all the resources are deleted correctly (and just once)!

  1. Knative Service (just once!) $name
  2. ImageStream $name
  3. BuildConfig $name
  4. Secret $name-generic-webhook-secret
  5. Secret $name-github-webhook-secret
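
The expected set of deletions can be written down as a short Python sketch. It only restates the list above: the resource kinds and name suffixes come from this report, while the function name and the tuple representation are hypothetical and do not reflect the console's implementation.

```python
def resources_to_delete(name):
    """Return the (kind, name) pairs that deleting a Knative Service
    with "Delete other resources created by console" should remove.

    Each resource appears exactly once, so no delete call is repeated.
    """
    return [
        ("Knative Service", name),
        ("ImageStream", name),
        ("BuildConfig", name),
        ("Secret", f"{name}-generic-webhook-secret"),
        ("Secret", f"{name}-github-webhook-secret"),
    ]


for kind, resource_name in resources_to_delete("nodeinfo"):
    print(kind, resource_name)
```

With `nodeinfo` as the service name, the last entry is the `nodeinfo-github-webhook-secret` that the error message above reports as already existing.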

Please review the following PR: https://github.com/openshift/machine-config-operator/pull/3450

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-12732. The following is the description of the original issue:

Description of problem:

The Create BuildConfig button on the Dev console Builds page opens the form view, but in the default namespace

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to the Dev perspective
2. Click on Builds
3. Click on "Create BuildConfig"

Actual results:

"default" namespace is selected in the namespace selector

Expected results:

It should open the form in the active namespace

Additional info:

 

Version:
$ openshift-install version
openshift-install 4.10.0-0.nightly-2021-12-23-153012
built from commit 94a3ed9cbe4db66dc50dab8b85d2abf40fb56426
release image registry.ci.openshift.org/ocp/release@sha256:39cacdae6214efce10005054fb492f02d26b59fe9d23686dc17ec8a42f428534
release architecture amd64

Platform: alibabacloud

Please specify:

  • IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?
Unexpected error of 'Internal publish strategy is not supported on "alibabacloud" platform', because Internal publish strategy should be supported for "alibabacloud", please clarify otherwise, thanks!

$ openshift-install create install-config --dir work
? SSH Public Key /home/jiwei/.ssh/openshift-qe.pub
? Platform alibabacloud
? Region us-east-1
? Base Domain alicloud-qe.devcluster.openshift.com
? Cluster Name jiwei-uu
? Pull Secret [? for help] *********
INFO Install-Config created in: work
$
$ vim work/install-config.yaml
$ yq e '.publish' work/install-config.yaml
Internal
$ openshift-install create cluster --dir work --log-level info
FATAL failed to fetch Metadata: failed to load asset "Install Config": invalid "install-config.yaml" file: publish: Invalid value: "Internal": Internal publish strategy is not supported on "alibabacloud" platform
$

What did you expect to happen?
"publish: Internal" should be supported for platform "alibabacloud".

How to reproduce it (as minimally and precisely as possible)?
Always

Description of problem:

The Delete task icon is not aligned properly on the Pipeline builder page

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Navigate to the Pipeline builder page.
2. Click on the Add task button and add a task
3. Hover over the task and click on plus (+)
4. Hover over the Add task button and see the delete button in the corner

Actual results:

The icon in the Delete task button is not aligned properly

Expected results:

The icon should be aligned properly

Additional info:

 

Description of problem: I am working with a customer who uses the web console.  From the Developer Perspective's Project Access tab, they cannot differentiate between users and groups and furthermore cannot add groups from this web console.  This has led to confusion whether existing resources were in fact users or groups, and furthermore they have added users when they intended to add groups instead.  What we really need is a third column in the Project Access tab that says whether a resource is a user or group.

 

Version-Release number of selected component (if applicable): This is an issue in OCP 4.10 and 4.11, and I presume future versions as well

How reproducible: Every time.  My customer is running on ROSA, but I have determined this issue to be general to OpenShift.

Steps to Reproduce:

From the oc cli, I create a group and add a user to it.

$ oc adm groups new techlead
group.user.openshift.io/techlead created
$ oc adm groups add-users techlead admin
group.user.openshift.io/techlead added: "admin"
$ oc get groups
NAME                                     USERS
cluster-admins                           
dedicated-admins                         admin
techlead                                 admin

I create a new namespace so that I can assign a group project level access:

$ oc new-project my-namespace

$ oc adm policy add-role-to-group edit techlead -n my-namespace

I then went to the web console -> Developer perspective -> Project -> Project Access.  I verified the rolebinding named 'edit' is bound to a group named 'techlead'.

$ oc get rolebinding
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      15m
admin-dedicated-admins                                            ClusterRole/admin                      15m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      15m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   15m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   15m
edit                                                              ClusterRole/edit                       2m18s
system:deployers                                                  ClusterRole/system:deployer            15m
system:image-builders                                             ClusterRole/system:image-builder       15m
system:image-pullers                                              ClusterRole/system:image-puller        15m

$ oc get rolebinding edit -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:16:56Z"
  name: edit
  namespace: my-namespace
  resourceVersion: "108357"
  uid: 4abca27d-08e8-43a3-b9d3-d20d5c294bbe
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: techlead

Now, from the same Project Access tab in the web console, I added the developer with role "View".  From this web console, it is unclear whether developer and techlead are users or groups.

Now back to the CLI, I view the newly created rolebinding named 'developer-view-c15b720facbc8deb', and find that the "View" role is assigned to a user named 'developer', rather than a group.

$ oc get rolebinding                                                                      
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      17m
admin-dedicated-admins                                            ClusterRole/admin                      17m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      17m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   17m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   17m
edit                                                              ClusterRole/edit                       4m25s
developer-view-c15b720facbc8deb                                   ClusterRole/view                       90s
system:deployers                                                  ClusterRole/system:deployer            17m
system:image-builders                                             ClusterRole/system:image-builder       17m
system:image-pullers                                              ClusterRole/system:image-puller        17m

$ oc get rolebinding developer-view-c15b720facbc8deb -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:19:51Z"
  name: developer-view-c15b720facbc8deb
  namespace: my-namespace
  resourceVersion: "113298"
  uid: cc2d1b37-922b-4e9b-8e96-bf5e1fa77779
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: developer

So in conclusion, from the Project Access tab, we're unable to add groups and unable to differentiate between users and groups.  This is in essence our ask for this RFE.

 

Actual results:

Developer perspective -> Project -> Project Access tab shows a list of resources which can be users or groups, but does not differentiate between them.  Furthermore, when we add resources, they are only users and there is no way to add a group from this tab in the web console.

 

Expected results:

Should have the ability to add groups and differentiate between users and groups.  Ideally, we're looking at a third column for user or group.

 

Additional info:

Description of problem:

Prometheus fails to scrape metrics from the storage operator after some time.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Install storage operator.
2. Wait for 24h (time for the certificate to be recycled).
3.

Actual results:

Targets are down because Prometheus didn't reload the CA certificate.

Expected results:

Prometheus reloads its client TLS certificate and scrapes the target successfully.

Additional info:


Description of problem:

Log in to the administrator console UI, create a silence for the Watchdog alert, set the "Until" time to a time in the past, and click the "Silence" button. The following error is shown:

An error occurred
Bad Request

The error message is not clear. Adding the response to the error message would make it clearer:

"Failed to create silence: start time must be before end time"
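
As an illustration of the requested behavior, the following Python sketch validates the silence window and surfaces the underlying reason in the message instead of a bare "Bad Request". The message text is taken from this report; the function names are hypothetical and the code does not represent the console or Alertmanager implementation.

```python
from datetime import datetime, timedelta, timezone


def validate_silence_window(starts_at, ends_at):
    """Raise ValueError when the silence would not end after it starts."""
    if starts_at >= ends_at:
        raise ValueError("start time must be before end time")


def create_silence_error(starts_at, ends_at):
    """Return a user-facing error string, or None when the window is valid."""
    try:
        validate_silence_window(starts_at, ends_at)
    except ValueError as err:
        # Include the underlying reason rather than a generic message.
        return f"Failed to create silence: {err}"
    return None


now = datetime.now(timezone.utc)
# An "Until" time one hour in the past reproduces the scenario above.
print(create_silence_error(now, now - timedelta(hours=1)))
```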

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-08-002816

How reproducible:

always

Steps to Reproduce:

1. Create a silence for the Watchdog alert, set the "Until" time to a time in the past, and click the "Silence" button
2.
3.

Actual results:

The error message is not clear

Expected results:

The response is added to the error message to make it clearer

Additional info:

 

Description of problem:

The calls to log.Debugf() from image/baseiso.go and image/oc.go are not being output when the "image create" command is run.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Every time

Steps to Reproduce:

1. Run ../bin/openshift-install agent create image --dir ./cluster-manifests/ --log-level debug

Actual results:

No debug log messages from log.Debugf() calls in pkg/asset/agent/image/oc.go

Expected results:

Debug log messages are output

Additional info:

Note from Zane: We should probably also use the real global logger instead of [creating a new one](https://github.com/openshift/installer/blob/2698cbb0ec7e96433a958ab6b864786c0c503c0b/pkg/asset/agent/image/baseiso.go#L109) with the default config that ignores the --log-level flag and prints weird `[0001]` stuff in the output for some reason. (The NMStateConfig manifests logging suffers from the same problem.)

 

 

 

Description of problem:
Pipeline Repository (Pipeline-as-code) list never shows an Event type.

Version-Release number of selected component (if applicable):
4.9+

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines Operator and setup a Pipeline-as-code repository
  2. Trigger an event and a build

Actual results:
Pipeline Repository list shows a column Event type but no value.

Expected results:
Pipeline Repository list should show the Event type from the matching Pipeline Run.

Similar to the Pipeline Run Details page based on the label.

Additional info:
The list page packages/pipelines-plugin/src/components/repository/list-page/RepositoryRow.tsx renders obj.metadata.namespace as event type.

I believe we should show the Pipeline Run event type instead. packages/pipelines-plugin/src/components/repository/RepositoryLinkList.tsx uses

{plrLabels[RepositoryLabels[RepositoryFields.EVENT_TYPE]]}

to render it.

Also, the Pipeline Repository details page tries to render the Branch and Event type from the Repository resource. My research says these properties don't exist on the Repository resource. The code should be removed from the Repository details page.

See details from https://github.com/openshift/console/pull/12223:

subjectName can contain special characters like # which is a reserved character in a URI. This change URI encodes this parameter to avoid creating a broken link.

Before this change it was possible to create a group with a # in the name and click "Create binding" from the "RoleBindings" tab to create a broken view where the data is not correctly pre-filled.
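
The effect of the reserved `#` character can be demonstrated with Python's standard `urllib.parse` module. This is only a sketch of the general technique (percent-encoding a query parameter); the path `/rolebindings/new` and the function name are hypothetical and are not the console's actual route or code.

```python
from urllib.parse import parse_qs, quote, urlsplit


def create_binding_url(subject_kind, subject_name):
    """Build a link with percent-encoded query parameters.

    The path is a hypothetical placeholder used for illustration.
    """
    return "/rolebindings/new?subjectKind={}&subjectName={}".format(
        quote(subject_kind, safe=""), quote(subject_name, safe="")
    )


# Without encoding, everything after '#' is parsed as a URL fragment.
broken = urlsplit("/rolebindings/new?subjectKind=Group&subjectName=dev#ops")
print(parse_qs(broken.query)["subjectName"], broken.fragment)

# With encoding, '#' becomes %23 and the full name survives the round trip.
fixed = urlsplit(create_binding_url("Group", "dev#ops"))
print(parse_qs(fixed.query)["subjectName"], repr(fixed.fragment))
```

In the unencoded link the group name is truncated to `dev`, which matches the incorrectly pre-filled form described above.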

Description of problem:

When upgrading an IPI AWS cluster that included MachineSet and BYOH Windows nodes from 4.11 to 4.12, the upgrade hung while trying to upgrade the machine-api component:

$ oc get clusterversion                                                                              
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS                                      
version   4.11.0-0.nightly-2022-12-16-190443   True        True          117m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api

$ oc get co                                                                                                                                                                                                                              
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE                                                                                                                                   
authentication                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h47m   
baremetal                                  4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
cloud-controller-manager                   4.12.0-rc.5                          True        False         False      5h3m    
cloud-credential                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h4m                                                                                                                                              
cluster-autoscaler                         4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
config-operator                            4.12.0-rc.5                          True        False         False      5h1m    
console                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h43m   
csi-snapshot-controller                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      
dns                                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
etcd                                       4.12.0-rc.5                          True        False         False      4h58m         
image-registry                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h54m         
ingress                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m   
insights                                   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
kube-apiserver                             4.12.0-rc.5                          True        False         False      4h50m         
kube-controller-manager                    4.12.0-rc.5                          True        False         False      4h57m                                                                                                                                             
kube-scheduler                             4.12.0-rc.5                          True        False         False      4h57m
kube-storage-version-migrator              4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
machine-api                                4.11.0-0.nightly-2022-12-16-190443   True        True          False      4h56m   Progressing towards operator: 4.12.0-rc.5
machine-approver                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
machine-config                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
marketplace                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
monitoring                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m                                                                                                                                             
network                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h3m          
node-tuning                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m                                                                                                                                             
openshift-apiserver                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
openshift-controller-manager               4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h56m                                                                                                                                             
openshift-samples                          4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
operator-lifecycle-manager                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
operator-lifecycle-manager-catalog         4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
operator-lifecycle-manager-packageserver   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
service-ca                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
storage                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      
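
To spot which operators have not yet reached the target version in a listing like the one above, the table can be scanned programmatically. The following is a rough sketch that assumes the plain-text column layout shown by `oc get co`; the function name and the trimmed sample are illustrative only.

```python
def lagging_operators(table: str, target: str) -> list[str]:
    """Return the names of operators whose VERSION column differs from target."""
    lagging = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] != target:
            lagging.append(fields[0])
    return lagging


# Two rows taken from the listing above, trimmed for brevity.
sample = """\
NAME          VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd          4.12.0-rc.5                          True        False         False      4h58m
machine-api   4.11.0-0.nightly-2022-12-16-190443   True        True          False      4h56m   Progressing towards operator: 4.12.0-rc.5
"""

print(lagging_operators(sample, "4.12.0-rc.5"))
```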

Digging deeper into which component was hanging, we observed that the machine-api-termination-handler pods running on the Machine Windows workers were in ImagePullBackOff state:

$ oc get pods -n openshift-machine-api                                                                                                                                                                                                   
NAME                                           READY   STATUS             RESTARTS   AGE                                                                                                                                                                               
cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h5m                                                                                                                                                                              
cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h5m                                          
machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          94m                                                                                                                                                                               
machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          97m                                           
machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
machine-api-termination-handler-gj4pf          1/1     Running            0          4h57m                                                                                                                                                                             
machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
machine-api-termination-handler-l95x2          1/1     Running            0          4h54m                                                                                                                                                                             
machine-api-termination-handler-p6sw6          1/1     Running            0          4h57m   

$ oc describe pods machine-api-termination-handler-fcfq2 -n openshift-machine-api                                                                                                                                                        
Name:                 machine-api-termination-handler-fcfq2
Namespace:            openshift-machine-api
Priority:             2000001000
Priority Class Name:  system-node-critical
.....................................................................
Events:
  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               94m                    default-scheduler  Successfully assigned openshift-machine-api/machine-api-termination-handler-fcfq2 to ip-10-0-145-114.us-east-2.compute.internal
  Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7b80f84cc547310f5370a7dde7c651ca661dd40ebd0730296329d1cbe8981b37": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
  Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6b3e020a419dde8359a31b56129c65821011e232467d712f9f5081f32fe380c9": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
  Normal   Pulling                 93m (x4 over 94m)      kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
  Warning  Failed                  93m (x4 over 94m)      kubelet            Error: ErrImagePull
  Normal   BackOff                 4m39s (x393 over 94m)  kubelet            Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"


$ oc get pods -n openshift-machine-api -o wide
NAME                                           READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h8m    10.130.0.10    ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h8m    10.130.0.8     ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          97m     10.128.0.144   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          100m    10.128.0.143   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          97m     10.129.0.7     ip-10-0-145-114.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-gj4pf          1/1     Running            0          5h      10.0.223.37    ip-10-0-223-37.us-east-2.compute.internal    <none>           <none>
machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          97m     10.128.0.4     ip-10-0-143-111.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-l95x2          1/1     Running            0          4h57m   10.0.172.211   ip-10-0-172-211.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-p6sw6          1/1     Running            0          5h      10.0.146.227   ip-10-0-146-227.us-east-2.compute.internal   <none>           <none>
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
ip-10-0-143-111.us-east-2.compute.internal   Ready    worker   4h24m   v1.24.0-2566+5157800f2a3bc3   10.0.143.111   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
ip-10-0-145-114.us-east-2.compute.internal   Ready    worker   4h18m   v1.24.0-2566+5157800f2a3bc3   10.0.145.114   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-v57sh   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-145-114.us-east-2.compute.internal   aws:///us-east-2a/i-0b69d52c625c46a6a   running
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-j6gkc   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-143-111.us-east-2.compute.internal   aws:///us-east-2a/i-05e422c0051707d16   running

This blocks the whole upgrade process, because the upgrade cannot progress past this component.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-12-16-190443   True        True          141m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
$ oc version
Client Version: 4.11.0-0.ci-2022-06-09-065118
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-2022-12-16-190443
Kubernetes Version: v1.25.4+77bec7a

How reproducible:

Always

Steps to Reproduce:

1. Deploy a 4.11 IPI AWS cluster with Windows workers using a MachineSet
2. Perform the upgrade to 4.12
3. Wait for the upgrade to hang on the machine-api component

Actual results:

The upgrade hangs when upgrading the machine-api component.

Expected results:

The upgrade succeeds.

Additional info:


This is a clone of issue OCPBUGS-2633. The following is the description of the original issue:

Description of problem:

The operator has several versions and channels, but they may all use the same 'latest' tag. Mirroring them as `additionalImages` produces the error below:

[root@ip-172-31-249-209 jian]# oc-mirror --config mirror.yaml file:///root/jian/test/
...
...
sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1 file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest
sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
info: Mirroring completed in 22.48s (125.8MB/s)
error: one or more errors occurred while uploading images
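
The error output suggests that several digest-pinned references from the same repository are all written under one `latest` tag. The sketch below groups `additionalImages` entries by repository to show which ones would contend for a single tag. It is an illustration inferred from the error messages, not oc-mirror's actual logic, and the function name is made up for this example.

```python
from collections import defaultdict


def repos_with_multiple_digests(images: list[str]) -> dict[str, list[str]]:
    """Group digest-pinned image references by repository.

    Returns only the repositories referenced by more than one digest,
    i.e. the ones where a single shared tag would collide.
    """
    by_repo: dict[str, list[str]] = defaultdict(list)
    for ref in images:
        repo, sep, digest = ref.partition("@")
        if sep:  # ignore references that are not pinned by digest
            by_repo[repo].append(digest)
    return {repo: digests for repo, digests in by_repo.items() if len(digests) > 1}


# Shortened placeholder digests, for illustration only.
images = [
    "brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:aaa",
    "brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:bbb",
    "brew.registry.redhat.io/openshift4/ose-descheduler@sha256:ccc",
]
for repo, digests in repos_with_multiple_digests(images).items():
    print(repo, len(digests))
```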

Version-Release number of selected component (if applicable):

[root@ip-172-31-249-209 jian]# oc-mirror version
Client Version: version.Info{Major:"0", Minor:"1", GitVersion:"v0.1.0", GitCommit:"6ead1890b7a21b6586b9d8253b6daf963717d6c3", GitTreeState:"clean", BuildDate:"2022-08-25T05:27:39Z", GoVersion:"go1.17.12", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1. use the below config:
[cloud-user@preserve-olm-env2 mirror-tmp]$ cat mirror.yaml
apiVersion: mirror.openshift.io/v1alpha1
kind: ImageSetConfiguration
# archiveSize: 4
mirror:
  additionalImages:
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:46a62d73aeebfb72ccc1743fc296b74bf2d1f80ec9ff9771e655b8aa9874c933
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:9e549c09edc1793bef26f2513e72e589ce8f63a73e1f60051e8a0ae3d278f394
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:c16891ee9afeb3fcc61af8b2802e56605fff86a505e62c64717c43ed116fd65e
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:5c37bd168645f3d162cb530c08f4c9610919d4dada2f22108a24ecdea4911d60
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:89a6abbf10908e9805d8946ad78b98a13a865cefd185d622df02a8f31900c4c1
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:de5b339478e8e1fc3bfd6d0b6784d91f0d3fbe0a133354be9e9d65f3d7906c2d
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:fdf774c4365bde48d575913d63ef3db00c9b4dda5c89204029b0840e6dc410b1
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:15cc75164335fa178c80db4212d11e4a793f53d2b110c03514ce4c79a3717ca0
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:9e66db3a282ee442e71246787eb24c218286eeade7bce4d1149b72288d3878ad
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:546b14c1f3fb02b1a41ca9675ac57033f2b01988b8c65ef3605bcc7d2645be60
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:12d7061012fd823b57d7af866a06bb0b1e6c69ec8d45c934e238aebe3d4b68a5
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:41025e3e3b72f94a3290532bdd6cabace7323c3086a9ce434774162b4b1dd601
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:92542b22911fbd141fadc53c9737ddc5e630726b9b53c477f4dfe71b9767961f
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:1feb7073dec9341cadcc892df39ae45c427647fb034cf09dce1b7aa120bbb459
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:7ca05f93351959c0be07ec3af84ffe6bb5e1acea524df210b83dd0945372d432
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:c0fe8830f8fdcbe8e6d69b90f106d11086c67248fa484a013d410266327a4aed
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:b386d0e1c9e12e9a3a07aa101257c6735075b8345a2530d60cf96ff970d3d21a


2. Run the following command:
$ oc-mirror --config mirror.yaml file:///root/jian/test/  

Actual results:

error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists

Expected results:

No error

Additional info:

 

Description

As a user, I would like to see the type of technology used by the samples on the samples view similar to the all services view. 

On the samples view:

The view shows different types of samples (e.g. devfile, helm), all of them displayed as .NET. It is difficult for the user to decide which .NET entry to select from the list. We need something like the all services view, which shows the type of technology at the top right of each card so that users can differentiate between the entries.

Acceptance Criteria

  1. Add a visible label to each card on the samples view, as in the all services view, to show the technology used by the sample.

Additional Details:

Description of problem:

Was "OPECO-2646: exclude bundles with olm.deprecated property when rendering" not backported to 4.13?

https://github.com/openshift/operator-framework-olm/pull/463

Version-Release number of selected component (if applicable):

https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/candidate-4.13/opm-linux.tar.gz

Steps to Reproduce:

1.curl -s -k -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/candidate-4.13/opm-linux.tar.gz -o opm-linux.tar.gz

2.[cloud-user@preserve-gvm-130 ~]$ tar zxvf opm-linux.tar.gz 
opm

3.[cloud-user@preserve-gvm-130 ~]$ ./opm render quay.io/olmqe/catalogtest-index:v4.12depre | grep olm.deprecate
WARN[0001] DEPRECATION NOTICE:
Sqlite-based catalogs and their related subcommands are deprecated. Support for
them will be removed in a future release. Please migrate your catalog workflows
to the new file-based catalog format. 
            "type": "olm.deprecated",


Actual results:

"type": "olm.deprecated",

Expected results:

nothing
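
The expected behaviour can be pictured as a filter over the rendered catalog entries. This is a sketch in Python that assumes the file-based catalog shape, where each `olm.bundle` entry carries a `properties` list of `type`/`value` pairs. It is not opm's implementation, and the helper name is hypothetical.

```python
def drop_deprecated_bundles(entries: list[dict]) -> list[dict]:
    """Return catalog entries without bundles that carry an olm.deprecated property."""

    def is_deprecated(entry: dict) -> bool:
        return any(
            prop.get("type") == "olm.deprecated"
            for prop in entry.get("properties", [])
        )

    return [
        entry
        for entry in entries
        if not (entry.get("schema") == "olm.bundle" and is_deprecated(entry))
    ]


# Made-up entries for illustration.
entries = [
    {"schema": "olm.package", "name": "example"},
    {"schema": "olm.bundle", "name": "example.v1.0.0",
     "properties": [{"type": "olm.deprecated", "value": {}}]},
    {"schema": "olm.bundle", "name": "example.v1.1.0",
     "properties": [{"type": "olm.package", "value": {"packageName": "example"}}]},
]
print([entry["name"] for entry in drop_deprecated_bundles(entries)])
```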

This is a clone of issue OCPBUGS-11284. The following is the description of the original issue:

Description of problem:

When we rebased to 1.26, the rebase picked up https://github.com/kubernetes-sigs/cloud-provider-azure/pull/2653/ which made the Azure cloud node manager stop applying beta topology labels, such as failure-domain.beta.kubernetes.io/zone.

Since we haven't completed the removal cycle for this, we still need the node manager to apply these labels. In the future we must ensure that these labels are available until users are no longer using them.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create a TP cluster on 4.13
2. Observe no beta label for zone or region
3.

Actual results:

Beta labels are not present

Expected results:

Beta labels are present and should match GA labels
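
A check for this expectation could look like the sketch below. It takes a node's labels as a plain dictionary and reports beta topology labels that are missing or that differ from their GA counterparts. The label keys are the well-known Kubernetes ones; the function name and sample data are illustrative.

```python
# Beta topology label -> GA label it is expected to mirror.
LABEL_PAIRS = {
    "failure-domain.beta.kubernetes.io/zone": "topology.kubernetes.io/zone",
    "failure-domain.beta.kubernetes.io/region": "topology.kubernetes.io/region",
}


def beta_label_problems(labels: dict[str, str]) -> list[str]:
    """List the beta labels that are absent or disagree with the GA label."""
    problems = []
    for beta, ga in LABEL_PAIRS.items():
        if beta not in labels:
            problems.append(f"missing {beta}")
        elif labels[beta] != labels.get(ga):
            problems.append(f"{beta} does not match {ga}")
    return problems


# Made-up node labels showing the reported state: only GA labels are set.
node_labels = {
    "topology.kubernetes.io/zone": "eastus-1",
    "topology.kubernetes.io/region": "eastus",
}
print(beta_label_problems(node_labels))
```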

Additional info:

Created https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3685 to try to get upstream to allow this behaviour to be controlled by a flag.

Description of problem:

Deployment of a standard masters+workers cluster using 4.13.0-rc.6 does not configure the cgroup structure according to OCPNODE-1539

Version-Release number of selected component (if applicable):

OCP 4.13.0-rc.6

How reproducible:

Always

Steps to Reproduce:

1. Deploy the cluster
2. Check for presence of /sys/fs/cgroup/cpuset/system*
3. Check the status of cpu balancing of the root cpuset cgroup (should be disabled)
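
Steps 2 and 3 can be scripted roughly as follows. The sketch assumes a cgroup v1 layout, where load balancing for the root cpuset is exposed in `cpuset.sched_load_balance` (1 = enabled, 0 = disabled). The function name and the returned dictionary are illustrative only.

```python
import glob
import os


def cpuset_report(root: str = "/sys/fs/cgroup/cpuset") -> dict:
    """Report system* cpuset directories and the root cpuset balancing flag."""
    system_dirs = sorted(glob.glob(os.path.join(root, "system*")))
    balance_file = os.path.join(root, "cpuset.sched_load_balance")
    balancing = None  # None means the flag could not be read
    if os.path.exists(balance_file):
        with open(balance_file) as handle:
            balancing = handle.read().strip() == "1"
    return {"system_cpusets": system_dirs, "root_balancing_enabled": balancing}


if __name__ == "__main__":
    print(cpuset_report())
```

On a correctly configured node, the report would list at least one system cpuset and show balancing disabled for the root cpuset.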

Actual results:

No system cpuset exists and all services are still present in the root cgroup with cpu balancing enabled.

Expected results:

 

Additional info:

The code has a bug we missed. It is nested under the Workload partitioning check on line https://github.com/haircommander/cluster-node-tuning-operator/blob/123e26df30c66fd5c9836726bd3e4791dfd82309/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L251

Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/87

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The customer is not able to install a new OCP bare metal IPI cluster. During bootstrapping, the provisioning interfaces on the master nodes do not get an IPv4 DHCP address from the bootstrap DHCP server.

Please refer to the following bug: https://issues.redhat.com/browse/OCPBUGS-872. The problem was solved by applying rd.net.timeout.carrier=30 to the kernel parameters of compute nodes via the cluster-baremetal operator. The fix also needs to be applied to the control plane.

  Reference: https://github.com/openshift/cluster-baremetal-operator/pull/286/files

 

Version-Release number of selected component (if applicable):

 

How reproducible:

Perform OCP 4.10.16 IPI BareMetal install.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Customer should be able to install the cluster without any issue.

Additional info:

 

This is a public clone of OCPBUGS-3821

The MCO can sometimes render a rendered-config in the middle of an upgrade with old MCs, e.g.:

  1. the containerruntimeconfigcontroller creates a new containerruntimeconfig due to the update
  2. the template controller finishes re-creating the base configs
  3. the kubeletconfig errors long enough and doesn't finish until after 2

This will cause the render controller to create a new rendered MC that uses the OLD kubeletconfig-MC, which at best means a double reboot for one node, and at worst blocks the update and breaks maxUnavailable nodes per pool.

The rendezvous host must be one of the control plane nodes.

The user has control over the rendezvous IP in the agent-config. In the case where they also provide NMState config with static IPs, we are able to verify whether the rendezvous IP points to a control plane node or not. If it does not, we should fail.
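
The check described above can be sketched as follows. The host dictionaries (with `hostname`, `role` and `static_ips` keys) are a hypothetical simplification of what would be extracted from the agent-config and its NMState sections, and the role value "master" for control plane hosts is an assumption of this example.

```python
import ipaddress


def check_rendezvous_ip(rendezvous_ip: str, hosts: list[dict]) -> None:
    """Raise ValueError if the rendezvous IP is a static IP of a non-control-plane host."""
    target = ipaddress.ip_address(rendezvous_ip)
    for host in hosts:
        addresses = {ipaddress.ip_address(ip) for ip in host.get("static_ips", [])}
        if target in addresses:
            if host.get("role") != "master":
                raise ValueError(
                    f"rendezvous IP {rendezvous_ip} belongs to host "
                    f"{host.get('hostname')!r}, which is not a control plane node"
                )
            return
    # No static IP matched: nothing can be verified, so do not fail here.


# Made-up hosts for illustration.
hosts = [
    {"hostname": "master-0", "role": "master", "static_ips": ["192.168.111.80"]},
    {"hostname": "worker-0", "role": "worker", "static_ips": ["192.168.111.90"]},
]
check_rendezvous_ip("192.168.111.80", hosts)  # passes silently
```

Comparing parsed addresses rather than raw strings avoids false mismatches between equivalent spellings of the same IPv6 address.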

Description of problem:
Due to changes in BUILD-407 which merged into release-4.12, we have a permafailing test `e2e-aws-csi-driver-no-refreshresource` and are unable to merge subsequent pull requests.

Version-Release number of selected component (if applicable):


How reproducible: Always

Steps to Reproduce:

1. Bring up cluster using release-4.12 or release-4.13 or master branch
2. Run `e2e-aws-csi-driver-no-refreshresource` test
3.

Actual results:
I1107 05:18:31.131666 1 mount_linux.go:174] Cannot run systemd-run, assuming non-systemd OS
I1107 05:18:31.131685 1 mount_linux.go:175] systemd-run failed with: exit status 1
I1107 05:18:31.131702 1 mount_linux.go:176] systemd-run output: System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down

Expected results:
Test should pass

Additional info:


Description of the problem:

When installing an SNO cluster with platformType == VSphere, there are no validations blocking the installation. This results in the cluster flapping between preparing-for-installation (ClusterAlreadyInstalling) and ready (The cluster currently requires 1 agents but only 0 have registered). I would expect the cluster to be pending user input to change platformType to none. Currently there is no indication of why the installation isn't actually starting; the logs only show that it is failing to generate the install config.

 

How reproducible:

100%

 

Steps to reproduce:

1. Create an SNO spoke cluster with platformType set to VSphere

 

Actual results:

The cluster cannot install successfully and doesn't indicate what the issue is. The RequirementsMet condition flaps indefinitely between true and false.

 

Expected results:

Validation blocks the installation from starting and indicates that an SNO cluster is incompatible with the VSphere platform.
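
One possible shape for such a validation is sketched below. It assumes that a single-node cluster is only valid with platformType none, as the description suggests; the function name, parameters and message text are hypothetical and do not come from the assisted installer code.

```python
def validate_sno_platform(control_plane_count: int, platform_type: str) -> list[str]:
    """Return validation errors for the platform choice; an empty list means valid."""
    errors = []
    # Assumption for this sketch: single-node clusters require platform "none".
    if control_plane_count == 1 and platform_type.lower() != "none":
        errors.append(
            f"single-node cluster is incompatible with platform {platform_type}; "
            "set platformType to none"
        )
    return errors


print(validate_sno_platform(1, "VSphere"))
print(validate_sno_platform(3, "VSphere"))
```

Returning a list of messages rather than raising lets the caller surface the reason to the user while keeping the cluster in a pending state.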

This is a clone of issue OCPBUGS-11187. The following is the description of the original issue:

Description of problem:

EgressIP was NOT migrated to the correct worker after the machine it was assigned to was deleted in a GCP XPN cluster.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-29-235439

How reproducible:

Always

Steps to Reproduce:

1. Set up GCP XPN cluster.
2. Scale two new worker nodes
% oc scale --replicas=2 machineset huirwang-0331a-m4mws-worker-c -n openshift-machine-api        
machineset.machine.openshift.io/huirwang-0331a-m4mws-worker-c scaled

3. Wait for the two new worker nodes to become ready.
 % oc get machineset -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
huirwang-0331a-m4mws-worker-a   1         1         1       1           86m
huirwang-0331a-m4mws-worker-b   1         1         1       1           86m
huirwang-0331a-m4mws-worker-c   2         2         2       2           86m
huirwang-0331a-m4mws-worker-f   0         0                             86m
% oc get nodes
NAME                                                          STATUS   ROLES                  AGE     VERSION
huirwang-0331a-m4mws-master-0.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-master-1.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-master-2.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal   Ready    worker                 71m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal   Ready    worker                 71m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   Ready    worker                 8m22s   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal   Ready    worker                 8m22s   v1.26.2+dc93b13
3. Label one new worker node as egress node
 % oc label node huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" 
node/huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal labeled

4. Create egressIP object
oc get egressIP
NAME         EGRESSIPS     ASSIGNED NODE                                                 ASSIGNED EGRESSIPS
egressip-1   10.0.32.100   huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   10.0.32.100
5. Label second new worker node as egress node 
% oc label node huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" 
node/huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal labeled
6. Delete the assigned egress node
% oc delete machines.machine.openshift.io huirwang-0331a-m4mws-worker-c-rhbkr  -n openshift-machine-api
machine.machine.openshift.io "huirwang-0331a-m4mws-worker-c-rhbkr" deleted
 % oc get nodes
NAME                                                          STATUS   ROLES                  AGE   VERSION
huirwang-0331a-m4mws-master-0.c.openshift-qe.internal         Ready    control-plane,master   87m   v1.26.2+dc93b13
huirwang-0331a-m4mws-master-1.c.openshift-qe.internal         Ready    control-plane,master   86m   v1.26.2+dc93b13
huirwang-0331a-m4mws-master-2.c.openshift-qe.internal         Ready    control-plane,master   87m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal   Ready    worker                 76m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal   Ready    worker                 76m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal   Ready    worker                 13m   v1.26.2+dc93b13
W0331 02:48:34.917391       1 egressip_healthcheck.go:162] Could not connect to huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal (10.129.4.2:9107): context deadline exceeded
W0331 02:48:34.917417       1 default_network_controller.go:903] Node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal is not ready, deleting it from egress assignment
I0331 02:48:34.917590       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:Logical_Switch_Port Row:map[options:{GoMap:map[router-port:rtoe-GR_huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6efd3c58-9458-44a2-a43b-e70e669efa72}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
E0331 02:48:34.920766       1 egressip.go:993] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal which is not reachable, will attempt rebalancing
E0331 02:48:34.920789       1 egressip.go:997] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal which is not ready, will attempt rebalancing
I0331 02:48:34.920808       1 egressip.go:1212] Deleting pod egress IP status: {huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal 10.0.32.100} for EgressIP: egressip-1

Actual results:

The egressIP was not migrated to the remaining egress-assignable worker; it still shows the deleted node as assigned.
 oc get egressIP      
NAME         EGRESSIPS     ASSIGNED NODE                                                 ASSIGNED EGRESSIPS
egressip-1   10.0.32.100   huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   10.0.32.100

Expected results:

The egressIP should be migrated from the deleted node to the correct worker.

Additional info:


Description of problem:

Branch name in repository pipelineruns list view should match the actual github branch name.

Version-Release number of selected component (if applicable):

4.11.z

How reproducible:

Always

Steps to Reproduce:

1. Create a repository
2. Trigger the pipelineruns by push or pull request event on the github 

Actual results:

Branch name contains "refs-heads-" prefix in front of the actual branch name eg: "refs-heads-cicd-demo" (cicd-demo is the branch name)

Expected results:

Branch name should be the actual GitHub branch name. Only `cicd-demo` should be shown in the branch column.

 

Additional info:
Ref: https://coreos.slack.com/archives/CHG0KRB7G/p1667564311865459
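
The expected behaviour can be illustrated with a minimal sketch. The helper name `display_branch_name` and the prefix list are hypothetical and not taken from the console code; the sketch only assumes that the stored value is either the sanitized form seen above ("refs-heads-...") or the plain git ref form ("refs/heads/...").

```python
# Hypothetical helper: the name and the prefix list are illustrative only.
REF_PREFIXES = ("refs-heads-", "refs/heads/")


def display_branch_name(raw_branch: str) -> str:
    """Return the branch name without a leading ref prefix."""
    for prefix in REF_PREFIXES:
        if raw_branch.startswith(prefix):
            # Strip only the leading prefix; the rest of the name is kept as-is.
            return raw_branch[len(prefix):]
    return raw_branch


branch = display_branch_name("refs-heads-cicd-demo")
```

Names that carry no prefix are returned unchanged, so the helper is safe to apply to values that are already plain branch names.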

User Story:

As an openshift dev, I want to be able to view trace-level Terraform logs in CI so that I can troubleshoot failures and open issues upstream.

Acceptance Criteria:

Description of criteria:

  • Trace level logs are captured
  • Ideally detailed logs would be captured in a separate file. If this is impossible, the trace-level output may be too verbose to be acceptable in the normal output.

Engineering Details:

Description of the problem:

Spoke cluster with vlan set using nmstate fails to deploy (from node journal):

Dec 04 07:09:48 localhost.localdomain inventory[1986]: time="04-12-2022 07:09:48" level=info msg="Executing biosdevname [-i enp1s0]" file="execute.go:39"
Dec 04 07:09:48 localhost.localdomain inventory[1986]: time="04-12-2022 07:09:48" level=info msg="Executing biosdevname [-i enp1s0.404]" file="execute.go:39"
Dec 04 07:09:48 localhost.localdomain inventory[1986]: time="04-12-2022 07:09:48" level=info msg="Executing biosdevname [-i cni-podman0]" file="execute.go:39"
Dec 04 07:09:48 localhost.localdomain inventory[1986]: time="04-12-2022 07:09:48" level=info msg="Executing dmidecode [-t 17]" file="execute.go:39"
Dec 04 07:09:48 localhost.localdomain inventory[1986]: time="04-12-2022 07:09:48" level=info msg="Executing cat [/sys/class/tpm/tpm0/tpm_version_major]" file="execute.go:39"
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x70 pc=0x89323c]
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: 
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: goroutine 1 [running]:
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: github.com/openshift/assisted-installer-agent/src/inventory.calculateHostname(0x0?)
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]:         /remote-source/app/src/inventory/inventory.go:99 +0x3c
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: github.com/openshift/assisted-installer-agent/src/inventory.processInventory(0xc0000ac480)
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]:         /remote-source/app/src/inventory/inventory.go:54 +0x74
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: github.com/openshift/assisted-installer-agent/src/inventory.ReadInventory(0xc0000b0960, 0xc000366fc0?)
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]:         /remote-source/app/src/inventory/inventory.go:31 +0x53e
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: github.com/openshift/assisted-installer-agent/src/inventory.CreateInventoryInfo(0xc0000b0960)
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]:         /remote-source/app/src/inventory/inventory.go:36 +0x58
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]: main.main()
Dec 04 07:09:48 localhost.localdomain nifty_dewdney[1974]:         /remote-source/app/src/inventory/main/main.go:15 +0x99
Dec 04 07:09:48 localhost.localdomain systemd[1]: libpod-e6125c23773b69356e2b30628829de6013daa597472917e7dfdd285cdfdd2269.scope: Succeeded.

The panic just continues in a loop.

How reproducible:
100% so far
 
Occurring so far on:

  • Spoke clusters trying to deploy 4.12
  • Using vlan interface set via nmstateconfig
  • ipv4 connected + ipv6 disconnected
  • SNO, HA multinode, Multinode Non-platform

Versions:

  • Spoke cluster: registry.ci.openshift.org/ocp/release:4.12.0-rc.2
  • MCE bundle: registry-proxy.engineering.redhat.com/rh-osbs/multicluster-engine-mce-operator-bundle:v2.2.0-229
  • Hub cluster: 4.12

Steps to reproduce:

1. Deploy spoke cluster from assisted operator using ztp flow and set interface to use vlan

Actual results:
Node journal reports: panic: runtime error: invalid memory address or nil pointer dereference 

Expected results:
Cluster deploys successfully

Description of problem:


--> Service name search ability while creating the Route from the console

2. What is the nature and description of the request?
--> While creating a route from the console (OCP dashboard), there is no option to search for a service by name; the service can only be selected from the drop-down list. We need search capability so that the user can type the service name and select the service that appears at the top of the search results.

3. Why does the customer need this? (List the business requirements here)
--> Selecting a service from the drop-down list can be tedious. In one customer case there are 150 services in the namespace, and the user has to scroll a long way to select the service.

4. List any affected packages or components.
--> OCP console

5. Expected result.
--> Have the ability to type the service name while creating the route.

Description of the problem:
Assisted service generates an icsp for a spoke cluster install config using the registries.conf content from the mirror configmap.

When a single source has multiple mirror entries, the install-config contains only the first mirror entry.
 
A case where a customer could run into this is when some images were mirrored using

oc adm release mirror

and then later switched to

oc-mirror

The two commands put release and payload images in different paths, so both could be required as options.

How reproducible:
100%
 

Steps to reproduce:

1. Create mirror configmap containing registries.conf entry with multiple registry mirror entries, ex:

oc get configmap mirror-registry-ca  -o yaml
[..]
    [[registry]]
      prefix = ""
      location = "registry.ci.openshift.org/ocp"
      mirror-by-digest-only = true
 
      [[registry.mirror]]
        location = "registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/ocp"
 
      [[registry.mirror]]
        location = "registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/openshift"
 
      [[registry.mirror]]
        location = "registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/openshift4"
[..]

2. Attempt to deploy cluster using ZTP

3. Find cluster install-config.yaml file under /data within the assisted service pod
ex: cat /data/7b61d395-07a5-45c2-89a8-94d6ad17a635/install-config.yaml

Actual results:
Install-config icsp section shows:

[..]
	imageContentSources:
	- mirrors:
	  - registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/ocp
	  source: registry.ci.openshift.org/ocp
[..]

Expected results:
[..]

	imageContentSources:
	- mirrors:
	  - registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/ocp
	  - registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/openshift
	  - registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/openshift4
	  source: registry.ci.openshift.org/ocp
[..]
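
The expected mapping can be sketched as follows. This is not the assisted-service implementation; the function name `build_image_content_sources` is hypothetical, and the input is assumed to be the list of `[[registry]]` tables as a TOML parser would return them for the registries.conf excerpt above. The point of the sketch is that every `[[registry.mirror]]` location is carried over, not just the first.

```python
from typing import Any, Dict, List


def build_image_content_sources(registries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Map parsed [[registry]] tables to imageContentSources entries, keeping every mirror."""
    sources = []
    for registry in registries:
        # Collect all [[registry.mirror]] locations, not only the first one.
        mirrors = [m["location"] for m in registry.get("mirror", []) if m.get("location")]
        if mirrors:
            sources.append({"source": registry["location"], "mirrors": mirrors})
    return sources


# Same entries as the registries.conf excerpt, in already-parsed form (assumed shape).
parsed_registries = [
    {
        "prefix": "",
        "location": "registry.ci.openshift.org/ocp",
        "mirror-by-digest-only": True,
        "mirror": [
            {"location": "registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/ocp"},
            {"location": "registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/openshift"},
            {"location": "registry.ocp-edge-cluster-assisted-0.qe.lab.redhat.com:5005/openshift4"},
        ],
    }
]

image_content_sources = build_image_content_sources(parsed_registries)
```

Registries without any mirror are skipped in this sketch, on the assumption that an entry with an empty mirror list has no place in the install-config.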

This is a clone of issue OCPBUGS-7440. The following is the description of the original issue:

Description of problem:

While trying to figure out why it takes so long to install single-node OpenShift, I noticed that the kube-controller-manager cluster operator is degraded for ~5 minutes due to:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
I don't understand how the prometheusClient is successfully initialized when we get a connection refused as soon as we try to query the rules.
Note that if the client initialization fails, the kube-controller-manager won't set GarbageCollectorDegraded to true.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. install SNO with bootstrap in place (https://github.com/eranco74/bootstrap-in-place-poc)

2. monitor the cluster operators status

Actual results:

GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused 

Expected results:

Expected the GarbageCollectorDegraded status to be false

Additional info:

It seems that for PrometheusClient to be successfully initialised it needs to successfully create a connection but we get connection refused once we make the query.
Note that installing SNO with this patch (https://github.com/eranco74/cluster-kube-controller-manager-operator/commit/26e644503a8f04aa6d116ace6b9eb7b9b9f2f23f) reduces the installation time by 3 minutes


This is a clone of issue OCPBUGS-12775. The following is the description of the original issue:

Description of problem:

We need to update the operator to be synced with the K8 api version used by OCP 4.13. We also need to sync our samples libraries with latest available libraries. Any deprecated libraries should be removed as well.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Azure IPI now creates boot images using the image gallery API and creates two image definition resources, one each for hyperVGeneration V1 and V2. For an arm64 cluster, the architecture in the hyperVGeneration V1 image definition is x64, but it should be Arm64.

Version-Release number of selected component (if applicable):

./openshift-install version
./openshift-install 4.12.0-0.nightly-arm64-2022-10-07-204251
built from commit 7b739cde1e0239c77fabf7622e15025d32fc272c
release image registry.ci.openshift.org/ocp-arm64/release-arm64@sha256:d2569be4ba276d6474aea016536afbad1ce2e827b3c71ab47010617a537a8b11
release architecture arm64

How reproducible:

always

Steps to Reproduce:

1.Create arm cluster using latest arm64 nightly build 
2.Check image definition created for hyperVGeneration V1

Actual results:

The architecture field is x64.
###
$ az sig image-definition show --gallery-name ${gallery_name} --gallery-image-definition lwanazarm1008-rc8wh --resource-group ${rg} | jq -r ".architecture"
x64
The image version under this image definition is for aarch64.
###
$ az sig image-version show --gallery-name gallery_lwanazarm1008_rc8wh --gallery-image-definition lwanazarm1008-rc8wh --resource-group lwanazarm1008-rc8wh-rg --gallery-image-version 412.86.20220922 | jq -r ".storageProfile.osDiskImage.source"
{  "uri": "https://clustermuygq.blob.core.windows.net/vhd/rhcosmuygq.vhd"}
$ az storage blob show --container-name vhd --name rhcosmuygq.vhd --account-name clustermuygq --account-key $account_key | jq -r ".metadata"
{  "Source_uri": "https://rhcos.blob.core.windows.net/imagebucket/rhcos-412.86.202209220538-0-azure.aarch64.vhd"}

Expected results:

Although no VMs with hyperVGeneration V1 can be provisioned, the architecture field should be Arm64 even for hyperVGeneration V1 image definitions.

Additional info:

1. The architecture in the hyperVGeneration V2 image definition is Arm64, and the installer uses V2 by default for arm64 vm_type, so installation does not fail by default. The architecture still needs to be made consistent in V1.

2. The architecture field needs to be set for both V1 and V2; currently it is set only in the V2 image definition resource.
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L100-L128 

Description of problem:

When a MCCDrainError alert is triggered the alert's message says that the drain problem is happening in the wrong node.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         4h59m   Cluster version is 4.13.0-0.nightly-2022-12-22-120609

How reproducible:

Always

Steps to Reproduce:

1. Create a PodDisruptionBudget resource

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
        app: dontevict

2. Create a pod matching the PodDisruptionBudget

$ oc run --restart=Never --labels app=dontevict  --image=docker.io/busybox dont-evict-this-pod -- sleep 3h


3. Create a MC

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        mode: 420
        path: /etc/test

4. Wait 1 hour for the MCCDrainError alert to be triggered

Actual results:


The alert is like

$ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring alertmanager-main -o jsonpath={.spec.host})/api/v1/alerts | jq 
.....
 {
    "activeAt": "2022-12-23T11:24:05.807925776Z",
    "annotations": {
        "message": "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may be blocked. For more details check MachineConfigController pod logs: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller"
    },
    "labels": {
        "alertname": "MCCDrainError",
        "container": "oauth-proxy",
        "endpoint": "metrics",
        "instance": "10.130.0.10:9001",
        "job": "machine-config-controller",
        "namespace": "openshift-machine-config-operator",
        "node": "ip-10-0-193-114.us-east-2.compute.internal",
        "pod": "machine-config-controller-5468769874-44tnt",
        "service": "machine-config-controller",
        "severity": "warning"
    },
    "state": "firing",
    "value": "1e+00"
}

The alert message is wrong, since the reported node in "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may....." is not the node where the drain problem happened, but the node running the controller pod.

Expected results:


The alert message should not point to a wrong node, since it can mislead the user.

Additional info:


Description of problem:

Frequently we see the loading state of the topology view, even when there aren't many resources in the project.

Including an example

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. load topology
  2. if it loads successfully, keep trying  until it fails to load

Actual results:

topology will sometimes hang with the loading indicator showing indefinitely

Expected results:

topology should load consistently without fail

Reproducibility (Always/Intermittent/Only Once):

intermittent

Build Details:

4.9

Additional info:

Description of problem:

Git icon shown in the repository details page should be based on the git provider.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

 

Steps to Reproduce:

1. Create a Repository with gitlab repo url
2. Trigger a PLR for the repository 
3. Navigate to the PLR details page

Actual results:

The GitHub icon is displayed for the GitLab URL, and the repository URL is not correct.

Expected results:

The GitLab icon should be displayed for the GitLab URL, and the repository URL should be correct.

Additional info:

use `GitLabIcon` and `BitBucketIcon` from patternfly react-icons.

Description of problem:

When infrastructure-agent fails to reconcile assisted-service and konnectivity-agent, a runtime panic occurs that causes the infrastructure-operator to enter a CrashLoopBackOff state.
This situation should be caught and handled gracefully.

How reproducible:

Create a situation where the infrastructure operator must reconcile assisted-service and konnectivity-agent when the konnectivity agent is missing.
(Discovered when deploying a hypershift zero workers HypershiftAgentServiceConfig in an MCE environment; other situations are possible.)

Actual results:

$ oc get pods -n multicluster-engine |grep infra
infrastructure-operator-9d5ddc6d-plnrg                 0/1     CrashLoopBackOff   8 (3m43s ago)   81m

$ oc logs -n multicluster-engine infrastructure-operator-9d5ddc6d-plnrg 

...
time="2022-12-11T15:17:04Z" level=info msg="reconciling the assisted-service konnectivity agent" go-id=1642 hypershift_service_config=hypershift-agent hypershift_service_namespace=spoke-0 request_id=dd9d81ec-3ff5-4d74-9138-5ad8ad005046
time="2022-12-11T15:17:04Z" level=error msg="Failed retrieving konnectivity-agend Deployment from namespace spoke-0" error="Deployment.apps \"konnectivity-agent\" not found" go-id=1642 hypershift_service_config=hypershift-agent hypershift_service_namespace=spoke-0 request_id=dd9d81ec-3ff5-4d74-9138-5ad8ad005046
time="2022-12-11T15:17:04Z" level=info msg="HypershiftAgentServiceConfig Reconcile ended" go-id=1642 hypershift_service_config=hypershift-agent hypershift_service_namespace=spoke-0 request_id=dd9d81ec-3ff5-4d74-9138-5ad8ad005046
panic: runtime error: index out of range [0] with length 0
...

Expected results:

A graceful error message is logged, and the infrastructure operator doesn't enter a CrashLoopBackOff state.

 

Description of problem:

we don't close the filter dropdown in https://github.com/openshift/console/blob/master/frontend/packages/integration-tests-cypress/views/list-page.ts#L56 so it is impossible to filter by multiple items in sequence

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginx bug report). In such cases, the Installer errors with:

level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []

Listing container objects suffers from the same issue as listing the containers, and this one isn't fixed in the latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and am fixing it with https://github.com/gophercloud/gophercloud/issues/2510; however, we likely won't be able to backport the bump to gophercloud master back to release-4.8, so we'll have to look for alternatives.

I'm setting the priority to critical as it's causing all our jobs to fail in master.
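
One possible shape for a tolerant parser is sketched below. This is not gophercloud code: the function name `extract_object_names` is hypothetical, and the sketch assumes the caller has already lower-cased the header names. It treats a 204 or an empty body as an empty listing before looking at the content type, which is the case that fails above.

```python
import json
from typing import Dict, List


def extract_object_names(status_code: int, headers: Dict[str, str], body: bytes) -> List[str]:
    """Return object names from a container listing response."""
    # A 204 (or any empty body) means there is nothing to list, so the
    # content-type header is irrelevant and may legitimately be absent.
    if status_code == 204 or not body.strip():
        return []

    # Assumes header names were already normalised to lower case by the caller.
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if content_type == "application/json":
        # JSON listings are assumed to be a list of objects with a "name" key.
        return [entry["name"] for entry in json.loads(body)]
    if content_type == "text/plain":
        return [line for line in body.decode("utf-8").splitlines() if line]
    raise ValueError(f"cannot extract names from response with content-type: {content_type!r}")


empty_listing = extract_object_names(204, {}, b"")
```

A non-empty body with a missing or unknown content type still raises, so genuinely malformed responses are not silently ignored.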

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.

One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby 
   /etc/node-sizing.env 
on its master nodes contained an empty SYSTEM_RESERVED_ES value:

---
cat c
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---

causing the kubelet to not start up. To restore service, this file was manually updated to set a value (1Gi), and kubelet was restarted.

We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.

A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. Its worker nodes were affected by a similar problem, in this case because a custom node-sizing-enabled.env MachineConfig did not set SYSTEM_RESERVED_ES.

This caused existing worker nodes to go into a NotReady state after the upgrade, and new nodes did not join the cluster because their kubelet was affected in the same way.

For clusterB, the reason the value is empty is better understood.

However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start. 

We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps  through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17

Expected results:

If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env then a default should be applied and/or kubelet able to continue running.
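
The "safe default" request can be sketched as follows. This is only an illustration, not MCO or kubelet code: the function name `parse_node_sizing` is hypothetical, and the 1Gi fallback is simply the value that was set by hand during recovery above, not a confirmed product default.

```python
from typing import Dict, Optional

# Assumption: 1Gi is the value used in the manual recovery described above.
DEFAULTS = {"SYSTEM_RESERVED_ES": "1Gi"}


def parse_node_sizing(text: str, defaults: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Parse KEY=VALUE lines and fill in defaults for missing or empty keys."""
    defaults = DEFAULTS if defaults is None else defaults
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    for key, fallback in defaults.items():
        # Covers both "KEY=" (empty) and a key that is absent altogether.
        if not values.get(key):
            values[key] = fallback
    return values


sizing = parse_node_sizing(
    "SYSTEM_RESERVED_MEMORY=5.36Gi\nSYSTEM_RESERVED_CPU=0.11\nSYSTEM_RESERVED_ES=\n"
)
```

The same fallback applies to the clusterB case, where the custom file never mentions SYSTEM_RESERVED_ES at all.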

Additional info:

 

This is a clone of issue OCPBUGS-10638. The following is the description of the original issue:

Description of problem:

The agent create sub-command shows a fatal error when it is given an invalid sub-command.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Execute `openshift-install agent create invalid`

Actual results:

FATA[0000] Error executing openshift-install: accepts 0 arg(s), received 1 

Expected results:

It should return the help of the create command.

Additional info:

 

Presently when there are NTP sync issues where servers are out of sync, we advise the user in the following way.

please configure an NTP server via DHCP or set clock manually

This has led to a number of Triage tickets where there are TLS issues due to a loss of clock sync for a node.

https://issues.redhat.com/browse/AITRIAGE-4985
https://issues.redhat.com/browse/AITRIAGE-4981
https://issues.redhat.com/browse/AITRIAGE-4937
https://issues.redhat.com/browse/AITRIAGE-4936
https://issues.redhat.com/browse/AITRIAGE-4926
https://issues.redhat.com/browse/AITRIAGE-4800

We should change the text so that it advises how to set the clock manually in a manner that will persist across reboots.

please configure an NTP server via DHCP or set clock manually in a persistent fashion <add example here>

We are unsure of what this method of persisting the manual clock change should be so this will need to be investigated as part of the ticket.

For more information, please see this topic
https://redhat-internal.slack.com/archives/C02CP89N4VC/p1673524970034749

Tracker issue for bootimage bump in 4.13. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-11789.

Currently the agent installer supports only the x86_64 architecture. The image creation command must fail if any other architecture is configured.

We want to have an allowed list of architectures.

allowed = ['x86_64', 'amd64']
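
A minimal sketch of such a check, using the allowed list above. The function name `validate_architecture` and the error wording are hypothetical; the installer's real validation may be structured differently.

```python
# Allowed list taken from the text above.
ALLOWED_ARCHITECTURES = ("x86_64", "amd64")


def validate_architecture(arch: str) -> None:
    """Raise ValueError if the configured architecture is not in the allowed list."""
    if arch not in ALLOWED_ARCHITECTURES:
        allowed = ", ".join(ALLOWED_ARCHITECTURES)
        raise ValueError(f"unsupported architecture {arch!r}; allowed values: {allowed}")


validate_architecture("amd64")  # accepted, returns None
```

Keeping the values in one tuple means that adding a newly supported architecture later is a one-line change.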

Description of problem:

when using the OnDelete update method of the ControlPlaneMachineSet, it should not be possible to have multiple machines in the Running phase in the same machine index. eg, if machine-1 is in Running phase, we should not have a machine-replacement-1 also in the Running phase.

Version-Release number of selected component (if applicable):

4.12 / main branch

How reproducible:

unsure, this is currently not tested in the code and is difficult to produce

Steps to Reproduce:

1. setup a cluster with CPMS in OnDelete update mode
2. rename one of the master machines to have the same index as another, or manually create a machine to match. this step might be difficult to reproduce.
3. observe logs from CPMS operator

Actual results:

no errors are emitted about the extra machine, although perhaps others are. operator does not degrade.

Expected results:

an error should be produced and the operator should go degraded

Additional info:

this bug is slightly predictive, we have not observed this condition but have detected a gap in the code that might make it possible.
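
The missing check can be sketched as a grouping of machines by index. This is not the CPMS operator's code; the function name and the `(name, index, phase)` tuple shape are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_running_indexes(
    machines: Iterable[Tuple[str, int, str]]
) -> Dict[int, List[str]]:
    """Return {index: [machine names]} for indexes with more than one Running machine."""
    running: Dict[int, List[str]] = defaultdict(list)
    for name, index, phase in machines:
        if phase == "Running":
            running[index].append(name)
    # Only indexes with two or more Running machines are a problem.
    return {index: names for index, names in running.items() if len(names) > 1}


duplicates = find_duplicate_running_indexes(
    [
        ("machine-0", 0, "Running"),
        ("machine-1", 1, "Running"),
        ("machine-replacement-1", 1, "Running"),
        ("machine-2", 2, "Provisioning"),
    ]
)
```

A non-empty result is the condition under which the operator would be expected to report an error and go degraded.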

This is a clone of issue OCPBUGS-11072. The following is the description of the original issue:

Dummy bug to track adding the test to openshift/origin.

Description of problem:

See: https://issues.redhat.com/browse/CPSYN-143

tldr:  Based on the previous direction that 4.12 was going to enforce PSA restricted by default, OLM had to make a few changes because the way we run catalog pods (and we have to run them that way because of how the opm binary worked) was incompatible w/ running restricted.

1) We set openshift-marketplace to enforce restricted (this was our choice, we didn't have to do it, but we did)
2) we updated the opm binary so catalog images using a newer opm binary don't have to run privileged
3) we added a field to catalogsource that allows you to choose whether to run the pod privileged(legacy mode) or restricted.  The default is restricted.  We made that the default so that users running their own catalogs in their own NSes (which would be default PSA enforcing) would be able to be successful w/o needing their NS upgraded to privileged.

Unfortunately this means:
1) legacy catalog images (i.e. using older opm binaries) won't run on 4.12 by default (the catalogsource needs to be modified to specify legacy mode).
2) legacy catalog images cannot be run in the openshift-marketplace NS since that NS does not allow privileged pods.  This means legacy catalogs can't contribute to the global catalog (since catalogs must be in that NS to be in the global catalog).

Before 4.12 ships we need to:
1) remove the PSA restricted label on the openshift-marketplace NS
2) change the catalogsource securitycontextconfig mode default to use "legacy" as the default, not restricted.

This gives catalog authors another release to update to using a newer opm binary that can run restricted, or get their NSes explicitly labeled as privileged (4.12 will not enforce restricted, so in 4.12 using the legacy mode will continue to work)

In 4.13 we will need to revisit what we want the default to be, since at that point catalogs will start breaking if they try to run in legacy mode in most NSes.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


OCPBUGS-3278 is supposed to fix the issue where the user was required to provide data about the baremetal hosts (including MAC addresses) in the install-config, even though this data is ignored.

However, we determine whether we should disable the validation by checking the second CLI arg to see if it is agent.

This works when the command is:

openshift-install agent create image --dir=whatever

But fails when the argument is e.g., as in dev-scripts:

openshift-install --log-level=debug --dir=whatever agent create image
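
One way to make the check independent of argument position is to look at the first positional argument instead of a fixed index. The sketch below is hypothetical and not the installer's code; it assumes flags are written in `--name=value` form as in the two commands above, so a flag whose value is passed as a separate argument would be misread.

```python
from typing import List, Optional


def first_subcommand(argv: List[str]) -> Optional[str]:
    """Return the first argument after the program name that is not a flag."""
    for arg in argv[1:]:
        # Assumes --name=value style flags; a detached flag value would be
        # mistaken for a positional argument.
        if not arg.startswith("-"):
            return arg
    return None


def is_agent_command(argv: List[str]) -> bool:
    return first_subcommand(argv) == "agent"


flags_last = is_agent_command(
    ["openshift-install", "agent", "create", "image", "--dir=whatever"]
)
flags_first = is_agent_command(
    ["openshift-install", "--log-level=debug", "--dir=whatever", "agent", "create", "image"]
)
```

Both invocations are recognised as agent commands here, whereas a check on the second argument alone only matches the first one.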

Description of problem:

When you migrate a HostedCluster, the AWSEndpointService from the old management cluster conflicts with the one on the new management cluster. The AWSPrivateLink_Controller does not have any validation for this case, which is needed for the Disaster Recovery HostedCluster migration to work. The issue arises when the nodes of the HostedCluster cannot join the new management cluster because the AWSEndpointServiceName is still pointing to the old one.

Version-Release number of selected component (if applicable):

4.12
4.13
4.14

How reproducible:

Follow the migration procedure from the upstream documentation; the nodes in the destination HostedCluster will remain in the NotReady state.

Steps to Reproduce:

1. Setup a management cluster with the 4.12-13-14/main version of the HyperShift operator.
2. Run the in-place node DR Migrate E2E test from this PR https://github.com/openshift/hypershift/pull/2138:
bin/test-e2e \
  -test.v \
  -test.timeout=2h10m \
  -test.run=TestInPlaceUpgradeNodePool \
  --e2e.aws-credentials-file=$HOME/.aws/credentials \
  --e2e.aws-region=us-west-1 \
  --e2e.aws-zones=us-west-1a \
  --e2e.pull-secret-file=$HOME/.pull-secret \
  --e2e.base-domain=www.mydomain.com \
  --e2e.latest-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.previous-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.skip-api-budget \
  --e2e.aws-endpoint-access=PublicAndPrivate

Actual results:

The nodes stay in NotReady state

Expected results:

The nodes should join the migrated HostedCluster

Additional info:

 

This is a clone of issue OCPBUGS-10526. The following is the description of the original issue:

Description of problem:


Version-Release number of selected component (if applicable):

 4.13.0-0.nightly-2023-03-17-161027 

How reproducible:

Always

Steps to Reproduce:

1.  Create a GCP XPN cluster with flexy job template ipi-on-gcp/versioned-installer-xpn-ci, then 'oc describe node'

2. Check logs for cloud-network-config-controller pods

Actual results:


 % oc get nodes
NAME                                                          STATUS   ROLES                  AGE    VERSION
huirwang-0309d-r85mj-master-0.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-1.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-2.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-b-5txgq.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
 In the `oc describe node` output, there are no egressIP-related annotations:
% oc describe node huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal 
Name:               huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n2-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
                    kubernetes.io/os=linux
                    machine.openshift.io/interruptible-instance=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=n2-standard-4
                    node.openshift.io/os_id=rhcos
                    topology.gke.io/zone=us-central1-a
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-a
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/openshift-qe/zones/us-central1-a/instances/huirwang-0309d-r85mj-worker-a-wsrls"}
                    k8s.ovn.org/host-addresses: ["10.0.32.117"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal","mac-address":"42:01:0a:00:...
                    k8s.ovn.org/node-chassis-id: 7fb1870c-4315-4dcb-910c-0f45c71ad6d3
                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.5/16"}
                    k8s.ovn.org/node-mgmt-port-mac-address: 16:52:e3:8c:13:e2
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.32.117/32"}
                    k8s.ovn.org/node-subnets: {"default":["10.131.0.0/23"]}
                    machine.openshift.io/machine: openshift-machine-api/huirwang-0309d-r85mj-worker-a-wsrls
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true


 % oc logs cloud-network-config-controller-5cd96d477d-2kmc9  -n openshift-cloud-network-config-controller  
W0320 03:00:08.981493       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0320 03:00:08.982280       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
E0320 03:00:38.982868       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com: i/o timeout
E0320 03:01:23.863454       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: read udp 10.129.0.14:52109->172.30.0.10:53: read: connection refused
I0320 03:02:19.249359       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0320 03:02:19.250662       1 controller.go:88] Starting node controller
I0320 03:02:19.250681       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0320 03:02:19.250693       1 controller.go:88] Starting secret controller
I0320 03:02:19.250703       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0320 03:02:19.250709       1 controller.go:88] Starting cloud-private-ip-config controller
I0320 03:02:19.250715       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0320 03:02:19.258642       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258671       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258682       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal to node workqueue
I0320 03:02:19.351258       1 controller.go:96] Starting node workers
I0320 03:02:19.351303       1 controller.go:102] Started node workers
I0320 03:02:19.351298       1 controller.go:96] Starting secret workers
I0320 03:02:19.351331       1 controller.go:102] Started secret workers
I0320 03:02:19.351265       1 controller.go:96] Starting cloud-private-ip-config workers
I0320 03:02:19.351508       1 controller.go:102] Started cloud-private-ip-config workers
E0320 03:02:19.589704       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.615551       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.644628       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.774047       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.783309       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.816430       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue

Expected results:

EgressIP should work

Additional info:

It can be reproduced in 4.12 as well, so it is not a regression.

This is a clone of issue OCPBUGS-10916. The following is the description of the original issue:

Description of problem:

The alert shown after creating an image pull secret in the Container image flow displays the literal string `Secret {{newImageSecret}} was created.`

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Navigate to the +Add page
2. Open the Container Image form
3. Click on the Create an Image pull secret link and create a secret

Actual results:

`Secret {{newImageSecret}} was created.` is rendered in the alert

Expected results:

`Secret <-Secret name-> was created.` should be rendered in the alert
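
The symptom is typical of a message template that is rendered without its interpolation values. The following is a minimal Python sketch of `{{name}}`-style interpolation, not the console's actual code; the `interpolate` helper and the secret name `my-secret` are hypothetical.

```python
import re

# Matches placeholders such as {{newImageSecret}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def interpolate(template, values):
    """Replace each {{name}} with values[name]; leave unknown names untouched."""
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )


message = "Secret {{newImageSecret}} was created."

# Without values, the raw placeholder leaks into the alert text.
broken = interpolate(message, {})

# With the value supplied, the secret name is rendered.
fixed = interpolate(message, {"newImageSecret": "my-secret"})
```

Here `broken` reproduces the reported string, while `fixed` contains the secret name in place of the placeholder.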

Additional info:

 

Description of problem:

The installer fails to give a clear error message when the zones do not match the subnets in BYON

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. install-config.yaml 
 yq '.controlPlane.platform.ibmcloud.zones,.platform.ibmcloud.controlPlaneSubnets' install-config.yaml 
["ca-tor-1", "ca-tor-2", "ca-tor-3"]
- ca-tor-existing-network-1-cp-ca-tor-2
- ca-tor-existing-network-1-cp-ca-tor-3
2. openshift-install create manifests --dir byon-az-test-1

Actual results:

FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: no subnet found for ca-tor-1

Expected results:

A clearer error message about the mismatch in install-config.yaml
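
One way to produce a clearer message is to compare the configured zones with the zones covered by the supplied subnets before any machine objects are generated. The following is a minimal Python sketch of that idea, not the installer's actual (Go) validation; the function name, the subnet-to-zone mapping, and the message wording are all hypothetical.

```python
def validate_zones(zones, subnet_zones):
    """Return one error string per configured zone that no subnet covers.

    zones:        values of controlPlane.platform.ibmcloud.zones
    subnet_zones: hypothetical mapping of subnet name -> zone for the entries
                  in platform.ibmcloud.controlPlaneSubnets
    """
    covered = set(subnet_zones.values())
    errors = []
    for zone in zones:
        if zone not in covered:
            errors.append(
                f"controlPlane.platform.ibmcloud.zones: zone {zone!r} has no "
                f"subnet in platform.ibmcloud.controlPlaneSubnets "
                f"(subnets cover: {sorted(covered)})"
            )
    return errors


# Values taken from the reproduction steps above.
errors = validate_zones(
    ["ca-tor-1", "ca-tor-2", "ca-tor-3"],
    {
        "ca-tor-existing-network-1-cp-ca-tor-2": "ca-tor-2",
        "ca-tor-existing-network-1-cp-ca-tor-3": "ca-tor-3",
    },
)
```

With the values from the report, `errors` contains a single entry that names `ca-tor-1` and the install-config fields involved.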

Additional info:

 

 

 

 

After upgrading the Hub Cluster from 4.9.x to 4.11.17, we can't remove AgentClusterInstall CRs.

We have ~15 managed clusters (all SNO) which were deployed using assisted service CRs under an older version (ACM 2.4.6). The hub was then upgraded to ACM 2.5.5 and then to 2.6.2. Any attempt to delete an AgentClusterInstall CR does not complete, and this message shows up in the logs:

time="2022-11-30T18:33:01Z" level=error msg="failed to remove finalizer agentclusterinstall.agent-install.openshift.io/ai-deprovision from resource cnfde22 cnfde22" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).agentClusterInstallFinalizer" file="/remote-source/assisted-service/app/internal/controller/controllers/clusterdeployments_controller.go:300" agent_cluster_install=cnfde22 agent_cluster_install_namespace=cnfde22 cluster_deployment=cnfde22 cluster_deployment_namespace=cnfde22 error="admission webhook \"agentclusterinstallvalidators.admission.agentinstall.openshift.io\" denied the request: Attempted to change AgentClusterInstall.Spec which is immutable after install started, except for ClusterMetadata fields. Unsupported change: \n\tNetworking.UserManagedNetworking: (<nil> => 0xc000baeeaa)" go-id=903 request_id=fe238091-7b1b-4f4c-8d17-1e2ddc0892b3

Description of problem:

When running openshift-install agent create image with an install-config.yaml that does not contain platform baremetal settings (except for VIPs), warnings are still generated, as shown below:
DEBUG         Loading Install Config...            
WARNING Platform.Baremetal.ClusterProvisioningIP: 172.22.0.3 is ignored 
DEBUG Platform.Baremetal.BootstrapProvisioningIP: 172.22.0.2 is ignored 
WARNING Platform.Baremetal.ExternalBridge: baremetal is ignored 
WARNING Platform.Baremetal.ExternalMACAddress: 52:54:00:12:e1:68 is ignored 
WARNING Platform.Baremetal.ProvisioningBridge: provisioning is ignored 
WARNING Platform.Baremetal.ProvisioningMACAddress: 52:54:00:82:91:8d is ignored 
WARNING Platform.Baremetal.ProvisioningNetworkCIDR: 172.22.0.0/24 is ignored 
WARNING Platform.Baremetal.ProvisioningDHCPRange: 172.22.0.10,172.22.0.254 is ignored 
WARNING Capabilities: %!!(MISSING)s(*types.Capabilities=<nil>) is ignored 

It looks like these fields are populated with values from libvirt as shown in .openshift_install_state.json:
            "platform": {
                "baremetal": {
                    "libvirtURI": "qemu:///system",
                    "clusterProvisioningIP": "172.22.0.3",
                    "bootstrapProvisioningIP": "172.22.0.2",
                    "externalBridge": "baremetal",
                    "externalMACAddress": "52:54:00:12:e1:68",
                    "provisioningNetwork": "Managed",
                    "provisioningBridge": "provisioning",
                    "provisioningMACAddress": "52:54:00:82:91:8d",
                    "provisioningNetworkInterface": "",
                    "provisioningNetworkCIDR": "172.22.0.0/24",
                    "provisioningDHCPRange": "172.22.0.10,172.22.0.254",
                    "hosts": null,
                    "apiVIPs": [
                        "10.1.101.7",
                        "2620:52:0:165::7"
                    ],
                    "ingressVIPs": [
                        "10.1.101.9",
                        "2620:52:0:165::9"
                    ]

The install-config.yaml used to generate this has the following snippet:
platform:
  baremetal:
    apiVIPs:
    - 10.1.101.7
    - 2620:52:0:165::7
    ingressVIPs:
    - 10.1.101.9
    - 2620:52:0:165::9
additionalTrustBundle: |

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Happens every time

Steps to Reproduce:

1. Use install-config.yaml with no platform baremetal fields except for the VIPs
2. run openshift-install agent create image 

Actual results:

Warning messages are output

Expected results:

No warning messages
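
The expected behavior amounts to warning only about ignored fields that the user actually wrote, not about values filled in by defaulting. The following is a minimal Python sketch of that check, not the installer's implementation; the helper name and the message format are assumptions, while the field names come from the state file shown above.

```python
def ignored_field_warnings(user_platform, ignored_fields):
    """Build warnings only for ignored fields present in the user's own config.

    user_platform:  the platform.baremetal mapping exactly as the user wrote
                    it, before any defaults are applied
    ignored_fields: field names that the agent-based flow does not use
    """
    warnings = []
    for name in ignored_fields:
        value = user_platform.get(name)
        if value is not None:
            warnings.append(f"platform.baremetal.{name}: {value} is ignored")
    return warnings


IGNORED = ["clusterProvisioningIP", "externalBridge", "provisioningBridge"]

# Only VIPs were set by the user, as in the install-config.yaml snippet above.
vips_only = {
    "apiVIPs": ["10.1.101.7", "2620:52:0:165::7"],
    "ingressVIPs": ["10.1.101.9", "2620:52:0:165::9"],
}
quiet = ignored_field_warnings(vips_only, IGNORED)

# A user who explicitly sets an ignored field still gets a warning.
noisy = ignored_field_warnings({**vips_only, "externalBridge": "baremetal"}, IGNORED)
```

In this sketch `quiet` is empty, and `noisy` holds exactly one warning for `externalBridge`.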

Additional info:

 

Description of problem:

When installing on vSphere, the installer fails with the install-config validation error shown under Actual results.

We started to see the error in ODF QE CI, probably after this change was merged: https://github.com/openshift/installer/commit/8579a12abd50a1b84c115ba71336e446698239b1

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-22-192922

How reproducible:

 

Steps to Reproduce:

1. Install an OCP 4.13 nightly
2. Use this template to render the config file: https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/templates/ocp-deployment/install-config-vsphere-upi.yaml.j2
3. The error occurs

Actual results:

time="2023-02-23T02:09:55Z" level=debug msg="OpenShift Installer 4.13.0-0.nightly-2023-02-22-192922"
time="2023-02-23T02:09:55Z" level=debug msg="Built from commit 50e40714db6d5b3adddb25353112ada8058a469f"
time="2023-02-23T02:09:55Z" level=debug msg="Fetching Master Machines..."
time="2023-02-23T02:09:55Z" level=debug msg="Loading Master Machines..."
time="2023-02-23T02:09:55Z" level=debug msg="  Loading Cluster ID..."
time="2023-02-23T02:09:55Z" level=debug msg="    Loading Install Config..."
time="2023-02-23T02:09:55Z" level=debug msg="      Loading SSH Key..."
time="2023-02-23T02:09:55Z" level=debug msg="      Loading Base Domain..."
time="2023-02-23T02:09:55Z" level=debug msg="        Loading Platform..."
time="2023-02-23T02:09:55Z" level=debug msg="      Loading Cluster Name..."
time="2023-02-23T02:09:55Z" level=debug msg="        Loading Base Domain..."
time="2023-02-23T02:09:55Z" level=debug msg="        Loading Platform..."
time="2023-02-23T02:09:55Z" level=debug msg="      Loading Networking..."
time="2023-02-23T02:09:55Z" level=debug msg="        Loading Platform..."
time="2023-02-23T02:09:55Z" level=debug msg="      Loading Pull Secret..."
time="2023-02-23T02:09:55Z" level=debug msg="      Loading Platform..."
time="2023-02-23T02:09:55Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: [platform.vsphere.failureDomains.topology.computeCluster: Required value: must specify a computeCluster, platform.vsphere.failureDomains.topology.resourcePool: Invalid value: \"//Resources\": full path of resource pool must be provided in format /<datacenter>/host/<cluster>/...]"

Expected results:

Deploy the cluster without error

Additional info:

Adding `cluster` to the install config (https://github.com/red-hat-storage/ocs-ci/pull/7165/files) seems to work around the issue.
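
The rejected value `//Resources` is what a path template yields when the datacenter and cluster are empty. The following is a minimal Python sketch of a check for the format quoted in the error message (`/<datacenter>/host/<cluster>/...`); the regular expression and the names `dc1` and `cluster1` are assumptions for illustration, not the installer's actual validation.

```python
import re

# Assumed reading of the format in the error message:
# /<datacenter>/host/<cluster>/<anything non-empty>
RESOURCE_POOL_RE = re.compile(r"^/[^/]+/host/[^/]+/.+$")


def is_full_resource_pool_path(path):
    """Return True if path looks like a full resource pool path."""
    return RESOURCE_POOL_RE.match(path) is not None


def default_resource_pool(datacenter, cluster):
    """Build the default resource pool path for a compute cluster."""
    return f"/{datacenter}/host/{cluster}/Resources"


rejected = is_full_resource_pool_path("//Resources")
accepted = is_full_resource_pool_path(default_resource_pool("dc1", "cluster1"))
```

Under these assumptions the value from the log is rejected, and a path built from a non-empty datacenter and cluster is accepted.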

Description of problem:

We discovered that we are shipping unnecessary RBAC in https://coreos.slack.com/archives/CC3CZCQHM/p1667571136730989 .

This RBAC was only used in 4.2 and 4.3 for making the switch from configMaps to leases in leader election, and we should remove it.

 

This is a follow-up to https://issues.redhat.com/browse/OCPBUGS-3283. The RBACs are not applied anymore, so we just need to remove the actual files from the repo. No behavioral change should occur with the file removal.

Version-Release number of selected component (if applicable):

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of the problem:

It takes the assisted-service a few minutes to notice that the cluster installation has completed.

This is because the cluster monitor default interval is 5 minutes.
In order for the cluster to get picked up by a shorter interval loop the assisted-service should update the "trigger_monitoring_timestamp" once the progress reaches 100% or when ingress CA is uploaded.
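
The idea can be sketched as follows: a short-interval loop only looks at clusters whose trigger timestamp was touched recently, so touching it at 100% progress gets the cluster picked up quickly. This is a minimal Python sketch under stated assumptions, not the assisted-service's (Go) code; the function names, the dictionary layout, and the five-minute window are hypothetical.

```python
from datetime import datetime, timedelta


def on_progress(cluster, percent, now):
    """Touch the trigger timestamp once installation progress reaches 100%."""
    if percent >= 100:
        cluster["trigger_monitoring_timestamp"] = now


def clusters_for_fast_loop(clusters, now, window=timedelta(minutes=5)):
    """Select clusters whose trigger timestamp falls within the recent window."""
    return [
        c["id"]
        for c in clusters
        if now - c["trigger_monitoring_timestamp"] <= window
    ]


now = datetime(2023, 1, 1, 12, 0, 0)
stale = now - timedelta(hours=1)
clusters = [
    {"id": "a", "trigger_monitoring_timestamp": stale},
    {"id": "b", "trigger_monitoring_timestamp": stale},
]

before = clusters_for_fast_loop(clusters, now)
on_progress(clusters[0], 100, now)  # cluster "a" finishes installing
after = clusters_for_fast_loop(clusters, now)
```

Before the update neither cluster is selected; afterwards only cluster `a` is, so its status can be refreshed without waiting for the default interval.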

How reproducible:

Sometimes

Steps to reproduce:

1. Install a cluster

2. Wait for the CVO to be available.

3. Wait for assisted-service to set the cluster status to installed

Actual results:

It takes the assisted service a few minutes to notice that the cluster is installed

Expected results:
I expected the service to update the cluster status to installed within a few seconds.

Please review the following PR: https://github.com/openshift/oauth-server/pull/114

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

IPI installation failed with master nodes being NotReady and CCM error "alicloud: unable to split instanceid and region from providerID".

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

Always

Steps to Reproduce:

1. Try an IPI installation on alibabacloud, with credentialsMode set to "Manual"
2.
3.

Actual results:

Installation failed.

Expected results:

Installation should succeed.

Additional info:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          34m     Unable to apply 4.12.0-0.nightly-2022-10-05-053337: an unknown error has occurred: MultipleErrors
$ 
$ oc get nodes
NAME                           STATUS     ROLES                  AGE   VERSION
jiwei-1012-02-9jkj4-master-0   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
jiwei-1012-02-9jkj4-master-1   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
jiwei-1012-02-9jkj4-master-2   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
$ 

CCM logs:
E1012 03:46:45.223137       1 node_controller.go:147] node-controller "msg"="fail to find ecs" "error"="cloud instance api fail, alicloud: unable to split instanceid and region from providerID, error unexpected providerID="  "providerId"="alicloud://"
E1012 03:46:45.223174       1 controller.go:317] controller/node-controller "msg"="Reconciler error" "error"="find ecs: cloud instance api fail, alicloud: unable to split instanceid and region from providerID, error unexpected providerID=" "name"="jiwei-1012-02-9jkj4-master-0" "namespace"="" 
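
The log shows the CCM receiving a providerID with nothing after the `alicloud://` scheme. The following is a minimal Python sketch of the kind of split that fails here, assuming a providerID of the form `alicloud://<region>.<instance-id>`; that format, the function name, and the sample values are assumptions for illustration, not the cloud provider's actual code.

```python
PREFIX = "alicloud://"


def split_provider_id(provider_id):
    """Split an assumed 'alicloud://<region>.<instance-id>' into its parts."""
    body = provider_id[len(PREFIX):] if provider_id.startswith(PREFIX) else provider_id
    region, sep, instance_id = body.partition(".")
    if not sep or not region or not instance_id:
        raise ValueError(f"unexpected providerID={body!r}")
    return region, instance_id


# A well-formed (hypothetical) providerID splits cleanly.
parsed = split_provider_id("alicloud://us-east-1.i-abc123")

# The value from the log carries no region or instance ID at all.
try:
    split_provider_id("alicloud://")
    empty_failed = False
except ValueError:
    empty_failed = True
```

The empty providerID cannot be split, which matches the "unable to split instanceid and region" error in the log.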

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/145768/ (Finished: FAILURE)
10-12 10:55:15.987  ./openshift-install 4.12.0-0.nightly-2022-10-05-053337
10-12 10:55:15.987  built from commit 84aa8222b622dee71185a45f1e0ba038232b114a
10-12 10:55:15.987  release image registry.ci.openshift.org/ocp/release@sha256:41fe173061b00caebb16e2fd11bac19980d569cd933fdb4fab8351cdda14d58e
10-12 10:55:15.987  release architecture amd64

FYI the installation could succeed with 4.12.0-0.nightly-2022-09-28-204419:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/145756/ (Finished: SUCCESS)
10-12 09:59:19.914  ./openshift-install 4.12.0-0.nightly-2022-09-28-204419
10-12 09:59:19.914  built from commit 9eb0224926982cdd6cae53b872326292133e532d
10-12 09:59:19.914  release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
10-12 09:59:19.914  release architecture amd64

 

 

This is a clone of issue OCPBUGS-8203. The following is the description of the original issue:

When processing an install-config containing either BMC passwords in the baremetal platform config, or a vSphere password in the vsphere platform config, we log a warning message to say that the value is ignored.

This warning currently includes the value in the password field, which may be inconvenient for users reusing IPI configs who don't want their password values to appear in logs.
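
One way to avoid leaking the value is to redact it whenever the ignored field is a password. The following is a minimal Python sketch of that approach, not the installer's actual fix; the helper name, the set of sensitive field names, the `[redacted]` marker, and the sample field paths are hypothetical.

```python
# Leaf field names whose values must never reach the logs (assumed set).
SENSITIVE_LEAVES = {"password"}


def ignored_warning(field_path, value):
    """Format an 'is ignored' warning, hiding values of sensitive fields."""
    leaf = field_path.rsplit(".", 1)[-1].lower()
    shown = "[redacted]" if leaf in SENSITIVE_LEAVES else value
    return f"{field_path}: {shown} is ignored"


secret_warning = ignored_warning("Platform.VSphere.Password", "s3cret")
plain_warning = ignored_warning("Platform.Baremetal.ExternalBridge", "baremetal")
```

Non-sensitive fields keep their value in the warning, so the message stays useful for debugging, while password values are replaced by the marker.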

Description of problem:

For CRC, we provision the cluster first and then create a disk image from it, which is what we share with our users. Until now, we have always removed the pull secret from the cluster after provisioning it, using https://github.com/crc-org/snc/blob/master/snc.sh#L241-L258. This worked without any issue up to 4.11.x, but with 4.12.0-rc.1 we are seeing that the MCO is not able to reconcile.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a single node cluster using cluster bot `launch 4.12.0-rc.1 aws,single-node` 

2. Once the cluster is provisioned, update the pull secret from the config

```
$ cat pull-secret.yaml 
apiVersion: v1
data:
  .dockerconfigjson: e30K
kind: Secret
metadata:
  name: pull-secret
  namespace: openshift-config
type: kubernetes.io/dockerconfigjson
$ oc replace -f pull-secret.yaml
```

3. Wait for the MCO to reconcile; you will see that it fails to reconcile

Actual results:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-66086aa249a9f92b773403f7c3745ea4   False     True       True       1              0                   0                     1                      94m
worker   rendered-worker-0c07becff7d3c982e24257080cc2981b   True      False      False      0              0                   0                     0                      94m


$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.12.0-rc.1   True        False         True       93m     Failed to resync 4.12.0-rc.1 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 0)]

$ oc logs machine-config-daemon-nf9mg -n openshift-machine-config-operator
[...]
I1123 15:00:37.864581   10194 run.go:19] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba
Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: (Mirrors also failed: [quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
W1123 15:00:39.186103   10194 run.go:45] podman failed: running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba failed: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: (Mirrors also failed: [quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
: exit status 125; retrying...

Expected results:

 

Additional info:

 

Description of the problem:

When invoking V2GetClusterInstallConfig API (/v2/clusters/{cluster_id}/install-config) the returned config doesn't include network lists (in 'networking' key).
I.e. these lists are missing: clusterNetwork/machineNetwork/serviceNetwork

The issue is in how the cluster is fetched from the DB: it is fetched without eager loading.
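
The difference between the two fetch modes can be illustrated with a toy schema. This is a minimal Python sketch using an in-memory SQLite database, not the assisted-service's data layer; the table names, column names, and the `eager` flag are hypothetical.

```python
import sqlite3


def make_db():
    """Create a toy schema with one cluster and one related network row."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE clusters (id TEXT PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE cluster_networks (cluster_id TEXT, cidr TEXT)")
    db.execute("INSERT INTO clusters VALUES ('c1', 'demo')")
    db.execute("INSERT INTO cluster_networks VALUES ('c1', '10.128.0.0/14')")
    return db


def get_cluster(db, cluster_id, eager=False):
    """Fetch a cluster; related networks are loaded only when eager is True."""
    row = db.execute(
        "SELECT id, name FROM clusters WHERE id = ?", (cluster_id,)
    ).fetchone()
    cluster = {"id": row[0], "name": row[1], "cluster_networks": []}
    if eager:
        rows = db.execute(
            "SELECT cidr FROM cluster_networks WHERE cluster_id = ?", (cluster_id,)
        ).fetchall()
        cluster["cluster_networks"] = [r[0] for r in rows]
    return cluster


db = make_db()
without_eager = get_cluster(db, "c1")
with_eager = get_cluster(db, "c1", eager=True)
```

Without eager loading, the related list stays empty even though rows exist, which is the same shape as the missing network lists in the returned install config.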

How reproducible:

100%

Steps to reproduce:

1. Invoke '/v2/clusters/{cluster_id}/install-config'

Actual results:

Missing network lists.

Expected results:

Network lists should be populated (clusterNetwork/machineNetwork/serviceNetwork).

Description of problem:

library-go should use Lease for leader election by default. 
In 4.10 we switched from configmaps to configmapsleases; now we can switch to leases.

Change library-go to use lease by default; we already have an open PR for that: https://github.com/openshift/library-go/pull/1448 

Once the PR merges, we should revendor library-go for:
- kas operator
- oas operator
- etcd operator
- kcm operator
- openshift controller manager operator
- scheduler operator
- auth operator
- cluster policy controller
 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The TargetDown alert fired when it should not have.
Prometheus endpoints are not always properly unregistered, so the alert treats some Kube service endpoints as down.

Version-Release number of selected component (if applicable):

The problem has always been there.

How reproducible:

Not reproducible.
Most of the time Prometheus endpoints are properly unregistered.
The aim here is to make the TargetDown Prometheus expression more resilient; this can be tested on past metrics data in which the unregistration issue was encountered.

Steps to Reproduce:

N/A

Actual results:

TargetDown alert triggered while Kube service endpoints are all up & running.

Expected results:

TargetDown alert should not have been triggered.

This is a clone of issue OCPBUGS-1684. The following is the description of the original issue:

Description of problem:

After an upgrade from 4.9 to 4.10, the collect+ process causes CPU bursts of 5-6 seconds regularly, every 15 minutes. During each burst, collect+ consumes 100% CPU.

Top Command Dump Sample:
top - 07:00:04 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.3 us,  4.5 sy,  0.0 ni, 80.8 id,  7.4 wa,  0.8 hi,  0.3 si,  0.0 st
MiB Mem :  32151.9 total,  22601.4 free,   2182.1 used,   7368.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29420.7 avail Mem     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2009 root      20   0 3741252 172136  71396 S  12.9   0.5  36:42.79 kubelet
   1954 root      20   0 2663680 130928  46156 S   7.9   0.4   6:50.44 crio
   9440 root      20   0 1633728 546036  60836 S   7.9   1.7  21:06.80 fluentd
      1 root      20   0  238416  15412   8968 S   5.9   0.0   1:56.73 systemd
   1353 800       10 -10  796808 165380  40916 S   5.0   0.5   2:32.11 ovs-vsw+
   5454 root      20   0 1729112  73680  37404 S   2.0   0.2   3:52.21 coredns
1061248 1000360+  20   0 1113524  24304  17776 S   2.0   0.1   0:00.03 collect+
    306 root       0 -20       0      0      0 I   1.0   0.0   0:00.37 kworker+
    957 root      20   0  264076 126280 119596 S   1.0   0.4   0:06.80 systemd+
   1114 dbus      20   0   83188   6224   5140 S   1.0   0.0   0:04.30 dbus-da+
   5710 root      20   0  406004  31384  15068 S   1.0   0.1   0:04.11 tuned
   6198 nobody    20   0 1632272  46588  20516 S   1.0   0.1   0:17.60 network+
1061291 1000650+  20   0   11896   2748   2496 S   1.0   0.0   0:00.01 bash
1061355 1000650+  20   0   11896   2868   2616 S   1.0   0.0   0:00.01 bashtop - 07:00:05 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 248 total,   2 running, 245 sleeping,   0 stopped,   1 zombie
%Cpu(s): 11.4 us,  2.0 sy,  0.0 ni, 81.5 id,  4.2 wa,  0.6 hi,  0.2 si,  0.0 st
MiB Mem :  32151.9 total,  22601.4 free,   2182.1 used,   7368.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29420.7 avail Mem     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  36464  21300 S  74.3   0.1   0:00.78 collect+
   9440 root      20   0 1633728 545412  60900 S  11.9   1.7  21:06.92 fluentd
   2009 root      20   0 3741252 172396  71396 S   4.0   0.5  36:42.83 kubelet
      1 root      20   0  238416  15412   8968 S   1.0   0.0   1:56.74 systemd
    300 root       0 -20       0      0      0 I   1.0   0.0   0:00.46 kworker+
   1427 root      20   0   19656   2204   2064 S   1.0   0.0   0:01.55 agetty
   2419 root      20   0 1714748  38812  22884 S   1.0   0.1   0:24.42 coredns+
   2528 root      20   0 1634680  36464  20628 S   1.0   0.1   0:22.01 dynkeep+
1009372 root      20   0       0      0      0 I   1.0   0.0   0:00.42 kworker+
1053353 root      20   0   50200   4012   3292 R   1.0   0.0   0:01.56 top

top - 07:00:06 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.3 us,  1.5 sy,  0.0 ni, 82.7 id,  0.1 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  32151.9 total,  22595.9 free,   2185.7 used,   7370.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29416.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  35740  21428 S  99.0   0.1   0:01.78 collect+
   2009 root      20   0 3741252 172396  71396 S   3.0   0.5  36:42.86 kubelet
   9440 root      20   0 1633728 545076  60900 S   2.0   1.7  21:06.94 fluentd
   1353 800       10 -10  796808 165380  40916 S   1.0   0.5   2:32.12 ovs-vsw+
   1954 root      20   0 2663680 131452  46156 S   1.0   0.4   6:50.45 crio

top - 07:00:07 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.7 us,  1.1 sy,  0.0 ni, 83.6 id,  0.1 wa,  0.4 hi,  0.1 si,  0.0 st
MiB Mem :  32151.9 total,  22595.9 free,   2185.7 used,   7370.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29416.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  35236  21492 S 102.0   0.1   0:02.80 collect+
   2009 root      20   0 3741252 172660  71396 S   7.0   0.5  36:42.93 kubelet
   3288 nobody    20   0  718964  30648  11680 S   3.0   0.1   3:36.84 node_ex+
      1 root      20   0  238416  15412   8968 S   1.0   0.0   1:56.75 systemd
   1353 800       10 -10  796808 165380  40916 S   1.0   0.5   2:32.13 ovs-vsw+
   1954 root      20   0 2663680 131452  46156 S   1.0   0.4   6:50.46 crio
   5454 root      20   0 1729112  73680  37404 S   1.0   0.2   3:52.22 coredns
   9440 root      20   0 1633728 545080  60900 S   1.0   1.7  21:06.95 fluentd
1053353 root      20   0   50200   4012   3292 R   1.0   0.0   0:01.57 top

top - 07:00:08 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   2 running, 245 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.2 us,  0.9 sy,  0.0 ni, 84.5 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  32151.9 total,  22595.9 free,   2185.7 used,   7370.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29416.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  35164  21492 S 100.0   0.1   0:03.81 collect+
   2009 root      20   0 3741252 172660  71396 S   3.0   0.5  36:42.96 kubelet
1061543 1000650+  20   0   34564   9804   5772 R   3.0   0.0   0:00.03 python
   9440 root      20   0 1633728 543952  60900 S   2.0   1.7  21:06.97 fluentd
1053353 root      20   0   50200   4012   3292 R   2.0   0.0   0:01.59 top
   2330 root      20   0 1654612  61260  34720 S   1.0   0.2   0:55.81 coredns
   8023 root      20   0   12056   3044   2580 S   1.0   0.0   0:24.59 install+

top - 07:00:09 up 10:10,  0 users,  load average: 0.34, 0.27, 0.28
Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.9 us,  3.2 sy,  0.0 ni, 85.6 id,  1.5 wa,  0.5 hi,  0.2 si,  0.0 st
MiB Mem :  32151.9 total,  22621.0 free,   2160.5 used,   7370.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29441.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2009 root      20   0 3741252 172660  71396 S   5.0   0.5  36:43.01 kubelet
   9440 root      20   0 1633728 542684  60900 S   4.0   1.6  21:07.01 fluentd
   1353 800       10 -10  796808 165380  40916 S   2.0   0.5   2:32.15 ovs-vsw+
      1 root      20   0  238416  15412   8968 S   1.0   0.0   1:56.76 systemd
   1954 root      20   0 2663680 131452  46156 S   1.0   0.4   6:50.47 crio
   5454 root      20   0 1729112  73680  37404 S   1.0   0.2   3:52.23 coredns
   6198 nobody    20   0 1632272  45936  20516 S   1.0   0.1   0:17.61 network+
   7016 root      20   0   12052   3204   2736 S   1.0   0.0   0:24.19 install+
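
A dump like the one above can be scanned for spikes automatically. The following is a minimal sketch, assuming batch-mode `top` output with the default 12-column process rows; the function name `cpu_spikes`, the threshold, and the shortened sample are illustrative and not part of the original report.

```python
def cpu_spikes(top_output, threshold):
    """Return (pid, command, cpu) for each process row at or above threshold."""
    spikes = []
    for line in top_output.splitlines():
        fields = line.split()
        # Process rows have 12 columns and start with a numeric PID;
        # summary and header lines are skipped.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        cpu = float(fields[8])  # %CPU is the ninth column
        if cpu >= threshold:
            spikes.append((int(fields[0]), fields[11], cpu))
    return spikes


# Shortened sample taken from the 07:00:06 iteration of the dump.
SAMPLE = """\
top - 07:00:06 up 10:10,  0 users,  load average: 0.20, 0.24, 0.27
Tasks: 247 total,   1 running, 246 sleeping,   0 stopped,   0 zombie
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1061248 1000360+  20   0 1484936  35740  21428 S  99.0   0.1   0:01.78 collect+
   2009 root      20   0 3741252 172396  71396 S   3.0   0.5  36:42.86 kubelet
"""

for pid, command, cpu in cpu_spikes(SAMPLE, 50.0):
    print(pid, command, cpu)
```

With a threshold of 50, only the `collect+` process is reported for this sample.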

Version-Release number of selected component (if applicable):

 

How reproducible:

Lab environment does not present same behavior.

Steps to Reproduce:

1.
2.
3.

Actual results:

Regular high CPU spikes

Expected results:

No CPU spikes

Additional info:

Provided logs:
1-) top command dump uploaded to SF case 03317387
2-) must-gather uploaded to SF case 03317387

 

Description of problem:

CVO recently introduced a new precondition, RecommendedUpdate [1]. When we request an upgrade to a version that is not an available update, the precondition reports UnknownUpdate and blocks the upgrade.

# oc get clusterversion/version -ojson | jq -r '.status.availableUpdates'
null

# oc get clusterversion/version -ojson | jq -r '.status.conditions[]|select(.type == "ReleaseAccepted")'
{
  "lastTransitionTime": "2022-10-20T08:16:59Z",
  "message": "Preconditions failed for payload loaded version=\"4.12.0-0.nightly-multi-2022-10-18-153953\" image=\"quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695\": Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"UnknownUpdate\": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.0-0.nightly-multi-2022-10-18-091108 to 4.12.0-0.nightly-multi-2022-10-18-153953 is unknown.",
  "reason": "PreconditionChecks",
  "status": "False",
  "type": "ReleaseAccepted"
}


[1]https://github.com/openshift/cluster-version-operator/pull/841/
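
The jq query above can also be scripted. This is a minimal sketch, assuming the ClusterVersion object has already been loaded as a dictionary (for example from the JSON printed by `oc get clusterversion/version -ojson`); the function name `release_accepted_failure` is hypothetical, and the sample message is abbreviated for illustration.

```python
def release_accepted_failure(clusterversion):
    """Return (reason, message) if ReleaseAccepted is False, else None."""
    conditions = (clusterversion.get("status") or {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == "ReleaseAccepted" and cond.get("status") == "False":
            return cond.get("reason"), cond.get("message")
    return None


# Abbreviated sample shaped like the output above.
sample = {
    "status": {
        "availableUpdates": None,
        "conditions": [
            {
                "type": "ReleaseAccepted",
                "status": "False",
                "reason": "PreconditionChecks",
                "message": 'Precondition "ClusterVersionRecommendedUpdate" '
                           'failed because of "UnknownUpdate"',
            }
        ],
    }
}

failure = release_accepted_failure(sample)
if failure is not None:
    print("upgrade blocked:", failure[0], "-", failure[1])
```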

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-multi-2022-10-18-091108

How reproducible:

Always

Steps to Reproduce:

1. Install a 4.12 cluster
2. Upgrade to a version which is not among the available updates
# oc adm upgrade --allow-explicit-upgrade --to-image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695
warning: The requested upgrade image is not one of the available updates.
You have used --allow-explicit-upgrade for the update to proceed anyway
Requesting update to release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695

Actual results:

CVO precondition check fails and blocks upgrade

Expected results:

Upgrade proceeds

Additional info:

 

Description of problem:

The customer can no longer provision new baremetal nodes in 4.10.35 using the same rootDeviceHints that were used in 4.10.10.
The customer uses HP DL360 Gen10 servers with external SAN storage that the system sees as a multipath device. Recent IPA versions include changes to avoid wiping shared disks, and this seems to affect what should be provided as rootDeviceHints.
They used to set /dev/sda as rootDeviceHints. In 4.10.35 this no longer makes the IPA write the image to the disk, because it sees the disk as part of a multipath device. We tried using the multipath device on top of it, /dev/dm-0. The system is then able to write the image to the disk, but it gets stuck when it tries to issue a partprobe command. Rebooting the system to boot from the disk does not seem to help complete the provisioning. No workaround so far.

 

Version-Release number of selected component (if applicable):

 

How reproducible:

By trying to provision a baremetal node with a multipath device.

Steps to Reproduce:

1. Create a new BMH using a multipath device as rootDeviceHints
2.
3.

Actual results:

The node does not get provisioned

Expected results:

the node gets provisioned correctly

Additional info:

 

Description of problem:

On clusters serving Route via CRD (i.e. MicroShift), .spec.host values are not automatically assigned during Route creation, as they are on OCP.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
EOF

route.route.openshift.io/hello-microshift serverside-applied

$ oc get route hello-microshift -o yaml

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  creationTimestamp: "2022-11-11T23:53:33Z"
  generation: 1
  name: hello-microshift
  namespace: default
  resourceVersion: "2659"
  uid: cd35cd20-b3fd-4d50-9912-f34b3935acfd
spec:
  host: hello-microshift-default.cluster.local
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: None
 

Expected results:

...
metadata:
  annotations:
    openshift.io/host.generated: "true"
...
spec:
  host: hello-microshift-default.foo.bar.baz
...

Actual results:

Host and host.generated annotation are missing.

Additional info:

** This change will be inert on OCP, which already has the correct behavior. **
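
The expected defaulting can be sketched as follows. This only mirrors the `<name>-<namespace>.<domain>` pattern visible in the example above; the function name `assign_default_host` and the way the domain is passed in are assumptions for illustration, not the actual route allocation code.

```python
import copy


def assign_default_host(route, domain):
    """Return a copy of the route with spec.host filled in when it is empty."""
    result = copy.deepcopy(route)
    spec = result.setdefault("spec", {})
    if spec.get("host"):
        return result  # a user-specified host is left untouched
    meta = result.setdefault("metadata", {})
    name = meta["name"]
    namespace = meta.get("namespace", "default")
    spec["host"] = f"{name}-{namespace}.{domain}"
    # Mark the host as generated, as in the expected results above.
    meta.setdefault("annotations", {})["openshift.io/host.generated"] = "true"
    return result


route = {
    "apiVersion": "route.openshift.io/v1",
    "kind": "Route",
    "metadata": {"name": "hello-microshift", "namespace": "default"},
    "spec": {"to": {"kind": "Service", "name": "hello-microshift"}},
}

defaulted = assign_default_host(route, "foo.bar.baz")
print(defaulted["spec"]["host"])
```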

Description of problem:

Installer fails to install 4.12.0-rc.0 on VMware IPI with the script that worked with prior OCP versions.
The error happens during the Terraform prepare step while gathering information in the "Platform Provisioning Check". It looks like a permission issue, but we're using the vCenter administrator account. I double-checked, and that account has all the necessary permissions.

Version-Release number of selected component (if applicable):

OCP installer 4.12.0-rc.0
vSphere & vCenter 7.0.3 - no pending updates

How reproducible:

always - we observed this already in the nightlies, but wanted to wait for a RC to confirm

Steps to Reproduce:

1. Try to install using the openshift-install binary

Actual results:

Fails during the preparation step

Expected results:

Installs the cluster ;)

Additional info:

This runs in our CI/CD pipeline; let me know if you need access to the full run log:
https://gitlab.consulting.redhat.com/cblum/storage-ocs-lab/-/jobs/219304

This includes the install-config.yaml, all component versions and the full debug log output

Because of recent changes to the MCO's Makefile, failing tests were not causing the CI jobs to fail. In its current state, this is problematic as code which fails tests could get merged. Slightly related to this, the e2e-gcp-op tests are continually timing out. Until we can get to https://issues.redhat.com/browse/MCO-160 to speed them up, we should increase the timeout for those tests.

Description of problem:

OLM has a dependency on openshift/cluster-policy-controller. This project had dependencies with v0.0.0 versions, which due to a bug in ART was causing issues building the olm image. To fix this, we have to update the dependencies in the cluster-policy-controller project to point to actual versions.

This was already done:
 * https://github.com/openshift/cluster-policy-controller/pull/103
 * https://github.com/openshift/cluster-policy-controller/pull/101

And these changes already made it to 4.14 and 4.13 branches of the cluster-policy-controller.

The backport to 4.12 is: https://github.com/openshift/cluster-policy-controller/pull/102

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-7262. The following is the description of the original issue:

When the UI is active in the console, event messages that are generated distort the interface and make it difficult for the user to view the configuration and select options. An example is shown in the attached screenshot.

Description of problem
`oc-mirror` does not work as expected with relative path for OCI format copy

How reproducible:
always

Steps to Reproduce:
Copy the operator image in OCI format to localhost, using the relative path my-oci-catalog:
cat imageset-copy.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
    packages:
    - name: aws-load-balancer-operator

`oc mirror --config imageset-copy.yaml --use-oci-feature --oci-feature-action=copy oci://my-oci-catalog`

Actual results:
2. A directory named my-file-catalog is created, and the user-specified directory my-oci-catalog is not used:
ls -tl
total 20
drwxr-xr-x. 3 root root 4096 Dec 6 13:58 oc-mirror-workspace
drwxr-xr-x. 3 root root 4096 Dec 6 13:58 olm_artifacts
drwxr-x---. 3 root root 4096 Dec 6 13:58 my-file-catalog
drwxr-xr-x. 2 root root 4096 Dec 6 13:58 my-oci-catalog
-rw-r--r--. 1 root root 206 Dec 6 12:39 imageset-copy.yaml

Expected results:
2. The user-specified directory is used.

Additional info:
`oc-mirror --config config-operator.yaml oci:///home/ocmirrortest/noo --use-oci-feature --oci-feature-action=copy` with a full path works well.

Description of problem:

When a user logs in for the first time and ACM has already been installed, the initial perspective is not "All Clusters" for fleet management.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Log in as a new user after installing ACM

Actual results:

Conventional OpenShift console dashboard is displayed

Expected results:

All Clusters perspective should be displayed

Additional info:

Affects 4.11, 4.12

This ticket is linked with

https://issues.redhat.com/browse/SDA-8177
https://issues.redhat.com/browse/SDA-8178

As a summary, a base domain for a hosted cluster may already contain the "cluster-name".

But it seems that Hypershift also encodes it during some reconciliation step:

https://github.com/openshift/hypershift/blob/main/support/globalconfig/dns.go#L20

Then when using a DNS base domain like:

"rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

we will have A records like:

"*.apps.lponce-prod-01.rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

The expected behaviour would be that given a DNS base domain:

"rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

The resulting wildcard for Ingress would be:

"*.apps.rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

Note that trying to configure a specific IngressSpec for a hosted cluster didn't work for our case, as the wildcards records are not created.
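
The difference between the two behaviours can be sketched as below. The sketch assumes the hosted cluster is named "lponce-prod-01" (inferred from the records above); both function names are hypothetical and do not reflect the actual Hypershift code.

```python
def wildcard_with_cluster_name(cluster_name, base_domain):
    """Observed behaviour: the cluster name is prepended to the base domain."""
    return f"*.apps.{cluster_name}.{base_domain}"


def wildcard_from_base_domain(base_domain):
    """Expected behaviour: the base domain is used as given."""
    return f"*.apps.{base_domain}"


base_domain = "rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

print(wildcard_with_cluster_name("lponce-prod-01", base_domain))
print(wildcard_from_base_domain(base_domain))
```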

Description of problem:

Right border radius is 0 for the pipeline visualization wrapper in dark mode but looks fine in light mode

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Switch the theme to dark mode
2. Create a pipeline and navigate to the Pipeline details page

Actual results:

Right border radius is 0, see the screenshots

Expected results:

Right border radius should be same as left border radius.

Additional info:

 

Description of problem:

When we introduced aarch64 support for IPI on Azure, we changed the Installer from using managed images (no architecture support) to using Image Galleries (architecture support). This means that the place where the Installer looks for rhcos bootimages has changed from "/resourceGroups/$rg_name/providers/Microsoft.Compute/images/$cluster_id" to "/resourceGroups/$rg_name/providers/Microsoft.Compute/galleries/gallery_$cluster_id/images/$cluster_id/versions/$rhcos_version".
This has been properly handled in the IPI workflow, with changes to the terraform configs [1]. However, our ARM template for UPI installs [2] still uploads images via Managed Images and therefore breaks workflows provisioning compute nodes with MAO.

[1] https://github.com/openshift/installer/pull/6304
[2] https://github.com/openshift/installer/blob/release-4.12/upi/azure/02_storage.json
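
To make the two lookup locations concrete, here is a minimal sketch that builds both resource paths from the patterns quoted above. The function names are hypothetical, the subscription prefix is omitted, and the sample values are taken from the error message reported in this bug.

```python
def managed_image_path(rg_name, image_name):
    """Path used by Managed Images (the old location)."""
    return (f"/resourceGroups/{rg_name}/providers/Microsoft.Compute"
            f"/images/{image_name}")


def gallery_image_path(rg_name, gallery_name, image_name, version):
    """Path used by Image Galleries (the new location)."""
    return (f"/resourceGroups/{rg_name}/providers/Microsoft.Compute"
            f"/galleries/{gallery_name}/images/{image_name}/versions/{version}")


rg = "maxu-upi2-gc7n8-rg"
image = "maxu-upi2-gc7n8-gen2"

# Where the UPI ARM template uploads the image today.
print(managed_image_path(rg, image))
# Where MAO looks for the image when creating the VM.
print(gallery_image_path(rg, "gallery_maxu_upi2_gc7n8", image, "412.86.20220930"))
```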

Version-Release number of selected component (if applicable):

4.13 and 4.12

How reproducible:

always

Steps to Reproduce:

Any workflow that provisions compute nodes with MAO. For example, in the UPI deploy with ARM templates:
1. Execute 06_workers.json template with compute.replicas: 0 in the install-config, then run the oc scale command to "activate" MAO provision (`oc scale --replicas=1 machineset $machineset_name -n openshift-machine-api`)
2. Skip 06_workers.json but set compute.replicas: 3 in the install-config. MAO will provision nodes as part of the cluster deploy.

Actual results:

Error Message:           failed to reconcile machine 
"maxu-upi2-gc7n8-worker-eastus3-68gdx": failed to create vm 
maxu-upi2-gc7n8-worker-eastus3-68gdx: failure sending request for 
machine maxu-upi2-gc7n8-worker-eastus3-68gdx: cannot create vm: 
compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: 
StatusCode=404 -- Original Error: Code="GalleryImageNotFound" 
Message=""The gallery image 
/subscriptions/53b8f551-.../resourceGroups/maxu-upi2-gc7n8-rg/providers/Microsoft.Compute/galleries/gallery_maxu_upi2_gc7n8/images/maxu-upi2-gc7n8-gen2/versions/412.86.20220930
 is not available in eastus region. Please contact image owner to 
replicate to this region, or change your requested region."" 
Target="imageReference"

But the image can be found at:
/subscriptions/53b8f551-.../resourceGroups/maxu-upi2-gc7n8-rg/providers/Microsoft.Compute/images/maxu-upi2-gc7n8-gen2

Expected results:

No errors and the bootimage is loaded from the Image Gallery.

Additional info:

02_storage.json template will have to be rewritten to use Image Gallery instead of Managed Images.

The must-gather pod should always be deleted after must-gather runs

Description of problem:
After running `oc adm must-gather` with --run-namespace, the must-gather pod is left behind. It should always be removed.

Oc
h/h

Version :
oc version --client
Client Version: 4.13.0-0.nightly-2022-12-18-222329

How Reproducible:
Always

Steps to reproduce:
Run the command :
`oc adm must-gather --run-namespace='openshift-multus'`

Actual result:
When the command completes, the pod is left behind.
oc get pod -A |grep must-gather
openshift-multus must-gather-pcb6g 1/2 NotReady 0 30m

Expected result:
The must-gather pod should always be deleted when the job completes.

This is a clone of issue OCPBUGS-10032. The following is the description of the original issue:

Description of problem:

test "operator conditions control-plane-machine-set" fails https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216
The control-plane-machine-set operator is Unavailable because it doesn't reconcile on node events. If a node becomes ready later than the referencing Machine, the Node update event does not trigger reconciliation.

Version-Release number of selected component (if applicable):

 

How reproducible:

depends on the sequence of Node vs Machine events

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

operator logs 
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-machine-api_control-plane-machine-set-operator-5d5848c465-g4q2p_control-plane-machine-set-operator.log

machines 
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/machines.json

nodes 
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1574/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade/1634410710559625216/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json

Description of problem:

deletes operand: Using OLM descriptor components deletes operand 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

Search: https://search.ci.openshift.org/?search=Using+OLM+descriptor+components+deletes+operand&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Prow: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/12422/pull-ci-openshift-console-master-e2e-gcp-console/1614020742284840960

Artifacts: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/12422/pull-ci-openshift-console-master-e2e-gcp-console/1614020742284840960/artifacts/e2e-gcp-console/test/artifacts/gui_test_screenshots/cypress/screenshots/descriptors.spec.ts/1_Using%20OLM%20descriptor%20components%20--%20deletes%20operand%20(failed).png 

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-11636. The following is the description of the original issue:

Description of problem:

ACLs are disabled for all newly created S3 buckets. This causes all OCP installs to fail because the bootstrap ignition cannot be uploaded:

level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {


Version-Release number of selected component (if applicable):

4.11+
 

How reproducible:

Always
 

Steps to Reproduce:

1. Create a cluster via IPI

Actual results:

The install fails
 

Expected results:

The install succeeds
 

Additional info:

Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023 - https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-ownership-error-responses.html - After you apply the bucket owner enforced setting for Object Ownership, ACLs are disabled.

 

Description of problem:
This is a follow-up to OCPBUGS-2579, where Prabhu fixed a similar issue for the catalog items "Helm Charts" and "Samples", and to OCPBUGS-4012, where Jai fixed this for the Serverless actions "Event Sink", "Event Source", "Channel" and "Broker".

But one leftover "Event Source" is still shown when dragging and dropping the arrow from a Knative Service.

Version-Release number of selected component (if applicable):
4.13, earlier versions have the same issue

How reproducible:
Always

Steps to Reproduce:
1. Install Serverless operator and create Eventing and Serving resources
2. Import an application (Developer perspective > add > container image) and create a Knative Service
3. Open customization (path /cluster-configuration/) and disable all add actions
4. Wait some seconds and check that the Developer perspective > Add page shows no items
5. Navigate to topology perspective and drag and drop the arrow from a Knative Service to an empty area.

Actual results:
"Event Source" is shown

Expected results:
"Event Source" should not be shown

Additional info:
Follow up on OCPBUGS-2579 and OCPBUGS-4012

Description of problem:

Machines associated with the control plane are in a "failed" state.  This issue appeared to crop up on or around the time of nightly 4.13.0-0.nightly-2023-02-13-235211.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-13-235211 and later

How reproducible:

consistently in CI

Steps to Reproduce:

1.
2.
3.

Actual results:

Machines are in a failed state

Expected results:

Machines should not be in a failed state

Additional info:

search.ci https://search.ci.openshift.org/?search=Managed+cluster+should+have+machine+resources&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Job history for 4.13-e2e-vsphere-ovn-upi: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-ovn-upi

This is a clone of issue OCPBUGS-13549. The following is the description of the original issue:

Description of problem:

Incorrect AWS ARN [1] is used for GovCloud and AWS China regions, which will cause the command `ccoctl aws create-all` to fail:

Failed to create Identity provider: failed to apply public access policy to the bucket ci-op-bb5dgq54-77753-oidc: MalformedPolicy: Policy has invalid resource
	status code: 400, request id: VNBZ3NYDH6YXWFZ3, host id: pHF8v7C3vr9YJdD9HWamFmRbMaOPRbHSNIDaXUuUyrgy0gKCO9DDFU/Xy8ZPmY2LCjfLQnUDmtQ=

Correct AWS ARN prefix:
GovCloud (us-gov-east-1 and us-gov-west-1): arn:aws-us-gov
AWS China (cn-north-1 and cn-northwest-1): arn:aws-cn

[1] https://github.com/openshift/cloud-credential-operator/pull/526/files#diff-1909afc64595b92551779d9be99de733f8b694cfb6e599e49454b380afc58876R211
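
The prefix selection described above can be sketched as follows. This is an illustration only, not the ccoctl implementation: the function names are hypothetical, and matching on the region name prefix is an assumption that covers the four regions listed above.

```python
def arn_prefix(region):
    """Pick the ARN prefix (including partition) for an AWS region."""
    if region.startswith("us-gov-"):
        return "arn:aws-us-gov"  # GovCloud regions
    if region.startswith("cn-"):
        return "arn:aws-cn"      # AWS China regions
    return "arn:aws"             # standard commercial regions


def bucket_objects_arn(region, bucket):
    """Build the ARN that matches every object in an S3 bucket."""
    return f"{arn_prefix(region)}:s3:::{bucket}/*"


for region in ("us-east-1", "us-gov-west-1", "cn-north-1"):
    print(region, bucket_objects_arn(region, "example-oidc"))
```

Using the `arn:aws` prefix in a bucket policy for a GovCloud or China region is what produces the MalformedPolicy error shown above.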


 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-05-11-024616

How reproducible:

Always
 

Steps to Reproduce:

1. Run command: `ccoctl aws create-all --name="${infra_name}" --region="${REGION}" --credentials-requests-dir="/tmp/credrequests" --output-dir="/tmp"` on GovCloud regions
2.
3.

Actual results:

Failed to create Identity provider
 

Expected results:

Create resources successfully.
 

Additional info:

Related PRs:
4.10: https://github.com/openshift/cloud-credential-operator/pull/531
4.11: https://github.com/openshift/cloud-credential-operator/pull/530
4.12: https://github.com/openshift/cloud-credential-operator/pull/529
4.13: https://github.com/openshift/cloud-credential-operator/pull/528
4.14: https://github.com/openshift/cloud-credential-operator/pull/526
 

Description of problem:

Customer reported that after upgrading from 4.10.32 to 4.10.33 the image registry operator is panicking. After some investigation it is clear that the panic is coming from the check for .spec.suspend.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

We are observing[1] intermittent flakiness of the `deploy clusterDeployment with agent and invalid ignition config` subsystem test with failure

 [kube-api]cluster installation
/assisted-service/subsystem/kubeapi_test.go:819
  deploy clusterDeployment with agent and invalid ignition config [It]
  /assisted-service/subsystem/kubeapi_test.go:1715
  Timed out after 120.000s.
              
  Expected
      <string>: UnapprovedAgents
  to equal
      <string>: UnsyncedAgents
  /assisted-service/subsystem/kubeapi_test.go:483 

Around the time that feature[2] was introduced we were fixing some race conditions in the KubeAPI, but it looks like this code path is still prone to intermittently fail.

It does not happen very frequently, but we still observe the issue every now and then so it is worth looking into it.

This is a clone of issue OCPBUGS-11112. The following is the description of the original issue:

Description of problem: The openshift-manila-csi-driver namespace should have the "workload.openshift.io/allowed=management" label.

This is currently not the case:

❯ oc describe ns openshift-manila-csi-driver  
Name:         openshift-manila-csi-driver
Labels:       kubernetes.io/metadata.name=openshift-manila-csi-driver
              pod-security.kubernetes.io/audit=privileged
              pod-security.kubernetes.io/enforce=privileged
              pod-security.kubernetes.io/warn=privileged
Annotations:  include.release.openshift.io/self-managed-high-availability: true
              openshift.io/node-selector: 
              openshift.io/sa.scc.mcs: s0:c24,c4
              openshift.io/sa.scc.supplemental-groups: 1000560000/10000
              openshift.io/sa.scc.uid-range: 1000560000/10000
Status:       Active

No resource quota.

No LimitRange resource.

It is causing CI jobs to fail with:

{  fail [github.com/openshift/origin/test/extended/cpu_partitioning/platform.go:82]: projects [openshift-manila-csi-driver] do not contain the annotation map[workload.openshift.io/allowed:management]
Expected
    <[]string | len:1, cap:1>: [
        "openshift-manila-csi-driver",
    ]
to be empty
Ginkgo exit error 1: exit with code 1}

For instance https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27831/pull-ci-openshift-origin-release-4.13-e2e-openstack-ovn/1641317874201006080.
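
The kind of check that fails here can be sketched as below. This is not the origin test code: the function name `namespaces_missing_annotation` and the sample namespace objects are made up for illustration, and the sketch assumes namespaces are available as dictionaries in the usual Kubernetes object shape.

```python
REQUIRED = {"workload.openshift.io/allowed": "management"}


def namespaces_missing_annotation(namespaces, required=REQUIRED):
    """Return the names of namespaces lacking any required annotation."""
    missing = []
    for ns in namespaces:
        annotations = ns.get("metadata", {}).get("annotations") or {}
        if any(annotations.get(key) != value for key, value in required.items()):
            missing.append(ns["metadata"]["name"])
    return missing


# Hypothetical input: one namespace without the annotation, one with it.
namespaces = [
    {"metadata": {"name": "openshift-manila-csi-driver",
                  "annotations": {"openshift.io/node-selector": ""}}},
    {"metadata": {"name": "openshift-example",
                  "annotations": {"workload.openshift.io/allowed": "management"}}},
]

print(namespaces_missing_annotation(namespaces))
```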

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

Deleting or adding a failureDomain in the CPMS to trigger an update does not work correctly on GCP

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2022-11-19-182111

How reproducible:

always

Steps to Reproduce:

1. Launch a 4.13 cluster on GCP
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-11-19-182111   True        False         80m     Cluster version is 4.13.0-0.nightly-2022-11-19-182111
liuhuali@Lius-MacBook-Pro huali-test % oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.huliu-gcp13c2.qe.gcp.devcluster.openshift.com:6443".
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                 PHASE     TYPE            REGION        ZONE            AGE
huliu-gcp13c2-6sh7k-master-0         Running   n2-standard-4   us-central1   us-central1-a   102m
huliu-gcp13c2-6sh7k-master-1         Running   n2-standard-4   us-central1   us-central1-b   102m
huliu-gcp13c2-6sh7k-master-2         Running   n2-standard-4   us-central1   us-central1-c   102m
huliu-gcp13c2-6sh7k-worker-a-8sftf   Running   n2-standard-4   us-central1   us-central1-a   99m
huliu-gcp13c2-6sh7k-worker-b-zb48r   Running   n2-standard-4   us-central1   us-central1-b   99m
huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running   n2-standard-4   us-central1   us-central1-c   99m
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
huliu-gcp13c2-6sh7k-worker-a   1         1         1       1           102m
huliu-gcp13c2-6sh7k-worker-b   1         1         1       1           102m
huliu-gcp13c2-6sh7k-worker-c   1         1         1       1           102m
huliu-gcp13c2-6sh7k-worker-f   0         0                             102m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE      AGE
cluster   3         3         3       3                       Inactive   99m

2. Edit the CPMS and change its state to Active
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3         3       3                       Active   100m 

3. Edit the CPMS. By default there are four failureDomains (us-central1-a, us-central1-b, us-central1-c, us-central1-f). Delete the first one (us-central1-a); the new machine gets stuck in Provisioning.

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                 PHASE          TYPE            REGION        ZONE            AGE
huliu-gcp13c2-6sh7k-master-0         Running        n2-standard-4   us-central1   us-central1-a   104m
huliu-gcp13c2-6sh7k-master-1         Running        n2-standard-4   us-central1   us-central1-b   104m
huliu-gcp13c2-6sh7k-master-2         Running        n2-standard-4   us-central1   us-central1-c   104m
huliu-gcp13c2-6sh7k-master-gb5b4-0   Provisioning                                                 3s
huliu-gcp13c2-6sh7k-worker-a-8sftf   Running        n2-standard-4   us-central1   us-central1-a   101m
huliu-gcp13c2-6sh7k-worker-b-zb48r   Running        n2-standard-4   us-central1   us-central1-b   101m
huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running        n2-standard-4   us-central1   us-central1-c   101m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                 PHASE          TYPE            REGION        ZONE            AGE
huliu-gcp13c2-6sh7k-master-0         Running        n2-standard-4   us-central1   us-central1-a   131m
huliu-gcp13c2-6sh7k-master-1         Running        n2-standard-4   us-central1   us-central1-b   131m
huliu-gcp13c2-6sh7k-master-2         Running        n2-standard-4   us-central1   us-central1-c   131m
huliu-gcp13c2-6sh7k-master-gb5b4-0   Provisioning   n2-standard-4   us-central1   us-central1-f   26m
huliu-gcp13c2-6sh7k-worker-a-8sftf   Running        n2-standard-4   us-central1   us-central1-a   127m
huliu-gcp13c2-6sh7k-worker-b-zb48r   Running        n2-standard-4   us-central1   us-central1-b   127m
huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running        n2-standard-4   us-central1   us-central1-c   127m

machine-controller log:
E1121 05:10:15.654929       1 actuator.go:53] huliu-gcp13c2-6sh7k-master-gb5b4-0 error: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound
E1121 05:10:15.655015       1 controller.go:315] huliu-gcp13c2-6sh7k-master-gb5b4-0: error updating machine: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound, retrying in 30s seconds
I1121 05:10:15.655829       1 recorder.go:103] events "msg"="huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-gcp13c2-6sh7k-master-gb5b4-0","uid":"008cbb45-2b29-493e-8985-37f87fe6a98d","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"60780"} "reason"="FailedUpdate" "type"="Warning" 

4. Edit the CPMS and add the failureDomain (us-central1-a) back; the machine gets stuck in Deleting.

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                 PHASE      TYPE            REGION        ZONE            AGE
huliu-gcp13c2-6sh7k-master-0         Running    n2-standard-4   us-central1   us-central1-a   3h37m
huliu-gcp13c2-6sh7k-master-1         Running    n2-standard-4   us-central1   us-central1-b   3h37m
huliu-gcp13c2-6sh7k-master-2         Running    n2-standard-4   us-central1   us-central1-c   3h37m
huliu-gcp13c2-6sh7k-master-gb5b4-0   Deleting   n2-standard-4   us-central1   us-central1-f   113m
huliu-gcp13c2-6sh7k-worker-a-8sftf   Running    n2-standard-4   us-central1   us-central1-a   3h34m
huliu-gcp13c2-6sh7k-worker-b-zb48r   Running    n2-standard-4   us-central1   us-central1-b   3h34m
huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running    n2-standard-4   us-central1   us-central1-c   3h34m

Actual results:

When a failureDomain is deleted, the new machine gets stuck in Provisioning; when the failureDomain is added back, the new machine gets stuck in Deleting.

Expected results:

When a failureDomain is deleted, the new machine should reach Running; when the failureDomain is added back, the new machine should be deleted successfully.
Alternatively, if the machine cannot be created in the failureDomain, the new machine should go to Failed when the failureDomain is deleted, and it should be deleted successfully when the failureDomain is added back.
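
The "stuck" states described above can be spotted mechanically. The following is a minimal Python sketch, not part of the original report, that parses `oc get machine` output and flags machines that have stayed in Provisioning or Deleting longer than a threshold. The function names (`parse_age`, `stuck_machines`) and the 20-minute threshold are assumptions chosen for illustration only.

```python
import re

def parse_age(age: str) -> int:
    """Convert an `oc get` AGE value such as '3h37m', '26m' or '3s' to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([dhms])", age))

def stuck_machines(table: str, threshold_s: int = 20 * 60) -> list:
    """Return names of machines in a transitional phase for longer than threshold_s."""
    stuck = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        # NAME is first, PHASE second, AGE last; TYPE/REGION/ZONE may be empty.
        name, phase, age = fields[0], fields[1], fields[-1]
        if phase in ("Provisioning", "Deleting") and parse_age(age) > threshold_s:
            stuck.append(name)
    return stuck

# Two rows copied from the transcript in step 3.
SAMPLE = """\
NAME                                 PHASE          TYPE            REGION        ZONE            AGE
huliu-gcp13c2-6sh7k-master-0         Running        n2-standard-4   us-central1   us-central1-a   131m
huliu-gcp13c2-6sh7k-master-gb5b4-0   Provisioning   n2-standard-4   us-central1   us-central1-f   26m
"""

print(stuck_machines(SAMPLE))
```

With the sample rows, only the replacement master in us-central1-f is reported, which matches the behaviour described in the report.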

Additional info:

Must-gather: 
https://drive.google.com/file/d/1AxnVwToQ15g6M4Mc5S7rh62FygM44B6f/view?usp=sharing

A worker machine was created successfully in this failureDomain:
huliu-gcp13c2-6sh7k-worker-f-g5h77   Running    n2-standard-4   us-central1   us-central1-f   8m36s

This is a clone of issue OCPBUGS-5129. The following is the description of the original issue:

Description of problem:

I attempted to install a bare-metal (BM) single-node OpenShift (SNO) cluster with the agent-based installer.
In the install-config, I disabled all supported capabilities except marketplace. install-config snippet:

capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - marketplace

The system installed fine but the capabilities config was not passed down to the cluster. 

clusterversion: 
status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - CSISnapshot
      - Console
      - Insights
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - CSISnapshot
      - Console
      - Insights
      - Storage
      - baremetal
      - marketplace
      - openshift-samples

oc -n kube-system get configmap cluster-config-v1 -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: ptp.lab.eng.bos.redhat.com
    bootstrapInPlace:
      installationDisk: /dev/disk/by-id/wwn-0x62cea7f04d10350026c6f2ec315557a0
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 0
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 1
    metadata:
      creationTimestamp: null
      name: cnfde8
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.16.231.0/24
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      none: {}
    publish: External
    pullSecret: ""





Version-Release number of selected component (if applicable):

4.12.0-rc.5

How reproducible:

100%

Steps to Reproduce:

1. Install SNO with the agent-based installer as described above

Actual results:

Capabilities installed  

Expected results:

Capabilities not installed 
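
To make the mismatch concrete, here is a small Python sketch (an illustration, not the installer's logic) that computes the capability set requested by the install-config snippet and compares it with the `enabledCapabilities` list reported by the cluster. The helper name `expected_capabilities` and the `BASELINE_SETS` table are assumptions; only the `None` baseline, taken here to mean an empty set, is modelled.

```python
def expected_capabilities(baseline: str, additional: list, baseline_sets: dict) -> set:
    """Union of the baseline capability set and the additionally enabled capabilities."""
    return set(baseline_sets[baseline]) | set(additional)

# Assumption: baselineCapabilitySet "None" contributes no capabilities.
BASELINE_SETS = {"None": []}

# Values from the install-config snippet in the report.
requested = expected_capabilities("None", ["marketplace"], BASELINE_SETS)

# enabledCapabilities as reported by the clusterversion status in the report.
enabled = {
    "CSISnapshot", "Console", "Insights", "Storage",
    "baremetal", "marketplace", "openshift-samples",
}

unexpected = sorted(enabled - requested)
print(unexpected)
```

Every capability other than marketplace ends up in `unexpected`, which is the symptom the report describes.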

Additional info:

 

This is a clone of issue OCPBUGS-10591. The following is the description of the original issue:

Description of problem:

Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag, due to either a missing or an invalid VERSION_OVERRIDE (https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed to the build.

This is resulting in all jobs invoked by the 4.12 nightlies failing to install.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-13-172313 and later

How reproducible:

Consistently, in 4.12 nightlies only (CI builds do not seem to be impacted).

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log

This is a clone of issue OCPBUGS-5548. The following is the description of the original issue:

Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390

When creating a Deployment, DeploymentConfig, or Knative Service with a Pipeline enabled, and then deleting it again with the option "Delete other resources created by console" enabled (only available on 4.13+ with the PR above), the automatically created Pipeline is not deleted.

When the user tries to create the same resource with a Pipeline again, this fails with an error:

An error occurred
secrets "nodeinfo-generic-webhook-secret" already exists

Version-Release number of selected component (if applicable):
4.13

(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5547)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Pipelines operator (tested with 1.8.2)
  2. Create a new project
  3. Navigate to Add > Import from git and create an application
  4. Case 1: In the topology select the new resource and delete it
  5. Case 2: In the topology select the application group and delete the complete app

Actual results:
Case 1: Delete resources:

  1. Deployment (tries it twice!) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Case 2: Delete application:

  1. Deployment (just once) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Expected results:
Case 1: Delete resource:

  1. Delete Deployment $name should be called just once
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret

Case 2: Delete application:

  1. (Keep this deletion) Deployment $name
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret
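
The expected deletion set above can be written down as data and compared with what was actually deleted. The Python sketch below is an illustration only, not the console's implementation; `expected_deletions` and `missing_deletions` are hypothetical helper names, and the resource name `nodeinfo` is inferred from the secret name in the error message.

```python
def expected_deletions(name: str) -> list:
    """(kind, name, is_prefix) triples that should be removed with the resource."""
    return [
        ("Deployment", name, False),
        ("Service", name, False),
        ("Route", name, False),
        ("ImageStream", name, False),
        ("Pipeline", name, False),
        # The TriggerTemplate name ends in a random suffix, so match by prefix.
        ("TriggerTemplate", f"trigger-template-{name}-", True),
        ("Secret", f"{name}-generic-webhook-secret", False),
        ("Secret", f"{name}-github-webhook-secret", False),
    ]

def missing_deletions(name: str, deleted: list) -> list:
    """Return the expected (kind, name) pairs that do not appear in `deleted`."""
    missing = []
    for kind, expected, is_prefix in expected_deletions(name):
        found = any(
            k == kind and (n.startswith(expected) if is_prefix else n == expected)
            for k, n in deleted
        )
        if not found:
            missing.append((kind, expected))
    return missing

# Deletions observed in the "Actual results" for either case.
actual = [
    ("Deployment", "nodeinfo"),
    ("Service", "nodeinfo"),
    ("Route", "nodeinfo"),
    ("ImageStream", "nodeinfo"),
]
for kind, resource in missing_deletions("nodeinfo", actual):
    print(kind, resource)
```

The leftover `nodeinfo-generic-webhook-secret` in this list is the object that triggers the "already exists" error on re-creation.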

Additional info:

Description of problem:

The Azure VIP 168.63.129.16 needs to be added to noProxy so that a VM can report back its creation status [1]. A similar thing needs to be done for the armEndpoint of Azure Stack Hub (ASH), to make sure that future cluster nodes do not communicate with a Stack Hub API through the proxy.

[1] https://docs.microsoft.com/en-us/azure/virtual-network/what-is-ip-address-168-63-129-16
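
As a rough illustration of the requested behaviour, the Python sketch below appends the Azure VIP and the host of an armEndpoint to a user-supplied noProxy string. It is not the installer's or the operator's code; `effective_no_proxy` is a hypothetical helper, and the endpoint URL in the usage line is a made-up example value.

```python
from typing import Optional
from urllib.parse import urlparse

AZURE_VIP = "168.63.129.16"  # address named in the report

def effective_no_proxy(user_no_proxy: str, arm_endpoint: Optional[str] = None) -> str:
    """Return the noProxy list with the Azure VIP and the armEndpoint host appended."""
    entries = [e.strip() for e in user_no_proxy.split(",") if e.strip()]
    extras = [AZURE_VIP]
    if arm_endpoint:
        host = urlparse(arm_endpoint).hostname
        if host:
            extras.append(host)
    for extra in extras:
        if extra not in entries:  # avoid duplicating user-provided entries
            entries.append(extra)
    return ",".join(entries)

# Hypothetical armEndpoint, used only to show the shape of the result.
print(effective_no_proxy(".cluster.local,10.0.0.0/16", "https://management.local.azurestack.example"))
```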

Version-Release number of selected component (if applicable):

4.10.20

How reproducible:

Need to have a proxy server in ASH and run the installer

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected these two to be auto-added, as they are very specific and difficult to troubleshoot.

Expected results:

 

Additional info:

This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2104997 against the cluster-network-operator since the fix involves changing both the operator and the installer.

Description of problem:

Whenever an MC that needs a reboot is applied to a MachineConfigPool, the pool becomes degraded while the node is rebooting.


Version-Release number of selected component (if applicable):

Baremetal IPI dual stack cluster

FLEXY TEMPLATE: private-templates/functionality-testing/aos-4_13/ipi-on-baremetal/versioned-installer-packet_libvirt-dual_stack-ci

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-02-21-014524   True        False         3h2m    Cluster version is 4.13.0-0.nightly-2023-02-21-014524

How reproducible:

Very often

Steps to Reproduce:

1. Create a MC that needs to reboot the nodes
2. Eventually (quite often) the MCP becomes degraded, reporting this error:
            {
                "lastTransitionTime": "2023-02-22T15:44:34Z",
                "message": "Node worker-0.rioliu-0222c.qe.devcluster.openshift.com is reporting: \"error running rpm-ostree kargs: signal: terminated\\n\"",
                "reason": "1 nodes are reporting degraded status on sync",
                "status": "True",
                "type": "NodeDegraded"
            },
3. After some minutes (once the node has completely rebooted) the pool stops reporting a degraded status

Actual results:

The MachineConfigPool is degraded

Expected results:

MachineConfigPools should never report a degraded status with a valid MC

Additional info:

It looks like we are executing the "rpm-ostree kargs" command right after we execute the "systemctl reboot" command.

17:20:51.570629    4658 update.go:1897] Removing SIGTERM protection   
17:20:51.570646    4658 update.go:1867] initiating reboot: Node will reboot into config rendered-worker-923735505fa2d7a5811b9c5866c5ad12
17:20:51.579923    4658 update.go:1867] reboot successful
17:20:51.582415    4658 daemon.go:518] Transitioned from state: Done -> Working
17:20:51.582426    4658 daemon.go:523] State and Reason: Working
17:20:51.609420    4658 rpm-ostree.go:400] Running captured: rpm-ostree kargs
17:20:51.612228    4658 daemon.go:600] Preflight config drift check failed: error running rpm-ostree kargs: signal: terminated 
17:20:51.612244    4658 writer.go:200] Marking Degraded due to: error running rpm-ostree kargs: signal: terminated 
17:20:51.614830    4658 daemon.go:1030] Shutting down MachineConfigDaemon
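
The timing claim can be checked directly against the excerpt. The following Python sketch (an illustration, with `first_timestamp` as a hypothetical helper) extracts the timestamps of the "initiating reboot" and "rpm-ostree kargs" lines from the log above and computes the gap between them.

```python
from datetime import datetime

# Three lines copied from the daemon log excerpt above.
LOG = """\
17:20:51.570646    4658 update.go:1867] initiating reboot: Node will reboot into config rendered-worker-923735505fa2d7a5811b9c5866c5ad12
17:20:51.579923    4658 update.go:1867] reboot successful
17:20:51.609420    4658 rpm-ostree.go:400] Running captured: rpm-ostree kargs
"""

def first_timestamp(log: str, needle: str):
    """Return the timestamp of the first log line containing `needle`, or None."""
    for line in log.splitlines():
        if needle in line:
            return datetime.strptime(line.split()[0], "%H:%M:%S.%f")
    return None

reboot = first_timestamp(LOG, "initiating reboot")
kargs = first_timestamp(LOG, "rpm-ostree kargs")
gap_ms = (kargs - reboot).total_seconds() * 1000
print(f"rpm-ostree kargs started {gap_ms:.1f} ms after the reboot was initiated")
```

A gap of a few tens of milliseconds is consistent with the command being started while the node is already shutting down and then receiving SIGTERM.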


We have not seen this problem on platforms other than bare metal.

You can find links to the logs from before and after the reboot in the comments.

Description of the problem:

[Reproduced in Staging]

In prod [BE v2.12.2][UI v2.12.1], the BE and UI allow creating a cluster with a dot ('.') in the name, but starting the installation fails with the following error:

 Failed to prepare the installation due to an unexpected error: failed generating install config for cluster 878657b7-cff1-47c9-9f82-464cf7b6e2e9: error running openshift-install manifests, level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: metadata.name: Invalid value: "test2.12": cluster name must not contain '.' : exit status 3. Please retry later
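
One way to surface this earlier is to validate the name when the cluster is created rather than at install time. The Python sketch below is a hedged illustration, not the assisted-service or installer code: `validate_cluster_name` is a hypothetical helper, and the DNS-label-style pattern is an assumption about what an acceptable name looks like. Only the "must not contain '.'" rule comes from the error message above.

```python
import re

# Assumption: an acceptable name looks like a single lowercase DNS label.
_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def validate_cluster_name(name: str) -> list:
    """Return a list of validation errors; an empty list means the name is accepted."""
    errors = []
    if "." in name:
        # Same wording as the installer error quoted in the report.
        errors.append("cluster name must not contain '.'")
    elif not _LABEL.match(name):
        errors.append("cluster name must consist of lowercase alphanumerics or '-'")
    return errors

for candidate in ("test2.12", "test2-12"):
    print(candidate, validate_cluster_name(candidate))
```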

 

How reproducible:

100%

Steps to reproduce:

1. Create a cluster with a dot ('.') in the name, for example test2.12

2. Start the installation

Actual results:

 

Expected results:

Description of problem:

cluster-policy-controller has unnecessary permissions and is able to operate on all leases in the KCM namespace. This also applies to namespace-security-allocation-controller, which was moved some time ago and does not need a lock mechanism.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

cloud-controller-manager does not react to changes to infrastructure secrets (in the OpenStack case: clouds.yaml).
As a consequence, if credentials are rotated (and the old ones are rendered useless), load balancer creation and deletion will not succeed any more. Restarting the controller fixes the issue on a live cluster.

Logs show that it couldn't find the application credentials:

Dec 19 12:58:58.909: INFO: At 2022-12-19 12:53:58 +0000 UTC - event for udp-lb-default-svc: {service-controller } EnsuringLoadBalancer: Ensuring load balancer
Dec 19 12:58:58.909: INFO: At 2022-12-19 12:53:58 +0000 UTC - event for udp-lb-default-svc: {service-controller } SyncLoadBalancerFailed: Error syncing load balancer: failed to ensure load balancer: failed to get subnet to create load balancer for service e2e-test-openstack-q9jnk/udp-lb-default-svc: Unable to re-authenticate: Expected HTTP response code [200 204 300] when accessing [GET https://compute.rdo.mtl2.vexxhost.net/v2.1/0693e2bb538c42b79a49fe6d2e61b0fc/servers/fbeb21b8-05f0-4734-914e-926b6a6225f1/os-interface], but got 401 instead
{"error": {"code": 401, "title": "Unauthorized", "message": "The request you have made requires authentication."}}: Resource not found: [POST https://identity.rdo.mtl2.vexxhost.net/v3/auth/tokens], error message: {"error":{"code":404,"message":"Could not find Application Credential: 1b78233956b34c6cbe5e1c95445972a4.","title":"Not Found"}}

OpenStack CI has been instrumented to restart CCM after credentials rotation, so that we silence this particular issue and avoid masking any other. That workaround must be reverted once this bug is fixed.
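
A common pattern for reacting to rotated credentials is to fingerprint the secret content and reload when the fingerprint changes. The Python sketch below only illustrates that pattern; it is not how cloud-controller-manager is implemented or how the bug was fixed, and `SecretWatcher` and its callback are hypothetical names.

```python
import hashlib

class SecretWatcher:
    """Remembers a digest of the last seen secret content and reports changes."""

    def __init__(self, on_change):
        self._digest = None
        self._on_change = on_change

    def observe(self, content: bytes) -> bool:
        """Record `content`; invoke the callback if it differs from the previous content."""
        digest = hashlib.sha256(content).hexdigest()
        changed = self._digest is not None and digest != self._digest
        self._digest = digest
        if changed:
            self._on_change()
        return changed

reloads = []
watcher = SecretWatcher(lambda: reloads.append("reload cloud provider"))
watcher.observe(b"clouds: {openstack: {auth: {application_credential_id: old}}}")
watcher.observe(b"clouds: {openstack: {auth: {application_credential_id: new}}}")
print(reloads)
```

The first observation only establishes a baseline; the second, with different content, triggers the callback once.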

This is a clone of issue OCPBUGS-10031. The following is the description of the original issue:

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-metal-ipi-sdn-virtualmedia

Reproduced locally, the failure is:

level=error msg=Attempted to gather debug logs after installation failure: must provide bootstrap host address                                                                               
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected                
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected                
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected                                   
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected                                   
level=error msg=Cluster operator network Degraded is True with ApplyOperatorConfig: Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ostest.test.metalkube.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 192.168.111.5:6443: connect: connection refused

I saw this occur one time when running installs in a continuous loop. This was with COMPACT_IPV4 in a non-disconnected setup.

WaitForBootstrapComplete shows it can't access the API

level=info msg=Unable to retrieve cluster metadata from Agent Rest API: no clusterID known for the cluster
level=debug msg=cluster is not registered in rest API
level=debug msg=infraenv is not registered in rest API

This is because create-cluster-and-infraenv.service failed

Failed Units: 2
  create-cluster-and-infraenv.service
  NetworkManager-wait-online.service

The agentbasedinstaller register command wasn't able to retrieve the image to get the version

Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"
Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="failed to get image openshift version from release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451" error="command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"

This occurs when attempting to get the release here:
https://github.com/openshift/assisted-service/blob/master/cmd/agentbasedinstaller/register.go#L58

 

We should add a retry mechanism or restart the service to handle spurious network failures like this.
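
A retry with exponential backoff is one way to absorb a transient DNS or network timeout like the one in the log. The Python sketch below illustrates the idea only; it is not the assisted-service implementation, and `retry`, `fetch_release_version`, the attempt count, and the delays are all assumptions made for the example.

```python
import time

def retry(operation, attempts=5, initial_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call operation() until it succeeds, backing off between failed attempts."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            # Network-level errors (timeouts, DNS failures) are OSError subclasses
            # in Python; any other exception propagates immediately.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= factor

# Hypothetical stand-in for the release-image lookup: fails twice, then succeeds.
calls = {"n": 0}

def fetch_release_version():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("i/o timeout")
    return "4.12.0-0.nightly-2022-10-25-210451"  # placeholder return value

delays = []  # record the delays instead of actually sleeping
version = retry(fetch_release_version, sleep=delays.append)
print(version, delays)
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff schedule easy to inspect; restarting the systemd unit on failure would be an alternative at the service level.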

 

 

In some cases it is necessary to edit AgentClusterInstall spec fields after the install finishes.

In one particular case, an ignition override had to be added in order to add workers on day 2 to work around a bug, but this was not possible due to the current implementation of the admission webhooks.

Related slack thread

This is a clone of issue OCPBUGS-11371. The following is the description of the original issue:

Description of problem:

oc-mirror fails to complete a heads-only mirror, complaining about devworkspace-operator.

Version-Release number of selected component (if applicable):

# oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.12.0-202302280915.p0.g3d51740.assembly.stream-3d51740", GitCommit:"3d517407dcbc46ededd7323c7e8f6d6a45efc649", GitTreeState:"clean", BuildDate:"2023-03-01T00:20:53Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

Attempt a heads-only mirror of registry.redhat.io/redhat/redhat-operator-index:v4.10

Steps to Reproduce:

1. Use the following ImageSetConfiguration:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    imageURL: myregistry.mydomain:5000/redhat-operators
    skipTLS: false
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
2. $ oc mirror --config=./imageset-config.yml docker://otherregistry.mydomain:5000/redhat-operators

Checking push permissions for otherregistry.mydomain:5000
Found: oc-mirror-workspace/src/publish
Found: oc-mirror-workspace/src/v2
Found: oc-mirror-workspace/src/charts
Found: oc-mirror-workspace/src/release-signatures
WARN[0026] DEPRECATION NOTICE:
Sqlite-based catalogs and their related subcommands are deprecated. Support for
them will be removed in a future release. Please migrate your catalog workflows
to the new file-based catalog format. 

The rendered catalog is invalid.

Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information.  

error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"  

Actual results:

error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"

Expected results:

For the catalog to be mirrored.
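
The phrase "head ... not reachable from bundle ..." refers to the channel's upgrade graph. The Python sketch below illustrates the general notion of reachability over `replaces` edges; it is not oc-mirror's algorithm, and the edges in `REPLACES` are hypothetical, since the report does not show the actual channel contents.

```python
def reachable(head: str, start: str, replaces: dict) -> bool:
    """Return True if `start` can be reached from `head` by following replaces edges."""
    seen, stack = set(), [head]
    while stack:
        current = stack.pop()
        if current == start:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(replaces.get(current, ()))
    return False

HEAD = "devworkspace-operator.v0.19.1-0.1679521112.p"
BUNDLE = "devworkspace-operator.v0.19.1"

# Hypothetical edges: both bundles replace v0.19.0, so neither leads to the other.
REPLACES = {
    HEAD: ["devworkspace-operator.v0.19.0"],
    BUNDLE: ["devworkspace-operator.v0.19.0"],
}

print(reachable(HEAD, BUNDLE, REPLACES))
```

Under these assumed edges the head has no upgrade path from the already mirrored bundle, which is the kind of situation the error message describes.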

Description of problem:

When installing an OpenShift cluster with OVN-Kubernetes in a specific OpenStack cloud used for some CI jobs, we noticed that the ovnkube-node Pod crashloops with "F1027 10:10:08.351527   31511 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found". The installation is unable to progress. Two services have failed on the master nodes: afterburn-hostname.service and ovs-configuration.service.

Additional information that might be helpful: the nodes' network is configured with an MTU of 1242.

DEVICE          TYPE           STATE      CONNECTION         
ens3            ethernet       connected  Wired connection 1 
genev_sys_6081  geneve         unmanaged  --                 
lo              loopback       unmanaged  --                 
br-int          ovs-bridge     unmanaged  --                 
br-int          ovs-interface  unmanaged  --                 
ovn-k8s-mp0     ovs-interface  unmanaged  --                 
br-int          ovs-port       unmanaged  --                 
ovn-e8fef9-0    ovs-port       unmanaged  --                 
ovn-k8s-mp0     ovs-port       unmanaged  -- 

[core@sqknth4w-71fac-gnwxk-master-1 ~]$ sudo systemctl status ovs-configuration.service
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2022-10-27 09:52:51 UTC; 1h 22min ago
  Process: 1578 ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes (code=exited, status=3)
 Main PID: 1578 (code=exited, status=3)
      CPU: 1.528s

Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 configure-ovs.sh[5434]: default via 10.0.0.1 dev ens3 proto dhcp src 10.0.0.27 metric 100
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 configure-ovs.sh[5434]: 10.0.0.0/16 dev ens3 proto kernel scope link src 10.0.0.27 metric 100
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 configure-ovs.sh[5434]: 169.254.169.254 via 10.0.0.10 dev ens3 proto dhcp src 10.0.0.27 metric 100
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 configure-ovs.sh[1578]: + ip -6 route show
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 configure-ovs.sh[5435]: ::1 dev lo proto kernel metric 256 pref medium
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 configure-ovs.sh[1578]: + exit 3
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Oct 27 09:52:51 sqknth4w-71fac-gnwxk-master-1 systemd[1]: ovs-configuration.service: Consumed 1.528s CPU time

Version-Release number of selected component (if applicable):
Our CI jobs are not getting detailed logs, but they seem to have started failing once we switched to using ovn-k by default on October 10.
https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.12-e2e-openstack-csi-cinder

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

Description of problem:

Specifying spec.nodePlacement.nodeSelector.matchExpressions on an IngressController API object causes cluster-ingress-operator to log error messages instead of configuring a node selector.

Version-Release number of selected component (if applicable):

All versions of OpenShift from 4.1 to 4.12.

How reproducible:

100%.

Steps to Reproduce:

1. Create an IngressController object with the following:

spec: 
  nodePlacement: 
    nodeSelector: 
      matchExpressions: 
      - key: node.openshift.io/remotenode
        operator: DoesNotExist

2. Check the cluster-ingress-operator logs: oc -n openshift-ingress-operator logs -c ingress-operator deploy/ingress-operator

Actual results:

The cluster-ingress-operator logs show the following error message:

2022-01-19T13:25:22.242Z	ERROR	operator.init.controller-runtime.manager.controller.ingress_controller	controller/controller.go:253	Reconciler error	{"name": "default", "namespace": "openshift-ingress-operator", "error": "failed to ensure deployment: failed to build router deployment: ingresscontroller \"default\" has invalid spec.nodePlacement.nodeSelector: operator \"NotIn\" cannot be converted into the old label selector format", "errorCauses": [{"error": "failed to ensure deployment: failed to build router deployment: ingresscontroller \"default\" has invalid spec.nodePlacement.nodeSelector: operator \"DoesNotExist\" cannot be converted into the old label selector format"}]}

Expected results:

Ideally, router pods should be configured with the specified node selector, and cluster-ingress-operator should not log an error. Unfortunately, this result cannot be implemented (see "Additional info").

Alternatively, we should document that using the spec.nodePlacement.nodeSelector.matchExpressions is unsupported.

Additional info:

Although it is possible to put a complex match expression in the IngressController.spec.nodePlacement.nodeSelector API field, it is impossible for the operator to convert this into a node selector for the router deployment's pod template spec because the latter requires the node selector be in a string form, and the string form for node selectors does not support complex expressions. This is an unfortunate oversight in the design of the API. We cannot make complex expressions work, and we cannot make a breaking API change, so the only feasible option here is to change the API godoc to warn users that using matchExpressions is not supported.
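
The limitation can be illustrated with a small Python sketch that converts a label selector into the flat key/value form a pod template's node selector needs. It is written to mirror the behaviour suggested by the error message in the log above, but it is an illustration under that assumption, not the operator's Go code; `label_selector_as_map` is a hypothetical name.

```python
def label_selector_as_map(selector: dict) -> dict:
    """Flatten a label selector into key/value pairs, or raise if that is impossible."""
    result = dict(selector.get("matchLabels", {}))
    for expr in selector.get("matchExpressions", []):
        op = expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            # "key In [v]" is equivalent to the pair key=v; more values are not.
            if len(values) != 1:
                raise ValueError(
                    f'operator "{op}" without a single value cannot be converted '
                    "into the old label selector format"
                )
            result[expr["key"]] = values[0]
        elif op in ("NotIn", "Exists", "DoesNotExist"):
            raise ValueError(
                f'operator "{op}" cannot be converted into the old label selector format'
            )
        else:
            raise ValueError(f'"{op}" is not a valid selector operator')
    return result

# The selector from step 1 of the report.
selector = {
    "matchExpressions": [
        {"key": "node.openshift.io/remotenode", "operator": "DoesNotExist"}
    ]
}
try:
    label_selector_as_map(selector)
except ValueError as err:
    print(err)
```

Only `matchLabels` and single-valued `In` expressions have an equivalent key/value form, which is why the report concludes that documenting the restriction is the feasible fix.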

Related discussion: https://github.com/openshift/api/pull/870#discussion_r601577395.

This Jira issue is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2043573 to placate automation.

Description of problem:

The 4.12.0 openshift-client package has kubectl 1.24.1 bundled in it when it should have 1.25.x 

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Very

Steps to Reproduce:

1. Download and unpack https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/stable/openshift-client-linux-4.12.0.tar.gz 
2. ./kubectl version

Actual results:

# ./kubectl version

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"1928ac4250660378a7d8c3430478dfe77977cb2a", GitTreeState:"clean", BuildDate:"2022-12-07T05:08:22Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4

Expected results:

kubectl version 1.25.x 

Additional info:

 

This is a clone of issue OCPBUGS-7395. The following is the description of the original issue:

Description of problem

Since the resource type option was moved to the advanced options in both the Deploy Image and Import from Git flows, some existing customers who use the feature are confused.

The UI no longer provides transparency of the type of resource which is being created.

Version-Release number of selected component (if applicable)

How reproducible

Steps to Reproduce

1.
2.
3.

Actual results

Expected results

Remove Resource type from the advanced options and place it back where it was previously. Resource type selection is now a dropdown, so we will put it in its previous spot, but it will use a different component than in 4.11.


Description of problem:
Existing shipwright tests were disabled in
https://github.com/openshift/console/pull/11947 to solve
https://bugzilla.redhat.com/show_bug.cgi?id=2116415

We should re-enable them.

Version-Release number of selected component (if applicable):
4.11 and 4.12, only on the CI

How reproducible:
Always

Steps to Reproduce:
1. Run our e2e tests

Actual results:
shipwright cypress tests are not executed

Expected results:
shipwright cypress tests should be executed

Additional info:

Description of problem:

The resource-added toast always has the text "Deployment created successfully." irrespective of the resource type selected in the form.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Navigate to developer perspective Add page and select/create a namespace
2. Select Import from Git option 
3. Enter all the required values and select resource type DeploymentConfig and click Create
4. See the toast notification texts

Actual results:

The toast shows "Deployment created successfully." for a DeploymentConfig resource.

Expected results:

Text in the Toast should be generic or based on type of resource created by the user.

Additional info:

 

Description of problem:

When creating a cluster on Azure with networkType OpenShiftSDN, the installation fails because the network is degraded.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-12-024338

How reproducible:

Always

Steps to Reproduce:

1. Set up a cluster on Azure, networktype is SDN
2.
3.

Actual results:

Installation failed as network is degraded.
$ oc get co
network                                    4.13.0-0.nightly-2023-02-12-024338   True        True          True       81m     DaemonSet "/openshift-sdn/sdn" rollout is not making progress - pod sdn-897ql is in CrashLoopBackOff State...

$ oc get po -n openshift-sdn                  
NAME                   READY   STATUS             RESTARTS         AGE
sdn-897ql              2/3     CrashLoopBackOff   20 (4m7s ago)    82m
sdn-bm5vr              2/3     CrashLoopBackOff   20 (4m9s ago)    82m
sdn-bwnpk              2/3     CrashLoopBackOff   24 (57s ago)     73m
sdn-ch2mc              2/3     CrashLoopBackOff   20 (4m19s ago)   82m
sdn-controller-9hgv2   2/2     Running            1 (71m ago)      82m
sdn-controller-gvmpg   2/2     Running            0                82m
sdn-controller-hnftp   2/2     Running            0                82m
sdn-kc66z              2/3     CrashLoopBackOff   24 (59s ago)     73m
sdn-t7vxf              2/3     CrashLoopBackOff   17 (3m30s ago)   66m

$ oc rsh sdn-897ql                        
Defaulted container "sdn" out of: sdn, kube-rbac-proxy, drop-icmp
sh-4.4# oc observe pods -n openshift-sdn --listen-addr= -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
sh: oc: command not found

$ oc logs sdn-897ql -n openshift-sdn -c drop-icmp      
+ touch /var/run/add_iptables.sh
+ chmod 0755 /var/run/add_iptables.sh
+ cat
++ date '+%m%d %H:%M:%S.%N'
+ echo 'I0213 01:43:01.287492831 - drop-icmp - start drop-icmp zhsunaz213-47hnp-master-2'
I0213 01:43:01.287492831 - drop-icmp - start drop-icmp zhsunaz213-47hnp-master-2
+ iptables -X AZURE_CHECK_ICMP_SOURCE
iptables v1.8.4 (nf_tables):  CHAIN_USER_DEL failed (Device or resource busy): chain AZURE_CHECK_ICMP_SOURCE
+ true
+ iptables -N AZURE_CHECK_ICMP_SOURCE
iptables: Chain already exists.
+ true
+ iptables -F AZURE_CHECK_ICMP_SOURCE
+ iptables -D INPUT -p icmp --icmp-type fragmentation-needed -j AZURE_CHECK_ICMP_SOURCE
+ iptables -I INPUT -p icmp --icmp-type fragmentation-needed -j AZURE_CHECK_ICMP_SOURCE
+ iptables -N AZURE_ICMP_ACTION
iptables: Chain already exists.
+ true
+ iptables -F AZURE_ICMP_ACTION
+ iptables -A AZURE_ICMP_ACTION -j LOG
+ iptables -A AZURE_ICMP_ACTION -j DROP
+ oc observe pods -n openshift-sdn --listen-addr= -l app=sdn -a '{ .status.hostIP }' -- /var/run/add_iptables.sh
/bin/bash: line 30: oc: command not found

Expected results:

Installation is successful.

Additional info:

must-gather: https://drive.google.com/file/d/1ObyVLXgbdciZehfIO_58EucX0qQjugI7/view?usp=sharing

Description of the problem:

We tried to install a 3+1 cluster with OCP 4.11 and 4.12. The installation ended with a failure because CVO timed out.

How reproducible:

100%

Steps to reproduce:

1. Install 3 + 1 Cluster 

Actual results:

CVO fails

Expected results:

Installation ended with success

Description of problem:

For OVNK to become CNCF compliant, we need to support the session affinity timeout feature and enable the e2e tests on the OpenShift side. This bug tracks the efforts to get this into 4.12 OCP.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
While searching for a resource, I tried to add it to the pinned resources with the "Add to navigation" button, but nothing happened. The browser log shows this error:

main-881e214a8ddf8f8a8eb8.js:53201 unhandled error: Uncaught TypeError: Cannot read properties of undefined (reading 'length') TypeError: Cannot read properties of undefined (reading 'length')
    at http://localhost:9000/static/main-881e214a8ddf8f8a8eb8.js:37694:147
    at http://localhost:9000/static/main-881e214a8ddf8f8a8eb8.js:38349:57
    at http://localhost:9000/static/main-881e214a8ddf8f8a8eb8.js:37693:9
    at pinToggle (http://localhost:9000/static/main-881e214a8ddf8f8a8eb8.js:67199:9)
    at onClick (http://localhost:9000/static/main-881e214a8ddf8f8a8eb8.js:67281:341)
    at HTMLUnknownElement.callCallback (http://localhost:9000/static/vendors~main-99688ccb22de160eb977.js:446274:14)
    at Object.invokeGuardedCallbackDev (http://localhost:9000/static/vendors~main-99688ccb22de160eb977.js:446323:16)

main-881e214a8ddf8f8a8eb8.js:37694 Uncaught TypeError: Cannot read properties of undefined (reading 'length')
    at main-881e214a8ddf8f8a8eb8.js:37694:147
    at main-881e214a8ddf8f8a8eb8.js:38349:57
    at main-881e214a8ddf8f8a8eb8.js:37693:9
    at pinToggle (main-881e214a8ddf8f8a8eb8.js:67199:9)
    at onClick (main-881e214a8ddf8f8a8eb8.js:67281:341)
    at HTMLUnknownElement.callCallback (vendors~main-99688ccb22de160eb977.js:446274:14)
    at Object.invokeGuardedCallbackDev (vendors~main-99688ccb22de160eb977.js:446323:16)
    at invokeGuardedCallback (vendors~main-99688ccb22de160eb977.js:446385:31)
    at invokeGuardedCallbackAndCatchFirstError (vendors~main-99688ccb22de160eb977.js:446399:25)
    at executeDispatch (vendors~main-99688ccb22de160eb977.js:450572:3)
...
window.onerror @ main-881e214a8ddf8f8a8eb8.js:53201
vendors~main-99688ccb22de160eb977.js:446420 Uncaught TypeError: Cannot read properties of undefined (reading 'length')
    at main-881e214a8ddf8f8a8eb8.js:37694:147
    at main-881e214a8ddf8f8a8eb8.js:38349:57
    at main-881e214a8ddf8f8a8eb8.js:37693:9
    at pinToggle (main-881e214a8ddf8f8a8eb8.js:67199:9)
    at onClick (main-881e214a8ddf8f8a8eb8.js:67281:341)
    at HTMLUnknownElement.callCallback (vendors~main-99688ccb22de160eb977.js:446274:14)
    at Object.invokeGuardedCallbackDev (vendors~main-99688ccb22de160eb977.js:446323:16)
    at invokeGuardedCallback (vendors~main-99688ccb22de160eb977.js:446385:31)
    at invokeGuardedCallbackAndCatchFirstError (vendors~main-99688ccb22de160eb977.js:446399:25)
    at executeDispatch (vendors~main-99688ccb22de160eb977.js:450572:3)

After some research I noticed this happens when pinnedResources is {} in the user settings ConfigMap. I don't know how to reproduce it with just UI interactions.
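A defensive way to handle such a value can be sketched as follows. The sketch assumes the setting is a JSON string that maps a perspective id to a list of pinned resource references. That shape, the function name, and the sample values are assumptions for illustration, not the console's actual implementation.

```python
import json


def add_pinned_resource(raw_value: str, perspective: str, resource: str) -> str:
    """Return the updated setting string with `resource` pinned.

    Hypothetical helper: tolerates values such as "{}" or "null" by
    falling back to an empty structure instead of reading a property
    of an undefined entry.
    """
    try:
        parsed = json.loads(raw_value) if raw_value else {}
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}  # covers "null" and other non-object values

    current = parsed.get(perspective)
    if not isinstance(current, list):
        current = []  # covers "{}" where the perspective has no entry yet

    if resource not in current:
        current = current + [resource]
    parsed[perspective] = current
    return json.dumps(parsed)
```

With this kind of guard, both '{}' and 'null' lead to a list that contains the newly pinned resource rather than an unhandled TypeError.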

Version-Release number of selected component (if applicable):
4.13

How reproducible:
Always when manually modifying the ConfigMap, unsure how to reproduce this just with UI interactions.

Steps to Reproduce:
Open the user-settings ConfigMap and set "console.pinnedResources" to "{}" (with quotes, as all ConfigMap values need to be strings):

  console.pinnedResources: '{}'

Or run this patch command:

oc patch -n openshift-console-user-settings configmaps user-settings-kubeadmin --type=merge --patch '{"data":{"console.pinnedResources":"null"}}'

After that:

  1. Open dev perspective
  2. Navigate to Search
  3. Select a resource type
  4. Click on "Add to navigation"

Actual results:
Nothing happens when clicking on "Add to navigation"

(Browser log shows error above).

Expected results:
Resource type should be added to the navigation.

Additional info:

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Description of problem:

The cluster infrastructure object changes introduced for tech preview zonal installs are failing validation.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-20-104328

How reproducible:

consistently

Steps to Reproduce:

1. Create manifest with openshift-install
2. Check cluster infrastructure manifest 
3. Installation proceeds and the cluster infrastructure object is missing failure domains

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-5943. The following is the description of the original issue:

Kube 1.26 introduced the warning level TopologyAwareHintsDisabled event. TopologyAwareHintsDisabled is fired by the EndpointSliceController whenever reconciling a service that has activated topology aware hints via the service.kubernetes.io/topology-aware-hints annotation, but there is not enough information in the existing cluster resources (typically nodes) to apply the topology aware hints.

When rebasing OpenShift onto Kube 1.26, our CI builds are failing (except on AWS), because these events are firing "pathologically", for example:

: [sig-arch] events should not repeat pathologically
  events happened too frequently event happened 83 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 result=reject 

AWS nodes seem to have the proper values in the nodes. GCP has the values also, but they are not "right" for the purposes of the EndpointSliceController:

event happened 38 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 result=reject }

https://github.com/openshift/origin/pull/27666 will mask this problem (make it stop erroring in CI) but changes still need to be made in the product so end users are not subjected to these events.
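To see which nodes might trigger the first event above, a rough check can be sketched. It assumes node objects are available as dictionaries (for example, parsed from `oc get nodes -o json`) and only looks for the zone label and allocatable CPU named in the event message. It does not reproduce the EndpointSliceController's real logic, and the function name is hypothetical.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_missing_topology_info(nodes: list) -> list:
    """Return names of nodes lacking a zone label or allocatable CPU.

    Hypothetical helper that mirrors only the wording of the
    "Insufficient Node information" event message.
    """
    missing = []
    for node in nodes:
        meta = node.get("metadata", {})
        labels = meta.get("labels") or {}
        allocatable = node.get("status", {}).get("allocatable") or {}
        if not labels.get(ZONE_LABEL) or not allocatable.get("cpu"):
            missing.append(meta.get("name", "<unnamed>"))
    return missing
```

A non-empty result suggests the cluster cannot satisfy topology aware hints for that reason. The second event (overload threshold) depends on endpoint and zone counts and is not covered by this check.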

 

Description of the problem:
When I try to enable the lvmo operator via the API, I do not receive an API Bad Request error and I am able to set it up.
Instead, there is a validation error in the cluster:

	"{\"configuration\":[{\"id\":\"pull-secret-set\",\"status\":\"success\",\"message\":\"The pull secret is set.\"}],\"hosts-data\":[{\"id\":\"all-hosts-are-ready-to-install\",\"status\":\"failure\",\"message\":\"The cluster has hosts that are not ready to install.\"},{\"id\":\"sufficient-masters-count\",\"status\":\"success\",\"message\":\"The cluster has a sufficient number of master candidates.\"}],\"network\":[{\"id\":\"api-vip-defined\",\"status\":\"success\",\"message\":\"The API virtual IP is defined.\"…":\"success\",\"message\":\"The Service Network CIDR is defined.\"}],\"operators\":[{\"id\":\"cnv-requirements-satisfied\",\"status\":\"success\",\"message\":\"cnv is disabled\"},{\"id\":\"lso-requirements-satisfied\",\"status\":\"success\",\"message\":\"lso is disabled\"},{\"id\":\"lvm-requirements-satisfied\",\"status\":\"failure\",\"message\":\"ODF LVM operator is only supported for Single Node Openshift\"},{\"id\":\"odf-requirements-satisfied\",\"status\":\"success\",\"message\":\"odf is disabled\"}]}"
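The payload above is a JSON document grouped by category, where each entry carries an id, a status, and a message. A small sketch for listing only the failing entries follows. The function name is hypothetical and the input is assumed to be the unescaped JSON string.

```python
import json


def failed_validations(validations_info: str) -> list:
    """Return (category, id, message) tuples for entries with status "failure"."""
    data = json.loads(validations_info)
    failures = []
    for category, checks in data.items():
        for check in checks:
            if check.get("status") == "failure":
                failures.append((category, check.get("id"), check.get("message")))
    return failures
```

Applied to the operators section shown above, this would surface the lvm-requirements-satisfied entry, which is the validation that fails instead of the request being rejected up front.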

How reproducible:

Steps to reproduce:

1. create non SNO Cluster

2. try to use api to enable lvmo operator

3.

Actual results:

 Validation error

Expected results:

I expected to get a Bad Request error and not to be able to enable lvmo at all

This is a clone of issue OCPBUGS-8449. The following is the description of the original issue:

Description of problem:

Configure diskEncryptionSet as below in install-config.yaml, and do not set subscriptionId, as it is an optional parameter.

install-config.yaml
--------------------------------
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      encryptionAtHost: true
      osDisk:
        diskEncryptionSet:
          resourceGroup: jima07a-rg
          name: jima07a-des
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      encryptionAtHost: true
      osDisk:
        diskEncryptionSet:
          resourceGroup: jima07a-rg
          name: jima07a-des
  replicas: 3
platform:
  azure:
    baseDomainResourceGroupName: os4-common
    cloudName: AzurePublicCloud
    outboundType: Loadbalancer
    region: centralus
    defaultMachinePlatform:
      osDisk:
        diskEncryptionSet:
          resourceGroup: jima07a-rg
          name: jima07a-des

Then create the manifest files and create the cluster. The installer fails with this error:
$ ./openshift-install create cluster --dir ipi --log-level debug
...
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" 
FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet: Invalid value: azure.DiskEncryptionSet{SubscriptionID:"", ResourceGroup:"jima07a-rg", Name:"jima07a-des"}: failed to get disk encryption set: compute.DiskEncryptionSetsClient#Get: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="InvalidSubscriptionId" Message="The provided subscription identifier 'resourceGroups' is malformed or invalid." 

Checked manifest file cluster-config.yaml, and found that subscriptionId is not filled out automatically under defaultMachinePlatform
$ cat cluster-config.yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: qe.azure.devcluster.openshift.com
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform:
        azure:
          encryptionAtHost: true
          osDisk:
            diskEncryptionSet:
              name: jima07a-des
              resourceGroup: jima07a-rg
              subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a
            diskSizeGB: 0
            diskType: ""
          osImage:
            offer: ""
            publisher: ""
            sku: ""
            version: ""
          type: ""
      replicas: 3
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform:
        azure:
          encryptionAtHost: true
          osDisk:
            diskEncryptionSet:
              name: jima07a-des
              resourceGroup: jima07a-rg
              subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a
            diskSizeGB: 0
            diskType: ""
          osImage:
            offer: ""
            publisher: ""
            sku: ""
            version: ""
          type: ""
      replicas: 3
    metadata:
      creationTimestamp: null
      name: jimadesa
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.0.0.0/16
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      azure:
        baseDomainResourceGroupName: os4-common
        cloudName: AzurePublicCloud
        defaultMachinePlatform:
          osDisk:
            diskEncryptionSet:
              name: jima07a-des
              resourceGroup: jima07a-rg
            diskSizeGB: 0
            diskType: ""
          osImage:
            offer: ""
            publisher: ""
            sku: ""
            version: ""
          type: ""
        outboundType: Loadbalancer
        region: centralus
    publish: External

It works well when setting the disk encryption set without subscriptionId under defaultMachinePlatform or controlPlane/compute.
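The error message is consistent with a resource ID built from an empty subscription ID. The sketch below assumes the standard Azure resource ID layout for a disk encryption set and uses hypothetical helper names. It is an illustration of the defaulting that appears to be missing for defaultMachinePlatform, not the installer's code.

```python
def disk_encryption_set_id(subscription_id: str, resource_group: str, name: str) -> str:
    """Build an Azure resource ID for a disk encryption set."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/diskEncryptionSets/{name}"
    )


def with_default_subscription(des: dict, default_subscription_id: str) -> dict:
    """Return a copy of `des` with subscriptionId filled in when it is empty."""
    filled = dict(des)
    if not filled.get("subscriptionId"):
        filled["subscriptionId"] = default_subscription_id
    return filled


# With an empty subscription ID the path has nothing between "/subscriptions/"
# and "/resourceGroups", which plausibly explains why the service reports
# 'resourceGroups' as the subscription identifier.
broken = disk_encryption_set_id("", "jima07a-rg", "jima07a-des")
```

Applying the same defaulting to defaultMachinePlatform that already fills subscriptionId for compute and controlPlane in the generated cluster-config.yaml would avoid the malformed ID.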

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-05-104719

How reproducible:

Always on 4.11, 4.12, 4.13

Steps to Reproduce:

1. Prepare install-config, configure diskEncryptionSet under defaultMachinePlatform, controlPlane and compute without subscriptionId
2. Install cluster 
3.

Actual results:

The installer fails with the error shown above

Expected results:

Cluster is installed successfully

Additional info:

 

 

 

 

Assisted service application has a memory consumption pattern that keeps growing towards a given number (at the moment seems to be around 1.5GB).

This might be due to many factors, but one possibility is that this upper limit is determined by underlying data (the more data, the higher the "upper limit").

We suspect this because:

  • the pattern on stage tends to stabilize at a much lower number (around 500MB)
  • we have noticed inconsistent behaviour when requesting cluster pages: sometimes many requests to the same endpoint return different results. This is most likely due to stale caching in one of the pods, so the response depends on which pod the request hits

 

Below we can observe the memory consumption pattern described (can also be found at https://grafana.app-sre.devshift.net/d/assisted-service-resource-dashboard/assisted-service-resource-dashboard?orgId=1&from=now-30d&to=now&viewPanel=34):

 

 

Description of problem:

On an ARM cluster, plugins in an error state block the console from loading

- the issue can be easily reproduced on an ARM cluster
- the issue is not reproduced on amd nightly 4.13.0-0.nightly-2022-11-17-094851 or 4.12.0-0.nightly-2022-11-17-164258
- this issue may not be reproducible only on ARM; a plugin in an error state on amd should presumably reproduce it as well

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-arm64-2022-11-17-195451

How reproducible:

Always

Steps to Reproduce:

1. Create 3 consoleplugins
$ oc create -f failed-console-demo-plugin.yaml
namespace/console-demo-plugin created
deployment.apps/console-demo-plugin created
service/console-demo-plugin created
consoleplugin.console.openshift.io/console-demo-plugin created
$ oc create -f failed-console-customization-plugin.yaml
namespace/console-customization-plugin created
deployment.apps/console-customization-plugin created
configmap/nginx-conf created
service/console-customization-plugin created
consoleplugin.console.openshift.io/console-customization created
$ oc create -f pending-console-demo-plugin-1.yaml
namespace/console-demo-plugin-1 created
deployment.apps/console-demo-plugin-1 created
service/console-demo-plugin-1 created
consoleplugin.console.openshift.io/console-demo-plugin-1 created

2. on ARM cluster, plugin deployment can not be ready since all plugin images are built on AMD64
$ oc get pods -n console-demo-plugin
NAME                                  READY   STATUS   RESTARTS      AGE
console-demo-plugin-86f8d5497-6znr2   0/1     Error    3 (31s ago)   52s
$ oc get pods -n console-demo-plugin-1
NAME                                    READY   STATUS   RESTARTS      AGE
console-demo-plugin-1-c856989b4-q2pb9   0/1     Error    2 (28s ago)   34s
$ oc get pods -n console-customization-plugin
NAME                                            READY   STATUS   RESTARTS      AGE
console-customization-plugin-568cf77f67-g8cd7   0/1     Error    3 (32s ago)   55s 
3. Enable the plugins
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-demo-plugin","console-customization","console-demo-plugin-1"] } }' --type=merge
console.operator.openshift.io/cluster patched
4. Refresh console and load plugins 

Actual results:

3. console can not be loaded successfully

Expected results:

3. any kind of plugin error should not block console from loading

Additional info:

 

 

 

 

 

 

Follow up for https://issues.redhat.com/browse/HOSTEDCP-969

Create metrics and grafana panel in

https://hypershift-monitoring.homelab.sjennings.me:3000/d/PGCTmCL4z/hypershift-slos-slis-alberto-playground?orgId=1&from=now-24h&to=now

https://github.com/openshift/hypershift/tree/main/contrib/metrics

for NodePool internal SLOs/SLIs:

  • NodePoolDeletionDuration
  • NodePoolInitialRolloutDuration

Move existing metrics when possible from metrics loop into nodepool controller:

- nodePoolSize

Explore and discuss granular metrics to track NodePool lifecycle bottlenecks: infra, ignition, node networking, available. Consolidate that with the hostedClusterTransitionSeconds metrics and dashboard panels

Explore and discuss metrics for upgrade duration SLO for both HC and NodePool.

Background

Currently, disk-speed-check runs FIO with the wrong command line arguments. The way it runs, the speed is checked against the RAM file system, and not the installation disk

In https://issues.redhat.com/browse/MGMT-11885 Vadim Rutkovsky fixed that problem, but this created new problems, since FIO writes to disk and deletes the partition table.
The problems arise when the installation disk contains LVM, since FIO writes to disk before the disk cleanup function runs, and then the cleanup function does not find any LVM partitions, and skips cleanup

To address this second problem, we added an offset to fio so the partition table is not wiped:
https://issues.redhat.com/browse/MGMT-12139

Even so, Elia Jahshan found that the offset fix was not working in some corner case situations

Current situation

To prevent further problems, we decided to revert both changes mentioned before, and find a better solution for all this

Proposed solution

The proposed solution implies running the cleanup function before the disk-speed-check when LVM is detected:
1. detect the boot disk partition type
2. if boot disk type is LVM we need to wipe it before speed check, so that fio would not mess with it
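The ordering in the proposed solution can be sketched as a small planning function. The step names and the function are hypothetical. The non-LVM ordering (speed check first, then cleanup) is an assumption taken from the background description, which says FIO currently runs before the cleanup function.

```python
def plan_disk_steps(boot_disk_has_lvm: bool) -> list:
    """Return the order of disk steps for a host.

    Hypothetical sketch: when LVM is detected, cleanup runs first so that
    fio never writes over LVM metadata that cleanup still needs to find.
    """
    if boot_disk_has_lvm:
        return ["cleanup", "disk-speed-check"]
    # Assumed unchanged ordering for disks without LVM.
    return ["disk-speed-check", "cleanup"]
```

Keeping the decision in one place like this makes the LVM special case explicit and easy to cover with a unit test.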

Description of problem:

The dropdown in the Logs tab for Windows nodes does not show the `containerd` and `wicd` services

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

In an OCP cluster with a Windows node:
1. Go to the Node page for the Windows node
2. Go to the Logs tab 
3. Click the dropdown to select the service

Actual results:

- journal
- containers
- hybrid-overlay
- kube-proxy
- kubelet

Expected results:

- journal
- containers
- hybrid-overlay
- kube-proxy
- kubelet
- containerd
- wicd

Additional info:

containerd and wicd are both file based logs, and the corresponding path should be added in the pathItems[1] for Windows nodes.

[1] https://github.com/openshift/console/blob/14337d93380f34114e09234e6d3efbf69f509149/frontend/packages/console-app/src/components/nodes/NodeLogs.tsx#L155

 

Description of problem:

The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not. 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
NAMESPACE          NAME                      AGE
openshift-multus   2001-1b70-820d-4b04--13   4m53s
openshift-multus   2001-1b70-820d-4b05--13   4m49s

2. Verify that the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when it runs:

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             4d13h

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
No resources found

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5s              4d13h

 

Actual results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io resources are removed for each created pod by the ip-reconciler cronjob.
The "overlapping ranges" are not used. 

Expected results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io resources should not be removed, regardless of whether a pod has used an IP in the overlapping ranges.
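One way to express the intended behaviour is to treat a reservation as stale only when the pod it refers to no longer exists. The sketch below assumes each reservation records its owner as a "namespace/name" string under spec.podref. That field name, the function, and the sample data are assumptions for illustration and are not taken from the reconciler's source.

```python
def stale_reservations(reservations: list, live_pods: list) -> list:
    """Return names of reservations whose referenced pod is not alive.

    Hypothetical helper: `live_pods` holds "namespace/name" strings for
    pods that currently exist.
    """
    live = set(live_pods)
    return [
        r["metadata"]["name"]
        for r in reservations
        if r.get("spec", {}).get("podref") not in live
    ]
```

Under this rule, the two reservations listed in step 1 would be kept for as long as their pods are running, and only reservations left behind by deleted pods would be cleaned up.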

Additional info:

 

Description of the problem:

Currently when the cluster is already installed the compatible agent validation is enabled. That means that if the service is restarted with a different version of the agent (a different `AGENT_DOCKER_IMAGE` environment variable) the validation will trigger and result in a confusing event.

Note that this doesn't mean that the agent will be really upgraded as sending the upgrade step to the agent is disabled in that cluster state.

How reproducible:

Always.

Steps to reproduce:

1. Prepare a cluster and start the installation.

2. Restart the service with a different `AGENT_DOCKER_IMAGE` environment variable.

Actual results:

A warning event is generated explaining that the validation is failing.

Expected results:

No event should be generated.

We faced an issue where the quota was reached for VPCE. This is visible in the status of the AWSEndpointService:

  - lastTransitionTime: "2023-03-01T10:23:08Z"
    message: 'failed to create vpc endpoint: VpcEndpointLimitExceeded'
    reason: AWSError
    status: "False"
    type: EndpointAvailable

but it should be propagated to the HC as it blocks worker creation (ignition was not working) and for better visibility.

 

We should include HostedClusterDegraded in hypershift_hostedclusters_failure_conditions metric so it's obvious when there's an issue across the fleet.

  - lastTransitionTime: "2023-05-04T13:53:50Z"
    message: kube-controller-manager deployment has 1 unavailable replicas
    observedGeneration: 1
    reason: UnavailableReplicas
    status: "True"
    type: Degraded