Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Today we expose two main APIs for HyperShift, namely `HostedCluster` and `NodePool`. We also have metrics to gauge adoption by reporting the number of hosted clusters and NodePools.
But we are still missing other metrics needed to draw correct inferences from what we see in the data.
Today, hypershift_hostedcluster_nodepools is exposed as a metric to report the number of NodePools used per hosted cluster.
Additional NodePool metrics such as hypershift_nodepools_size and hypershift_nodepools_available_replicas are available but not ingested in Telemetry.
In addition to knowing how many NodePools exist per hosted cluster, we would like to expose the NodePool size.
This will help inform our decision making and provide insight into how the product is being adopted and used.
The main goal of this epic is to show the following NodePools metrics on Telemeter, ideally as recording rules:
The implementation involves creating updates to the following GitHub repositories:
Similar PRs:
https://github.com/openshift/hypershift/pull/1544
https://github.com/openshift/cluster-monitoring-operator/pull/1710
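For illustration, recording rules of the kind mentioned above could be expressed as a PrometheusRule. The rule names, group name, and label sets below are assumptions (only the metric names come from the description above); the shipped rules live in the monitoring configuration referenced by the PRs:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: hypershift-nodepool-telemetry     # hypothetical name
    namespace: hypershift                   # hypothetical namespace
  spec:
    groups:
      - name: hypershift.nodepools.rules
        rules:
          # Aggregate NodePool size per hosted cluster before Telemetry ingestion
          - record: namespace_name:hypershift_nodepools_size:sum
            expr: sum by (namespace, name) (hypershift_nodepools_size)
          # Aggregate available replicas per hosted cluster
          - record: namespace_name:hypershift_nodepools_available_replicas:sum
            expr: sum by (namespace, name) (hypershift_nodepools_available_replicas)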
This feature is about providing workloads within an HCP KubeVirt cluster access to GPU devices. This is an important use case that expands usage of HCP KubeVirt to AI and ML workloads.
GOAL:
Support running workloads within HCP KubeVirt clusters which need access to GPUs.
Accomplishing this involves multiple efforts
Diagram of multiple nvidia operator layers
https://docs.google.com/document/d/1HwXVL_r9tUUwqDct8pl7Zz4bhSRBidwvWX54xqXaBwk/edit
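As a sketch of what exposing a GPU to an HCP KubeVirt node pool could look like, assuming the NodePool KubeVirt platform exposes a hostDevices list (the device name and sizing below are placeholders and require a matching permitted host device configuration on the infrastructure cluster):

  apiVersion: hypershift.openshift.io/v1beta1
  kind: NodePool
  metadata:
    name: gpu-nodepool              # hypothetical name
    namespace: clusters
  spec:
    clusterName: example
    replicas: 2
    management:
      upgradeType: Replace
    platform:
      type: KubeVirt
      kubevirt:
        compute:
          cores: 8
          memory: 32Gi
        hostDevices:                # assumed field for GPU passthrough
          - deviceName: nvidia.com/GV100GL_Tesla_V100   # placeholder resource name
            count: 1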
This will be covered by HCP doc team.
We start by contributing the documentation upstream to the hypershift repo, which is published at https://hypershift-docs.netlify.app/. Then we create a task for the ACM docs team to port those changes to the official documentation. They use our content as a seed for the official documentation. (Contact points are Laura Hinson, currently on parental leave, and Servesha Dudhgaonkar.)
Graduate the new PV access mode ReadWriteOncePod to GA.
Such a PV/PVC can be used only by a single pod on a single node, compared to the traditional ReadWriteOnce access mode, where a PV/PVC can be used on a single node by many pods.
The customers can start using the new ReadWriteOncePod access mode.
This new mode allows customers to provision and attach PV and get the guarantee that it cannot be attached to another local pod.
This new mode should support the same operations as regular ReadWriteOnce PVs therefore it should pass the regression tests. We should also ensure that this PV can't be accessed by another local-to-node pod.
As a user I want to attach a PV to a pod and ensure that it can't be accessed by another local pod.
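To illustrate the user story above, a claim requesting the new access mode looks like this (name, size, and storage class are placeholders):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: single-writer-pvc        # hypothetical name
  spec:
    accessModes:
      - ReadWriteOncePod           # only one pod on one node may use the volume
    resources:
      requests:
        storage: 10Gi
    storageClassName: standard-csi # hypothetical storage class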
We are getting this feature from upstream as GA. We need to test it and fully support it.
Check that there are no limitations or regressions.
Remove tech preview warning. No additional change.
N/A
Support the upstream "new RWO access mode" feature in OCP as GA, i.e. test it and have docs for it.
This is continuation of STOR-1171 (Beta/Tech Preview in 4.14), now we just need to mark it as GA and remove all TechPreview notes from docs.
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
Customers can override the default (three) value and set it to a custom value.
Make sure we document (or link) the VMware recommendations in terms of performance.
https://kb.vmware.com/s/article/1025279
The setting can be easily configured by the OCP admin and the configuration is automatically updated. Test that the setting is indeed applied and that the maximum number of snapshots per volume is indeed changed.
No change in the default
As an OCP admin I would like to change the maximum number of snapshots per volume.
Anything outside of
The default value can't be overwritten; reconciliation prevents it.
Make sure the customers understand the impact of increasing the number of snapshots per volume.
https://kb.vmware.com/s/article/1025279
Document how to change the value as well as link to the best practices. Mention that there is a hard limit of 32. Document other limitations, if any.
N/A
Epic Goal*
The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and to find a way to add such an extension to the OCP API.
Possible future candidates:
Why is this important? (mandatory)
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
https://kb.vmware.com/s/article/1025279
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1759)
2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)
3) Update vSphere operator to use the new snapshot options (STOR-1804)
4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Configure the maximum number of snapshots to a higher value. Check that the config has been updated and verify that the maximum number of snapshots per volume maps to the new setting value.
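As a sketch of what the admin-facing configuration could look like, assuming the ClusterCSIDriver extension ends up exposing a field along these lines (the field name and value are assumptions until the API lands):

  apiVersion: operator.openshift.io/v1
  kind: ClusterCSIDriver
  metadata:
    name: csi.vsphere.vmware.com
  spec:
    driverConfig:
      driverType: vSphere
      vSphere:
        # Assumed field name for the new snapshot option; hard upper bound is 32
        globalMaxSnapshotsPerBlockVolume: 10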
Drawbacks or Risk (optional)
Setting this option to a high value can introduce performance issues. This needs to be documented.
https://kb.vmware.com/s/article/1025279
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Support the SMB CSI driver through an OLM operator as Tech Preview. The SMB CSI driver allows OCP to consume SMB/CIFS storage with a dynamic CSI driver. This enables customers to leverage their existing storage infrastructure with either Samba or Microsoft environments.
https://github.com/kubernetes-csi/csi-driver-smb
Customers can start testing connecting OCP to their backends exposing CIFS. This allows them to consume net-new volumes or existing data produced outside OCP.
The driver already exists and is under the storage SIG umbrella. We need to make sure the driver meets OCP quality requirements and, if so, develop an operator to deploy and maintain it.
Review and clearly define all driver limitations and corner cases.
Review the different authentication methods.
Windows containers support.
Only the storage class login/password authentication method. Other methods can be reviewed and considered for GA.
Customers are expecting to consume storage, and possibly existing data, via SMB/CIFS. As of today, vendors' driver support for CIFS is really limited, whereas this protocol is widely used on premises, especially by MS/AD customers.
Need to understand what customers expect in terms of authentication.
How to extend this feature to windows containers.
Document the operator and driver installation, usage capabilities and limitations.
Future: How to manage interoperability with windows containers (not for TP)
Graduate the SMB CSI driver and its operator to GA
The Goal is to write an operator to deploy and maintain the SMB CSI driver
https://github.com/kubernetes-csi/csi-driver-smb
Authentication will be limited to a secret in the storage class. NTLM-style authentication only; no Kerberos support until we have it officially supported and documented. This limits the CSI driver to non-FIPS environments.
We're also excluding support for DFS (Distributed File System) at GA; we will look at possible support in a future OCP release.
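For illustration, a StorageClass using the secret-based (NTLM) authentication described above could look like the following; the share path, secret names, and namespace are placeholders based on the upstream csi-driver-smb examples:

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: smb-csi                               # hypothetical name
  provisioner: smb.csi.k8s.io
  parameters:
    source: //smb-server.example.com/share      # placeholder SMB share
    csi.storage.k8s.io/provisioner-secret-name: smb-creds
    csi.storage.k8s.io/provisioner-secret-namespace: openshift-cluster-csi-drivers
    csi.storage.k8s.io/node-stage-secret-name: smb-creds
    csi.storage.k8s.io/node-stage-secret-namespace: openshift-cluster-csi-drivers
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
  mountOptions:
    - dir_mode=0777
    - file_mode=0777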
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
The operator and driver meet the GA quality criteria. We have a good way to deploy a CIFS backend for CI/testing.
Identify all upstream capabilities and limitations. Define what we support at GA.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Yes |
Hosted control planes | Should work |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | OLM |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
We have several customer requests to allow pods to access shared storage exposed as SMB/CIFS. This can be because of already existing data generated outside OCP, or because the customer's environment already integrates an AD/SMB NAS infrastructure. This is fairly common in on-prem environments.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
How do we automatically deploy an SMB server for automated testing?
What authentication method will we support? - NTLM style only
High-level list of items that are out of scope. Initial completion during Refinement status.
Support of SMB server
Authentication beyond the default method (which references secrets in the StorageClass) and static provisioning; NTLM style only.
No Kerberos support until we have it officially supported and documented. This limits the CSI driver to non-FIPS environments.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
The Windows containers team can't directly leverage this work at the moment because they can't ship CSI drivers for Windows.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Customers may want to run these on FIPS-enabled clusters, which requires Kerberos authentication, as NTLM is not FIPS compliant. Unfortunately there is no official OCP Kerberos support today. This will be reassessed when we have it.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Reuse the TP doc and remove the TP warning. Change any delta content between TP and GA. Be explicit on the supported authentication (NTLM, no FIPS) and the Samba / Windows versions supported.
We're also excluding support for DFS (Distributed File System) at GA; we will look at possible support in a future OCP release.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Customers using Windows containers may be interested in this feature.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The Azure File CSI driver currently lacks cloning and snapshot restore features. The goal of this feature is to support the cloning feature as Technology Preview. This will help support snapshot restore in a future release.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.
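To illustrate the user story above, a clone is simply a new PVC whose dataSource references the origin volume (names, size, and storage class are placeholders):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: cloned-pvc                 # hypothetical name
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: azurefile-csi
    resources:
      requests:
        storage: 100Gi               # must be >= the source PVC size
    dataSource:
      kind: PersistentVolumeClaim
      name: source-pvc               # the origin Azure File volume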
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
This feature only applies to OCP running on Azure / ARO and File CSI.
The usual CSI cloning CI must pass.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all although SNO is rare on Azure |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 |
Operator compatibility | Azure File CSI operator |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | ship downstream images built from the forked azcopy |
High-level list of items that are out of scope. Initial completion during Refinement status.
Restoring snapshots is out of scope for now.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Update the CSI capability matrix and any language that mentions that Azure File CSI does not support cloning.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
No direct impact, but this benefits Azure / ARO customers.
Epic Goal*
Azure File added support for cloning volumes, which relies on the azcopy command upstream. We need to fork azcopy so we can build and ship downstream images from the forked azcopy. The AWS driver does the same with efs-utils.
Upstream repo: https://github.com/Azure/azure-storage-azcopy
NOTE: using snapshots as a source is currently not supported: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/7591a06f5f209e4ef780259c1631608b333f2c20/pkg/azurefile/controllerserver.go#L732
Why is this important? (mandatory)
This is required for adding Azure File cloning feature support.
Scenarios (mandatory)
1. As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1757)
2) Fork upstream repo (STOR-1716)
3) Add ART definition for OCP Component (STOR-1755)
4) Use the new image as base image for Azure File driver (STOR-1794)
5) Ensure e2e cloning tests are in CI (STOR-1818)
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Downstream Azure File driver image must include azcopy and cloning feature must be tested.
Drawbacks or Risk (optional)
No risks detected so far.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Once azure-file cloning is supported, we should add a clone test to our pre-submit/periodic CI.
The "pvcDataSource: true" should be added.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
In order to remove IPI/UPI support for Alibaba Cloud in OpenShift (currently Tech Preview, see also OCPSTRAT-1042), we need to provide an alternate method for Alibaba Cloud customers to spin up an OpenShift cluster. To that end, we want customers to use Assisted Installer with platform=none (and later platform=external) to bring up their OpenShift clusters.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | Multi-node |
Connected / Restricted Network | Connected for OCP 4.16 (Future: restricted) |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | This should be the same for any operator on platform=none |
Backport needed (list applicable versions) | OpenShift 4.16 onwards |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Hybrid Cloud Console changes needed |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
For OpenShift 4.16, we want to remove IPI support (currently Tech Preview) for Alibaba Cloud (OCPSTRAT-1042). Instead, we want customers to use the Assisted Installer (Tech Preview) with the agnostic platform for Alibaba Cloud in OpenShift 4.16 (OCPSTRAT-1149).
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Previous UPI-based installation doc: Alibaba Cloud Red Hat OpenShift Container Platform 4.6 Deployment Guide
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Assisted Installer using the agnostic platform (platform=none) for connected deployments.
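For reference, the agnostic platform corresponds to platform "none" in the installer configuration. The Assisted Installer drives this through its UI/API rather than a hand-written install-config, but a minimal sketch of the equivalent configuration (all values are placeholders) looks like:

  apiVersion: v1
  baseDomain: example.com            # placeholder domain
  metadata:
    name: alibaba-agnostic           # placeholder cluster name
  platform:
    none: {}                         # agnostic platform, no cloud integration
  controlPlane:
    name: master
    replicas: 3
  compute:
    - name: worker
      replicas: 2
  networking:
    networkType: OVNKubernetes
  pullSecret: '<pull-secret>'
  sshKey: '<ssh-public-key>'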
Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enables customers to have clusters with a large number of worker nodes.
Max cluster size 250+ worker nodes (mainly about control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads; they should use larger cloud instances when the worker node count exceeds the threshold, and smaller cloud instances when it is below the threshold.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Managed |
Classic (standalone cluster) | N/A |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | N/A |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 ARM |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Check OCM and CAPI requirements to expose a larger worker node count.
As a service provider, I want to be able to:
so that I can achieve
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
crio-wipe is an existing feature in OpenShift. When a node reboots, crio-wipe clears the node of all images so that the node boots clean. When the node comes back up, it needs access to the image registry to pull all images again, which takes time. In telco and edge situations the node might not have access to the image registry, and therefore takes longer to come up.
The goal of this feature is to adjust crio-wipe to wipe only the images that were corrupted by the sudden reboot, not all images.
Phase 2 of the enclave support for oc-mirror with the following goals
For 4.17 timeframe
Adding nodes to on-prem clusters in OpenShift is, in general, a complex task. We have numerous methods, and the field keeps adding automation around these methods with a variety of solutions, sometimes unsupported (see "Why is this important" below). Making cluster expansions easier will let users add nodes often and fast, leading to a much improved UX.
This feature adds nodes to any on-prem cluster, regardless of its installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that will add the node to the cluster specified by the user.
1. Create image:
$ export KUBECONFIG=kubeconfig-of-target-cluster
$ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker
2. Boot image
3. Check progress
$ oc adm add-node
An important goal of this feature is to unify and eliminate some of the existing options to add nodes, aiming to provide a much simpler experience (see "Why is this important" below). We have official and field-documented ways to do this that could be removed once this feature is in place, simplifying the experience, our docs, and the maintenance of said official paths:
With this proposed workflow we eliminate the need to use the UPI method in the vast majority of cases. We also eliminate the field-documented methods that keep popping up trying to solve this in multiple formats, and the need to recommend using MCE to all on-prem users, and finally we add a simpler option for IPI-deployed clusters.
In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and overall UX.
This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.
This feature is essential for several reasons. Firstly, it enables easy day2 installation without burdening the user with additional technical knowledge. This simplifies the process of scaling the cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).
Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.
Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, are overkill for use cases where the user simply wants to add nodes to their existing cluster but isn't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.
Additionally, in the case of IPI-deployed bare metal clusters, this feature eradicates the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.
Lastly, this problem is often brought up in the field, where examples of different custom solutions have been put in place by redhatters working with customers trying to solve the problem with custom automations, adding to inconsistent processes to scale clusters.
This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared feedback that makes solving the lack of cluster expansion a requirement for Red Hat and Oracle.
We already have the basic technologies to do this with the assisted-service and the agent-based installer, which already do this work for new clusters, and from which we expect to leverage the foundations for this feature.
Day 2 node addition with agent image.
Yet Another Day 2 Node Addition Commands Proposal
Enable day2 add node using agent-install: AGENT-682
Add an integration test to verify that the add-nodes command generates the ISO correctly.
Review the proper usage and download of the envtest-related binaries (api-server and etcd).
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Enable GCP Workload Identity Webhook
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Both, the scope of this is for self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | TBD |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Just like AWS STS and ARO Entra Workload ID, we want to provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.
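Conceptually, the webhook automates the kind of bound service account token projection shown below; the audience, paths, and names here are illustrative assumptions, not the webhook's actual output:

  apiVersion: v1
  kind: Pod
  metadata:
    name: gcp-wif-example            # hypothetical name
  spec:
    serviceAccountName: my-app       # hypothetical service account
    containers:
      - name: app
        image: registry.example.com/my-app:latest   # placeholder image
        volumeMounts:
          - name: bound-sa-token
            mountPath: /var/run/secrets/workload-identity
            readOnly: true
    volumes:
      - name: bound-sa-token
        projected:
          sources:
            - serviceAccountToken:
                audience: openshift          # assumed audience value
                expirationSeconds: 3600
                path: token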
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Will require the following:
Background
Once we have forked the webhook, we need to configure the operator to deploy it, similar to how we do for the other platforms.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Description of problem:
installing into Shared VPC stuck in waiting for network infrastructure ready
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-10-225505
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then insert Shared VPC settings (see [1]) 2. activate the service account which has the minimum permissions in the host project (see [2]) 3. "create cluster" FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project.
Actual results:
1. Getting stuck waiting for network infrastructure to become ready, until Ctrl+C is pressed.
2. Two firewall-rules are created in the service project unexpectedly (see [3]).
Expected results:
The installation should succeed, and no firewall rules should be created in either the service project or the host project.
Additional info:
Description of problem:
After successful installation of an IPI or UPI cluster using minimum permissions, when destroying the cluster it keeps reporting the error "failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission" unexpectedly.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-01-175607
How reproducible:
Always
Steps to Reproduce:
1. Try IPI or UPI installation using minimum permissions, and make sure it succeeds
2. Destroy the cluster using the same GCP credentials
Actual results:
It keeps reporting the errors below until timeout.
08-27 14:51:40.508 level=debug msg=Target TCP Proxies: failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission for 'projects/openshift-qe', forbidden
...output omitted...
08-27 15:08:18.801 level=debug msg=Target TCP Proxies: failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission for 'projects/openshift-qe', forbidden
Expected results:
It should not try to list regional target TCP proxies, because the CAPI-based installation only creates a global target TCP proxy. Also, the service account given to the installer already has the required compute.targetTcpProxies permissions (see [1] and [2]).
Additional info:
FYI the latest IPI Prow CI test was about 19 days ago, with no such issue; see https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-mini-perm-custom-type-f28/1823483536926052352
Required GCP permissions for installer-provisioned infrastructure: https://docs.openshift.com/container-platform/4.16/installing/installing_gcp/installing-gcp-account.html#minimum-required-permissions-ipi-gcp_installing-gcp-account
Required GCP permissions for user-provisioned infrastructure: https://docs.openshift.com/container-platform/4.16/installing/installing_gcp/installing-gcp-user-infra.html#minimum-required-permissions-upi-gcp_installing-gcp-user-infra
Description of problem:
Shared VPC installation using a service account that has all required permissions failed because the ingress cluster operator degraded, reporting the error "error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc'".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-multi-2024-08-07-221959
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then insert the interested settings (see [1]) 2. "create cluster" (see [2])
Actual results:
Installation failed because the ingress cluster operator degraded (see [2] and [3]).
$ oc get co ingress
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress             False       True          True       113m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc', forbidden...
In fact, the mentioned k8s firewall-rule doesn't exist in the host project (see [4]), and the given service account does have enough permissions (see [6]).
Expected results:
Installation succeeds, and all cluster operators are healthy.
Additional info:
Document all implementation steps and requirements to configure RHOSO's telemetry-operator to scrape an arbitrary external endpoint (which would be, in our case, ACM's monitoring operator in OpenShift) to add metrics.
Identify the minimum required access level to add scraping endpoints and OpenShift UI dashboards.
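As a generic illustration of the end state (the telemetry-operator's own API for this is exactly what the documentation task above must capture), an extra Prometheus scrape job against an external ACM endpoint could look like the following; every endpoint, token path, and selector below is a placeholder:

  scrape_configs:
    - job_name: acm-observability            # hypothetical job name
      scheme: https
      metrics_path: /federate                # assumed federation endpoint
      params:
        'match[]':
          - '{__name__=~"acm_.*"}'           # placeholder metric selector
      bearer_token_file: /etc/prometheus/secrets/acm-token/token   # placeholder token mount
      tls_config:
        insecure_skip_verify: true           # placeholder; use a proper CA in practice
      static_configs:
        - targets:
            - observatorium-api.example.com:443   # placeholder ACM endpoint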
The objective is to create a comprehensive backup and restore mechanism for HCP OpenShift Virtualization Provider. This feature ensures both the HCP state and the worker node state are backed up and can be restored efficiently, addressing the unique requirements of KubeVirt environments.
The HCP team has delivered OADP backup and restore steps for the Agent and AWS providers here. We need to add the steps necessary to make this work for HCP KubeVirt clusters.
Document this process in the upstream hypershift documentation.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Allow customers to enable EFS CSI usage metrics.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
OCP already supports exposing CSI usage metrics; however, the EFS metrics are not enabled by default. The goal of this feature is to allow customers to optionally turn on EFS CSI usage metrics in order to see them in the OCP console.
The EFS metrics are not enabled by default for a good reason: they can potentially impact performance. They are disabled in OCP because the CSI driver would walk through the whole volume, and that can be very slow on large volumes. For this reason, the default will remain the same (no metrics); customers need to explicitly opt in.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Clear procedure on how to enable it as a day 2 operation. Default remains no metrics. Once enabled the metrics should be available for visualisation.
We should also have a way to disable metrics.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | AWS only |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all AWS/EFS supported |
Operator compatibility | EFS CSI operator |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Should appear in OCP UI automatically |
Other (please specify) | OCP on AWS only |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user I want to be able to visualise the EFS CSI metrics.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Additional metrics
Enabling metrics by default.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Customer request as per
https://issues.redhat.com/browse/RFE-3290
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
We need to be extra clear on the potential performance impact
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document how to enable CSI metrics + warning about the potential performance impact.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
It can benefit any cluster on AWS using EFS CSI including ROSA
Epic Goal*
The goal of this epic is to provide a way for admins to turn on EFS CSI usage metrics. Since this could impact performance (the CSI driver walks through the whole volume), this option will not be enabled by default; admins will need to explicitly opt in.
Why is this important? (mandatory)
Turning on EFS metrics allows users to monitor how much EFS space is being used by OCP.
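Once enabled, usage can be inspected with standard kubelet volume-stats queries such as the one below; the namespace selector is a placeholder, and whether EFS volumes report these stats at all depends on the opt-in described here:

  sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="my-app"})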
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
None
Contributing Teams (and contacts) (mandatory)
Acceptance Criteria (optional)
Enable CSI metrics via the operator - ensure the driver is started with the proper cmdline options. Verify that the metrics are sent and exposed to the users.
Drawbacks or Risk (optional)
Metrics are calculated by walking through the whole volume, which can impact performance. For this reason, enabling CSI metrics will need an explicit opt-in from the admin. This risk needs to be explicitly documented.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a product manager or business owner of OpenShift Lightspeed, I want to track who is using which feature of OLS and why. I also want to track the product adoption rate so that I can make decisions about the product (add/remove features, add new investment).
Enable monitoring of OLS by default when a user installs the OLS operator ---> check the box by default
Users will have the ability to disable the monitoring ----> by unchecking the box
Refer to this Slack conversation: https://redhat-internal.slack.com/archives/C068JAU4Y0P/p1723564267962489
AWS CAPI implementation supports "Tenancy" configuration option: https://pkg.go.dev/sigs.k8s.io/cluster-api-provider-aws@v1.5.0/api/v1beta1#AWSMachineSpec
This option corresponds to functionality OCP currently exposes through MAPI:
This option is currently in use by existing ROSA customers, and will need to be exposed in HyperShift NodePools
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This requires a feature gate.
Wrap the NodePool tenancy API field in a struct, to group placement options and make it easy to add new ones to the API in the future.
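A sketch of the resulting API shape, assuming tenancy ends up grouped under a placement struct on the AWS NodePool platform (field names and values may differ in the final API):

  apiVersion: hypershift.openshift.io/v1beta1
  kind: NodePool
  metadata:
    name: example-nodepool          # hypothetical name
    namespace: clusters
  spec:
    clusterName: example
    replicas: 2
    management:
      upgradeType: Replace
    platform:
      type: AWS
      aws:
        instanceType: m5.xlarge
        placement:                  # assumed wrapper struct for placement options
          tenancy: dedicated        # e.g. default | dedicated | host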
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This requires a feature gate.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Introduce snapshot support for Azure File as Tech Preview.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
After introducing cloning support in 4.17, the goal of this epic is to add the last remaining piece: snapshot support as Tech Preview.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Should pass all the regular CSI snapshot tests. All failing or known issues should be documented in the release notes. Since this feature is TP, we can still introduce it with known issues.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all with Azure |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | Azure File CSI |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Already covered |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP on Azure user I want to perform snapshots of my PVC and be able to restore them as a new PVC.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Are there any known issues? If so, they should be documented.
High-level list of items that are out of scope. Initial completion during Refinement status.
N/A
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
We have support for CSI snapshots with other cloud providers; we need to align capabilities in Azure with their File CSI. Upstream support has lagged.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
User experience should be the same as other CSI drivers.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Add snapshot support in the CSI driver table, if there is any specific information to add, include it in the Azure File CSI driver doc. Any known issue should be documented in the RN.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Can be leveraged by ARO or OSD on Azure.
Epic Goal*
Add support for snapshots in Azure File.
Why is this important? (mandatory)
We should track upstream issues and ensure enablement in OpenShift. Snapshots are a standard feature of CSI and the reason we did not support it until now was lacking upstream support for snapshot restoration.
Snapshot restore feature was added recently in upstream driver 1.30.3 which we rebased to in 4.17 - https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/1904
Furthermore, we already included the azcopy CLI, which is a dependency of cloning (and snapshots). Enabling snapshots in 4.17 is therefore just a matter of adding a sidecar, a VolumeSnapshotClass, and RBAC in csi-operator, which is cheap compared to the gain.
However, we've observed a few issues with cloning that might need further fixes to be able to graduate to GA, and we intend to release the cloning feature as Tech Preview in 4.17. Since snapshots are implemented with azcopy too, we expect similar issues and suggest releasing the snapshot feature also as Tech Preview first in 4.17.
Scenarios (mandatory)
Users should be able to create a snapshot and restore PVC from snapshots.
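For illustration, the snapshot-and-restore flow uses the standard CSI snapshot API; the class and resource names below are placeholders:

  apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshot
  metadata:
    name: azurefile-snap                         # hypothetical name
  spec:
    volumeSnapshotClassName: csi-azurefile-vsc   # placeholder snapshot class
    source:
      persistentVolumeClaimName: source-pvc
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: restored-pvc                           # new PVC restored from the snapshot
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: azurefile-csi
    resources:
      requests:
        storage: 100Gi
    dataSource:
      kind: VolumeSnapshot
      apiGroup: snapshot.storage.k8s.io
      name: azurefile-snap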
Dependencies (internal and external) (mandatory)
azcopy - already added in scope of cloning epic
upstream driver support for snapshot restore - already added via 4.17 rebase
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
This feature only covers the downstream MAPI work to enable Capacity Blocks.
Capacity Blocks are needed in managed OpenShift (ROSA with Hosted Control Planes) via CAPI. Once the HCP feature and the OCM feature are completed, a Service Consumer can use upstream CAPI to set capacity reservations in a ROSA+HCP cluster.
Epic to track work done in https://github.com/openshift/machine-api-provider-aws/pull/110
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for the cluster admin to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.
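The sharing mechanism behind this is the Shared Resource CSI driver; a sketch of a SharedSecret pointing at the entitlement secret follows (resource names are placeholders, and the Tech Preview API group/version shown may change):

  apiVersion: sharedresource.openshift.io/v1alpha1
  kind: SharedSecret
  metadata:
    name: shared-entitlements             # hypothetical name
  spec:
    secretRef:
      name: etc-pki-entitlement           # placeholder entitlement secret
      namespace: openshift-config-managed # placeholder source namespace

A pod in another namespace then consumes it through a CSI volume using the csi.sharedresource.openshift.io driver with a volumeAttributes entry referencing the SharedSecret, subject to RBAC that permits using the shared resource.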
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
Epic Goal*
Remove the Shared Resource CSI Driver as a tech preview feature.
Why is this important? (mandatory)
Shared Resources was originally introduced as a tech preview feature in OpenShift Container Platform. After extensive review, we have decided to GA this component through the Builds for OpenShift layered product.
Expected GA will be alongside OpenShift 4.16. Therefore, it is safe to remove in OpenShift 4.17.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Ensure CSI Stack for Azure is running on management clusters with hosted control planes, allowing customers to associate a cluster as "Infrastructure only" and move the following parts of the stack:
This feature enables customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about their workloads, not the management stack used to operate their clusters; this feature gets us closer to that goal.
Non-CSI Stack for Azure-related functionalities are out of scope for this feature.
Workload identity authentication is not covered by this feature - see STOR-1748
This feature is designed to enable customers to run their Azure infrastructure more efficiently and cost-effectively by using HyperShift control planes and supporting infrastructure without incurring additional charges from Red Hat.
Documentation for this feature should provide clear instructions on how to enable the CSI Stack for Azure on management clusters with hosted control planes and associate a cluster as "Infrastructure only." It should also include instructions on how to move the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to the appropriate clusters.
This feature impacts the CSI Stack for Azure and any layered products that interact with it. Interoperability test scenarios should be factored by the layered products.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure File CSI driver operator + Azure File CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about their workloads, not the management stack used to operate their clusters; this feature gets us closer to that goal.
Scenarios (mandatory)
When leveraging Hosted Control Planes, the Azure File CSI driver operator and the Azure File CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run in the hosted cluster. This deployment model should provide the same feature set as a regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As part of this story, we will simply move the building and CI of the existing code to the combined csi-operator repository.
We need to modify csi-operator so that it can run as the azure-file operator on both HyperShift and standalone clusters.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about their workloads, not the management stack used to operate their clusters; this feature gets us closer to that goal.
Scenarios (mandatory)
When leveraging Hosted Control Planes, the Azure Disk CSI driver operator and the Azure Disk CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run in the hosted cluster. This deployment model should provide the same feature set as a regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
As part of this epic, engineers working on Azure HyperShift should be able to build and use Azure Disk storage on HyperShift guests via developer preview custom build images.
For this story, we are going to enable deployment of the Azure Disk CSI driver and its operator by default in the HyperShift environment.
Placeholder epic to capture all Azure tickets.
TODO: review.
As an end user of a hypershift cluster, I want to be able to:
so that I can achieve
From slack thread: https://redhat-external.slack.com/archives/C075PHEFZKQ/p1722615219974739
We need 4 different certs:
Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
Our current design of the EBS driver operator to support HyperShift does not scale well to other drivers. The existing design will lead to more code duplication between driver operators and a greater possibility of errors.
Why is this important? (mandatory)
An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Finally switch both CI and ART to the refactored aws-ebs-csi-driver-operator.
The functionality and behavior should be the same as the existing operator; however, the code is completely new. There could be some rough edges. See https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-operator-merge.md
CI should catch the most obvious errors; however, we need to test features that we do not cover in CI, such as:
Our CSI driver YAML files are mostly copy-pasted from the initial CSI driver (AWS EBS?).
As an OCP engineer, I want the YAML files to be generated so that we can easily keep consistency among the CSI drivers and make them less error-prone.
It should have no visible impact on the resulting operator behavior.
Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.
Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).
Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.
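For illustration only (the server names, datacenters, and paths below are made up), a multi-vCenter install-config would roughly follow the existing vSphere schema, with more than one entry under vcenters and failure domains referencing each server:
platform:
  vsphere:
    vcenters:
    - server: vcenter-1.example.com
      user: administrator@vsphere.local
      password: example-password
      datacenters:
      - dc-east
    - server: vcenter-2.example.com
      user: administrator@vsphere.local
      password: example-password
      datacenters:
      - dc-west
    failureDomains:
    - name: fd-east
      server: vcenter-1.example.com
      region: east
      zone: east-1a
      topology:
        datacenter: dc-east
        computeCluster: /dc-east/host/cluster-1
        datastore: /dc-east/datastore/ds-1
        networks:
        - vm-network-1
    - name: fd-west
      server: vcenter-2.example.com
      region: west
      zone: west-1a
      topology:
        datacenter: dc-west
        computeCluster: /dc-west/host/cluster-1
        datastore: /dc-west/datastore/ds-1
        networks:
        - vm-network-1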
This section contains all the test cases that we need to make sure work as part of the done^3 criteria.
This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.
For this task, we need to create a new periodic job that will test the multi-vCenter feature.
Add authentication to the internal components of the Agent Installer so that the cluster install is secure.
Requirements
Are there any requirements specific to the auth token?
Actors:
Do we need more than one auth scheme?
Agent-admin - agent-read-write
Agent-user - agent-read
Options for Implementation:
As a user, when creating node ISOs, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Create a GCP cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not yet have the tags, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
The TechPreview featureSet check added in the installer for userLabels and userTags should be removed, and the TechPreview reference made in the install-config GCP schema should also be removed.
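For reference, a hedged sketch of what a GA install-config could then carry without any featureSet gating (field values are illustrative, and the exact schema should be confirmed against the install-config documentation):
platform:
  gcp:
    projectID: my-project
    region: us-central1
    userLabels:
    - key: team
      value: ocp-dev
    userTags:
    - parentID: "1234567890"   # organization or project ID that owns the tag key
      key: cost-center
      value: engineering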
Acceptance Criteria
The TechPreview featureSet check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed. The new featureGate added in openshift/api should also be removed.
Acceptance Criteria
This Feature covers the effort (person-weeks of meetings in #wg-managed-ocp-versions) where OTA helped SD refine how the OCM work they plan to do would help, and what that OCM work might look like: https://issues.redhat.com/browse/OTA-996?focusedId=25608383&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25608383.
Currently the ROSA/ARO versions are not managed by OTA team.
This Feature covers the engineering effort to transfer the responsibility for OCP version management in OSD, ROSA, and ARO from SRE-P to OTA.
Here is the design document for the effort: https://docs.google.com/document/d/1hgMiDYN9W60BEIzYCSiu09uV4CrD_cCCZ8As2m7Br1s/edit?skip_itp2_check=true&pli=1
Here are some objectives:
Presentation from Jeremy Eder:
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
This epic is to transfer the responsibility of OCP version management in OSD, ROSA and ARO from SRE-P to OTA.
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
To make it easier to debug when the OTA-1211 configuration causes issues with retrieving update recommendations.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Description of problem:
Failed to create second cluster in shared vnet, below error is thrown out during creating network infrastructure when creating 2nd cluster, installer timed out and exited. ============== 07-23 14:09:27.315 level=info msg=Waiting up to 15m0s (until 6:24AM UTC) for network infrastructure to become ready... ... 07-23 14:16:14.900 level=debug msg= failed to reconcile cluster services: failed to reconcile AzureCluster service loadbalancers: failed to create or update resource jima0723b-1-x6vpp-rg/jima0723b-1-x6vpp-internal (service: loadbalancers): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal 07-23 14:16:14.900 level=debug msg= -------------------------------------------------------------------------------- 07-23 14:16:14.901 level=debug msg= RESPONSE 400: 400 Bad Request 07-23 14:16:14.901 level=debug msg= ERROR CODE: PrivateIPAddressIsAllocated 07-23 14:16:14.901 level=debug msg= -------------------------------------------------------------------------------- 07-23 14:16:14.901 level=debug msg= { 07-23 14:16:14.901 level=debug msg= "error": { 07-23 14:16:14.901 level=debug msg= "code": "PrivateIPAddressIsAllocated", 07-23 14:16:14.901 level=debug msg= "message": "IP configuration /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal/frontendIPConfigurations/jima0723b-1-x6vpp-internal-frontEnd is using the private IP address 10.0.0.100 which is already allocated to resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd.", 07-23 14:16:14.902 level=debug msg= "details": [] 07-23 14:16:14.902 level=debug msg= } 07-23 14:16:14.902 level=debug msg= } 07-23 14:16:14.902 level=debug msg= -------------------------------------------------------------------------------- Install-config for 1st cluster: ========= metadata: name: jima0723b platform: azure: region: eastus baseDomainResourceGroupName: os4-common networkResourceGroupName: jima0723b-rg virtualNetwork: jima0723b-vnet controlPlaneSubnet: jima0723b-master-subnet computeSubnet: jima0723b-worker-subnet publish: External Install-config for 2nd cluster: ======== metadata: name: jima0723b-1 platform: azure: region: eastus baseDomainResourceGroupName: os4-common networkResourceGroupName: jima0723b-rg virtualNetwork: jima0723b-vnet controlPlaneSubnet: jima0723b-master-subnet computeSubnet: jima0723b-worker-subnet publish: External shared master subnet/worker subnet: $ az network vnet subnet list -g jima0723b-rg --vnet-name jima0723b-vnet -otable AddressPrefix Name PrivateEndpointNetworkPolicies PrivateLinkServiceNetworkPolicies ProvisioningState ResourceGroup --------------- ----------------------- -------------------------------- ----------------------------------- ------------------- --------------- 10.0.0.0/24 jima0723b-master-subnet Disabled Enabled Succeeded jima0723b-rg 10.0.1.0/24 jima0723b-worker-subnet Disabled Enabled Succeeded jima0723b-rg internal lb frontedIPConfiguration on 1st cluster: $ az network lb show -n jima0723b-49hnw-internal -g jima0723b-49hnw-rg --query 'frontendIPConfigurations' [ { "etag": "W/\"7a7531ca-fb02-48d0-b9a6-d3fb49e1a416\"", "id": 
"/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd", "inboundNatRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-0", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-1", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-2", "resourceGroup": "jima0723b-49hnw-rg" } ], "loadBalancingRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/LBRuleHTTPS", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/sint-v4", "resourceGroup": "jima0723b-49hnw-rg" } ], "name": "jima0723b-49hnw-internal-frontEnd", "privateIPAddress": "10.0.0.100", "privateIPAddressVersion": "IPv4", "privateIPAllocationMethod": "Static", "provisioningState": "Succeeded", "resourceGroup": "jima0723b-49hnw-rg", "subnet": { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-rg/providers/Microsoft.Network/virtualNetworks/jima0723b-vnet/subnets/jima0723b-master-subnet", "resourceGroup": "jima0723b-rg" }, "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations" } ] From above output, privateIPAllocationMethod is static and always allocate privateIPAddress to 10.0.0.100, this might cause the 2nd cluster installation failure. Checked the same on cluster created by using terraform, privateIPAllocationMethod is dynamic. 
=============== $ az network lb show -n wxjaz723-pm99k-internal -g wxjaz723-pm99k-rg --query 'frontendIPConfigurations' [ { "etag": "W/\"e6bec037-843a-47ba-a725-3f322564be58\"", "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/frontendIPConfigurations/internal-lb-ip-v4", "loadBalancingRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/api-internal-v4", "resourceGroup": "wxjaz723-pm99k-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/sint-v4", "resourceGroup": "wxjaz723-pm99k-rg" } ], "name": "internal-lb-ip-v4", "privateIPAddress": "10.0.0.4", "privateIPAddressVersion": "IPv4", "privateIPAllocationMethod": "Dynamic", "provisioningState": "Succeeded", "resourceGroup": "wxjaz723-pm99k-rg", "subnet": { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-rg/providers/Microsoft.Network/virtualNetworks/wxjaz723-vnet/subnets/wxjaz723-master-subnet", "resourceGroup": "wxjaz723-rg" }, "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations" }, ... ]
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create shared vnet / master subnet / worker subnet
2. Create 1st cluster in shared vnet
3. Create 2nd cluster in shared vnet
Actual results:
2nd cluster installation failed
Expected results:
Both clusters are installed successfully.
Additional info:
Description of problem:
Install Azure fully private IPI cluster by using CAPI with payload built from cluster bot including openshift/installer#8727 and openshift/installer#8732.
install-config:
=================
platform:
  azure:
    region: eastus
    outboundType: UserDefinedRouting
    networkResourceGroupName: jima24b-rg
    virtualNetwork: jima24b-vnet
    controlPlaneSubnet: jima24b-master-subnet
    computeSubnet: jima24b-worker-subnet
publish: Internal
featureSet: TechPreviewNoUpgrade
Checked the storage account created by the installer; its property allowBlobPublicAccess is set to True.
$ az storage account list -g jima24b-fwkq8-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
jima24bfwkq8sa  True
This is not consistent with the terraform code, https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L74
At least, the storage account should have no public access for a fully private cluster.
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create fully private cluster
2. Check storage account created by installer
Actual results:
storage account have public access on fully private cluster.
Expected results:
storage account should have no public access on fully private cluster.
Additional info:
Description of problem:
In the install-config file, there is no zone/instance type setting under controlPlane or defaultMachinePlatform:
==========================
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstallAzure=true
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
After creating the cluster, master instances should be created in multiple zones, since the default instance type 'Standard_D8s_v3' has availability zones. Actually, the master instances are not created in any zone.
$ az vm list -g jima24a-f7hwg-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24a-f7hwg-master-0                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-1                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-2                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-worker-southcentralus1-wxncv  jima24a-f7hwg-rg  southcentralus  1
jima24a-f7hwg-worker-southcentralus2-68nxv  jima24a-f7hwg-rg  southcentralus  2
jima24a-f7hwg-worker-southcentralus3-4vts4  jima24a-f7hwg-rg  southcentralus  3
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. CAPI-based install on Azure platform with default configuration
Actual results:
master instances are created but not in any zone.
Expected results:
master instances should be created per zone based on selected instance type, keep the same behavior as terraform based install.
Additional info:
When setting zones under controlPlane in install-config, master instances can be created per zone.
install-config:
===========================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      zones: ["1","3"]
$ az vm list -g jima24b-p76w4-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24b-p76w4-master-0                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-master-1                      jima24b-p76w4-rg  southcentralus  3
jima24b-p76w4-master-2                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus1-bbcx8  jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus2-nmgfd  jima24b-p76w4-rg  southcentralus  2
jima24b-p76w4-worker-southcentralus3-x2p7g  jima24b-p76w4-rg  southcentralus  3
Description of problem:
Launch CAPI based installation on Azure Government Cloud, installer was timeout when waiting for network infrastructure to become ready. 06-26 09:08:41.153 level=info msg=Waiting up to 15m0s (until 9:23PM EDT) for network infrastructure to become ready... ... 06-26 09:09:33.455 level=debug msg=E0625 21:09:31.992170 22172 azurecluster_controller.go:231] "failed to reconcile AzureCluster" err=< 06-26 09:09:33.455 level=debug msg= failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= RESPONSE 404: 404 Not Found 06-26 09:09:33.456 level=debug msg= ERROR CODE: SubscriptionNotFound 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= { 06-26 09:09:33.456 level=debug msg= "error": { 06-26 09:09:33.456 level=debug msg= "code": "SubscriptionNotFound", 06-26 09:09:33.456 level=debug msg= "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found." 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= . Object will not be requeued 06-26 09:09:33.456 level=debug msg= > logger="controllers.AzureClusterReconciler.reconcileNormal" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" reconcileID="f2ff1040-dfdd-4702-ad4a-96f6367f8774" x-ms-correlation-request-id="d22976f0-e670-4627-b6f3-e308e7f79def" name="jima26mag-9bqkl" 06-26 09:09:33.457 level=debug msg=I0625 21:09:31.992215 22172 recorder.go:104] "failed to reconcile AzureCluster: failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: SubscriptionNotFound\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"SubscriptionNotFound\",\n \"message\": \"The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.\"\n }\n}\n--------------------------------------------------------------------------------\n. 
Object will not be requeued" logger="events" type="Warning" object={"kind":"AzureCluster","namespace":"openshift-cluster-api-guests","name":"jima26mag-9bqkl","uid":"20bc01ee-5fbe-4657-9d0b-7013bd55bf96","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"1115"} reason="ReconcileError" 06-26 09:17:40.081 level=debug msg=I0625 21:17:36.066522 22172 helpers.go:516] "returning early from secret reconcile, no update needed" logger="controllers.reconcileAzureSecret" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" name="jima26mag-9bqkl" reconcileID="2df7c4ba-0450-42d2-901e-683de399f8d2" x-ms-correlation-request-id="b2bfcbbe-8044-472f-ad00-5c0786ebbe84" 06-26 09:23:46.611 level=debug msg=Collecting applied cluster api manifests... 06-26 09:23:46.611 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure is not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline 06-26 09:23:46.611 level=info msg=Shutting down local Cluster API control plane... 06-26 09:23:46.612 level=info msg=Stopped controller: Cluster API 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azure exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azure infrastructure provider 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azureaso exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azureaso infrastructure provider 06-26 09:23:46.612 level=info msg=Local Cluster API system has completed operations 06-26 09:23:46.612 [[1;31mERROR[0;39m] Installation failed with error code '4'. Aborting execution. From above log, Azure Resource Management API endpoint is not correct, endpoint "management.azure.com" is for Azure Public cloud, the expected one for Azure Government should be "management.usgovcloudapi.net".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. Install cluster on Azure Government Cloud, CAPI-based installation
Actual results:
Installation failed because of the wrong Azure Resource Management API endpoint used.
Expected results:
Installation succeeded.
Additional info:
Description of problem:
CAPZ creates an empty route table during installs
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Very
Steps to Reproduce:
1. Install IPI cluster using CAPZ
Actual results:
Empty route table created and attached to worker subnet
Expected results:
No route table created
Additional info:
Epic Goal*
There was an epic / enhancement to create a cluster-wide TLS config that applies to all OpenShift components:
https://issues.redhat.com/browse/OCPPLAN-4379
https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/tls-config.md
For example, this is how KCM sets --tls-cipher-suites and --tls-min-version based on the observed config:
https://issues.redhat.com/browse/WRKLDS-252
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/506/files
The cluster admin can change the config based on their risk profile, but if they don't change anything, there is a reasonable default.
We should update all CSI driver operators to use this config. Right now we have a hard-coded cipher list in library-go. See OCPBUGS-2083 and OCPBUGS-4347 for background context.
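For reference, a cluster admin sets the cluster-wide profile on the APIServer config resource, and the operators observe it. A minimal sketch (the cipher list shown is illustrative, not a recommendation):
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
      - ECDHE-ECDSA-AES128-GCM-SHA256
      - ECDHE-RSA-AES128-GCM-SHA256
      - ECDHE-ECDSA-AES256-GCM-SHA384
      - ECDHE-RSA-AES256-GCM-SHA384
      minTLSVersion: VersionTLS12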
Why is this important? (mandatory)
This will keep the cipher list consistent across many OpenShift components. If the default list is changed, we get that change "for free".
It will reduce support calls from customers and backport requests when the recommended defaults change.
It will provide flexibility to the customer, since they can set their own TLS profile settings without requiring code change for each component.
Scenarios (mandatory)
As a cluster admin, I want to use TLSSecurityProfile to control the cipher list and minimum TLS version for all CSI driver operator sidecars, so that I can adjust the settings based on my own risk assessment.
Dependencies (internal and external) (mandatory)
None, the changes we depend on were already implemented.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We want to stop building the kube-proxy image out of the openshift-sdn repo, and start building it out of the openshift/kubernetes repo along with the other kubernetes binaries.
Networking Definition of Planned
Epic Template descriptions and documentation
openshift-sdn is no longer part of OCP in 4.17, so remove references to it in the networking APIs.
Consider whether we can remove the entire network.openshift.io API, which will now be no-ops.
In places where both sdn and ovn-k are supported, remove references to sdn.
In some places (notably the migration API), we will probably leave an API in place that currently has no purpose.
Additional information on each of the above items can be found here: Networking Definition of Planned
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there are a number of regulatory (ITAR) and operational constraints customers face that prohibit the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Currently, the `master.ign` contains the URL from which to download the actual Ignition. On cloud platforms, this value is:
"source":"https://api-int.<cluster domain>:22623/config/master"
Update this value with the API-Int LB IP when custom-dns is enabled on the GCP platform.
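For illustration (the real master.ign is JSON; it is rendered as YAML here for readability, and the version and IP are placeholders), the change amounts to swapping the api-int hostname in the merge source for the internal API load balancer IP:
ignition:
  version: 3.2.0
  config:
    merge:
    - source: https://10.0.0.5:22623/config/master   # instead of https://api-int.<cluster domain>:22623/config/master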
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This feature is to track automation in ODC, related packages, upgrades, and some tech debt.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | No |
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
This won't impact documentation; this feature mostly enhances end-to-end tests and CI job runs.
Questions to be addressed:
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Alternate scenario
#acm-1290-rename-local-cluster
Remove hard-coded local-cluster from the import local cluster feature and verify that we don't use it in the infrastructure operator
Testing the import local cluster and checking the behavior after the upgrade.
Yes.
No
No
Presently the name of the local-cluster is hardwired to "local-cluster" in the local cluster import tool.
It is possible to redefine the name of the "local-cluster" in ACM, in which case the correct local-cluster name needs to be picked up from the ManagedCluster and used.
Suggested approach
1: Obtain the correct "local-cluster" name from the ManagedCluster CR that has been labelled as "local-cluster"
2: Use this name to import the local cluster, annotate the created AgentServiceConfig, ClusterDeployment and InfraEnv as a "local cluster"
3: Handle any updates to ManagedCluster to keep the name in sync.
4: During deletion of local cluster CRs, this annotation may be used to identify CRs to be deleted.
This will leave an edge case: there will be an AgentServiceConfig, ClusterDeployment and InfraEnv "left behind" for any users who have renamed their ManagedCluster and then performed an upgrade to this new version. Those users will need to manually remove these CRs. (I will discuss further with ACM to determine a suitable course of action here.)
This makes the following assumptions, which should also be checked with the ACM team.
1: ACM users may rename their "local-cluster" in ACM (meaning that we should pick this change up)
2: ACM will use the label "local-cluster" in the ManagedCluster CR to signify a local cluster
3: There will only be one "local-cluster" in ACM (note that it's possible to add a label arbitrarily so this may not be properly enforceable.)
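A minimal sketch of assumptions 1 and 2 above: the hub's own ManagedCluster carries the "local-cluster" label, and the import flow reads its metadata.name instead of assuming "local-cluster" (the label value shown is an assumption):
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: my-renamed-hub          # the name to use when importing the local cluster
  labels:
    local-cluster: "true"       # label used to identify the hub's own ManagedCluster
spec:
  hubAcceptsClient: true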
Requirement description:
As a VM Admin, I want to improve overall density. In our traditional VM environments, we find that we are memory bound much more than CPU bound. Even with properly sized VMs, we see a lot of memory just sitting around allocated to the VM, but not actually used. Moreover, we always see people requesting VMs that are sized way too big for their workloads. It is better customer service to allow this to some degree and then recover the memory at the hypervisor level.
MVP:
Documents:
Prometheus query for UI:
sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) *100
In human words: this approximates how much over-commitment of memory is taking place. A value of 100 means RAM+SWAP usage is 100% of system RAM capacity; 105% means RAM+SWAP usage is 105% of system RAM capacity.
Threshold: Yellow 95%, Red 105%
Based on: https://docs.google.com/document/d/1AbR1LACNMRU2QMqFpe-Se2mCEFLMqW_M9OPKh2v3yYw,
https://docs.google.com/document/d/1E1joajwxQChQiDVTsr9Qk_iIhpQkSI-VQP-o_BMx8Aw
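A hedged sketch of how the query and thresholds above might be wired into alerting rules (the rule name, alert names, and namespace are assumptions, not an agreed design):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-overcommit-alerts      # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: memory-overcommit
    rules:
    - alert: MemoryOvercommitWarning  # yellow threshold from above
      expr: sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) * 100 > 95
      for: 15m
      labels:
        severity: warning
    - alert: MemoryOvercommitCritical # red threshold from above
      expr: sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) * 100 > 105
      for: 15m
      labels:
        severity: critical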
Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.
Provide a network solution working out of the box, meeting expectations of a typical VM workload.
Primary user-defined networks can be managed from the UI and the user flow is seamless.
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
The dev console page displays fewer dashboards than the admin version of the page, so that difference will need to be supported by monitoring-plugin.
The admin console's silences page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-561 are completed.
Proposed title of this feature request
Fleet / Multicluster Alert Management User Interface
What is the nature and description of the request?
Large enterprises are drowning in cluster alerts.
side note: Just within my demo RHACM Hub environment, across 12 managed clusters (OCP, SNO, ARO, ROSA, self-managed HCP, xKS), I have 62 alerts being reported! And I have no idea what to do about them!
Customers need the ability to interact with alerts in a meaningful way, to leverage a user interface that can filter, display, multi-select, sort, etc. To multi-select and take actions, for example:
Why does the customer need this? (List the business requirements)
Platform engineering (sys admin; SRE etc) must maintain the health of the cluster and ensure that the business applications are running stable. There might indeed be another tool and another team which focuses on the Application health itself, but for sure the platform team is interested to ensure that the platform is running optimally and all critical alerts are responded to.
As of today, what the customer must do is perform alert management via the CLI. This is tedious, ad hoc, and error-prone (see the blog link).
The requirements are:
List any affected packages or components.
OCP console Observe dynamic plugin
ACM Multicluster observability (MCO operator)
"In order to provide ACM with the same monitoring capabilities OCP has, we as the Observability UI Team need to allow the monitoring plugin to be installed and work in ACM environments."
Product Requirements:
UX Requirements:
In order for ACM to reuse the monitoring plugin, the plugin needs to connect to a different Alertmanager. It also needs a new column in the alerts list to show the source cluster these alerts are generated from.
Check the ACM documentation around alerts for reference: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/observability/observing-environments-intro#observability-arch
Placeholder feature for ccx-ocp-core maintenance tasks.
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
The Insights operator should replace the %s in https://console.redhat.com/api/gathering/v2/%s/gathering_rules error messages, like the failed-to-bootstrap example below:
$ jq -r .content osd-ccs-gcp-ad-install.log | sed 's/\\n/\n/g' | grep 'Cluster operator insights'
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%27REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED
level=info msg=Cluster operator insights Disabled is False with AsExpected:
level=info msg=Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules
level=info msg=Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet:
level=info msg=Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: {\"errors\":[{\"meta\":{\"response_by\":\"gateway\"},\"detail\":\"UHC services authentication failed\",\"status\":401}]}
Seen in 4.17 RCs. Also in this comment.
Unknown
Unknown.
ClusterOperator conditions talking about https://console.redhat.com/api/gathering/v2/%s/gathering_rules
URIs we expose in customer-oriented messaging should not have %s placeholders.
Seems like the template is coming in as conditionalGathererEndpoint here. Seems like insights-operator#964 introduced the %s, but I'm not finding the logic that's supposed to populate that placeholder.
Description of problem:
When the Insights Operator is disabled (as described in the docs here or here), the RemoteConfigurationAvailable and RemoteConfigurationValid clusteroperator conditions keep reporting the previous state (from before disabling the gathering), which might be Available=True and Valid=True.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Disable the data gathering in the Insights Operator following the docs links above
2. Watch the clusteroperator conditions with "oc get co insights -o json | jq .status.conditions"
Actual results:
Expected results:
Additional info:
The rapid recommendations enhancement defines this built-in configuration for the case when the operator cannot reach the remote endpoint.
The issue is that the built-in configuration (though currently empty) is not taken into account - i.e. the data requested in the built-in configuration is not gathered.
With the rapid recommendations feature (enhancement) one can request various messages from Pods matching various Pod name regular expressions
The problem is when there is a Pod (e.g. foo-1 from the example below) matching more than one requested Pod name regex:
{ 'namespace': 'test-namespace', 'pod_name_regex': 'foo-.*', 'messages': ['regex1', 'regex2'] },
{ 'namespace': 'test-namespace', 'pod_name_regex': 'foo-1', 'messages': ['regex3', 'regex4'] }
Assume Pods with the names foo-1 and foo-bar. Currently all the regexes (regex1, regex2, regex3, regex4) are applied to both Pods.
The desired behavior is that foo-1 is filtered with all the regexes, but foo-bar is filtered only with regex1 and regex2.
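To make the expected matching concrete, here is the desired pod-to-regex mapping for the example above, written out as data:
# Desired behavior for the example configuration above
expected_filtering:
  foo-1:     # matches both 'foo-.*' and 'foo-1'
  - regex1
  - regex2
  - regex3
  - regex4
  foo-bar:   # matches only 'foo-.*'
  - regex1
  - regex2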
Goal:
Track Insights Operator Data Enhancements epic in 2024
Description
We can remove all the hardcoded container log gatherers (except the conditional ones) in favor of the Rapid Recommendations approach. They can be removed in the 4.18 version.
Context:
As we discussed in INSIGHTOCP-1814, this is a good candidate that can help customers fix issues caused by too many unused MachineConfigs.
Required Data:
The total number of MachineConfigs in the cluster and the number of unused MachineConfigs in the cluster.
Backports:
To the OCP versions we support.
Proposed title of this feature request
The container scanner aims to gather the data necessary for business analytics of the usage of the RH Middleware portfolio in the live fleet.
The request includes assistance with onboarding the container scanner and help bringing it up to Insights Operator standards. GA quality requires performance and scalability QE on top of the functional testing alone.
Enhancement proposal tracked at: https://github.com/openshift/enhancements/pull/1584/files
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Provisioning bootstrap and control plane machines using CAPI.
RHCOS Image Preparation as Pre-Infrastructure Provisioning Task
InfraReady (post infrastructure) Provisioning
Hosted Control Planes and HyperShift provide consumers with a different architectural path to OpenShift that aligns best with their multi-cluster deployment needs. However, today's API surface area in HCP remains "like a box of chocolates; you never know what you're gonna get" (Forrest Gump), sometimes gated best-effort via the `hcp` cli, which is suboptimal.
The goal of this feature is to build a standard for communicating features that are GA/Preview. This would allow us:
This can be done following the guidelines in the FeatureGate FAQ. For example, by introducing a structured system of feature gates in our hosted control plane API, such that features are categorized into 'on-by-default', 'accessible-by-default', 'inaccessible-by-default or TechPreviewNoUpgrade', and 'Tech Preview', we would be ensuring clarity, compliance, and a smooth development and user experience.
There are other teams (e.g., the assisted installer team) following a structured pattern for gating features:
Currently there is no rigorous technical mechanism to feature gate functionality or APIs in HyperShift.
We defer to docs, which results in bad UX, consumer confusion, and a maintainability burden.
We should have a technical implementation that allows features and APIs to run only behind a flag.
As a cluster-admin, I want to run updates in discrete steps, updating the control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.
Background:
This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of done tasks.
These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:
Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".
Definition of done:
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal
Use scenarios
Why is this important
Requirement | Notes |
---|---|
OCI Bare Metal Shapes must be certified with RHEL | They must also work with RHCOS (see iSCSI boot notes), as OCI BM standard shapes require RHCOS iSCSI boot. Certified shapes: https://catalog.redhat.com/cloud/detail/249287 |
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. | Oracle will do these tests. |
Updating Oracle Terraform files | |
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. | Support Oracle Cloud in Assisted-Installer CI: |
RFEs:
Any bare metal Shape to be supported with OCP has to be certified with RHEL.
From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and removing this restriction in the future.
As of Aug 2023 this excludes at least all the Standard shapes, as well as BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
To make iSCSI work, a secondary VNIC must be configured during discovery, and again when the machine reboots into CoreOS. The configuration is almost the same for discovery and CoreOS.
Currently, we have one script owned by Red Hat for discovery, and a custom manifest owned by Oracle for the CoreOS configuration.
I think this configuration should be owned by Oracle, because the network configuration depends on the OCI API. Also, we need this script to be the same in order to ensure that the configuration applied during discovery will be identical when the machine reboots into CoreOS. Finally, if a customer has a specific need, they won't be able to tailor the configuration to their needs easily, as they would have to use the REST API of the assisted service.
My suggestion is to ask Oracle to drop the configuration script into their metadata service using Oracle's Terraform template. On the Red Hat side, we would pull this script onto the node and execute it via a systemd unit. The same would be done from the custom manifest provided by Oracle.
During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable it for OCP version >= 4.15 when using the OCI external platform.
iSCSI boot is enabled for OCP version >= 4.15 both in the UI and the backend.
When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1 ip=ibft` kargs during install to enable iSCSI booting.
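For illustration only, the same kernel arguments could be expressed declaratively, e.g. as a MachineConfig; this sketch just shows the kargs in context and is not the mechanism the assisted installer uses (it injects them at install time), and the resource name is hypothetical:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-iscsi-kargs          # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - rd.iscsi.firmware=1               # read the iSCSI target from the iBFT firmware table
    - ip=ibft                           # configure networking from iBFT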
yes
PR https://github.com/openshift/assisted-service/pull/6257 must be adapted to be used along with the external platform.
Since we ensure that the iSCSI network is not the default route, the PR above will automatically select the subnet used by the default route.
The secondary VNIC must be configured manually in OCI; a script must be injected into the discovery ISO to configure it.
We are planning to support 5-node control planes to cover a set of active-active failure domains for OpenShift control planes (see OCPSTRAT-1199).
The Agent-Based Installer is required to enable this setup on day-1.
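A minimal sketch of what the day-1 request could look like in an Agent-Based Installer install-config.yaml, assuming the new topology simply raises the accepted control plane replica count (cluster name and counts are illustrative):
apiVersion: v1
metadata:
  name: example-cluster
controlPlane:
  name: master
  replicas: 5              # 4/5-node control plane spread across active-active failure domains
compute:
  - name: worker
    replicas: 0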
For additional context of the 5-node and 2-node control plane model please read:
We are planning to support 4/5-node control planes to cover a set of active-active failure domains for OpenShift control planes (see OCPSTRAT-1199).
Assisted Installer must support this new topology too.
For additional context of the 5-node and 2-node control plane model please read:
Currently, in HA clusters, assisted-service enforces exactly 3 control plane nodes. This issue should change that behaviour to allow 3-5 control plane nodes instead. It was decided in https://redhat-internal.slack.com/archives/G01A5NB3S6M/p1728296942806519?thread_ts=1727250326.825979&cid=G01A5NB3S6M that there will be no fail mechanism to continue with the installation in case one of the control plane nodes fails to install. This issue should also align assisted-service behaviour with marking control plane nodes as schedulable if there are fewer than 2 workers in the cluster, and not otherwise. It should also align assisted-service behaviour with failing the installation if the user asked for at least 2 workers and got fewer.
As a user, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.
As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.
Network Policy has its issues:
With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.
Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and allow pods to connect to different types of networks that are independent of networks that other pods may connect to.
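A rough sketch of what a namespaced primary network could look like as a user-defined network resource; the API group, kind, and field names below are assumptions for illustration and may differ from the final design:
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: tenant-net
  namespace: tenant-a
spec:
  topology: Layer2
  layer2:
    role: Primary              # pods in this namespace attach to this network as their primary network
    subnets:
      - 192.168.100.0/24       # may overlap with subnets used by other tenants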
Test scenarios:
crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.
The benefits of crun are covered here: https://github.com/containers/crun
FAQ.: https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit
Note: making crun the default does not mean we will remove support for runc, nor do we have any plans to do so in the foreseeable future.
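For clusters that want to stay on runc (or explicitly pin a runtime), a ContainerRuntimeConfig can select the runtime per pool. A minimal sketch, assuming the existing defaultRuntime field; the resource name is hypothetical:
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: keep-runc                 # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    defaultRuntime: runc          # override the new crun default back to runc for this pool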
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Check with ACS team; see if there are external repercussions.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Per OCPSTRAT-1278, we want to support OCP on the C3 instance type (bare metal) in order to enable OCP Virt on GCP. The C3 instance type supports hyperdisk-balanced disks.
The goal is to validate that our GCP CSI operator can deploy the driver on C3 bare metal nodes and that it functions as expected.
As OCP Virt requires RWX to support VM live migration, we need to make sure the driver works with this access mode with volumeType block.
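A minimal sketch of the kind of volume the virt use case needs; the storage class name is an assumption for illustration:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-rwx
spec:
  accessModes:
    - ReadWriteMany                      # required for VM live migration
  volumeMode: Block                      # raw block volume, as used by OCP Virt
  storageClassName: hyperdisk-balanced   # assumed storage class backed by the GCP PD CSI driver
  resources:
    requests:
      storage: 50Gi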
Why is this important? (mandatory)
Product-level priority to enable OCP Virt on GCP. Multiple customers are waiting for this solution. See OCPSTRAT-1278 for additional details.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
PD CSI driver to support baremetal / C3 instance type
PD CSI driver to support block RWX
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
GCP PD CSI on C3 nodes passes the regular CSI tests + RWX with volumeType block. Actual VM live migration tests will be done by the virt team.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
iSCSI boot is supported in RHEL and since the implementation of OCPSTRAT-749 it's also available in RHCOS.
Customers require using this feature in different bare metal environments on-prem and cloud-based.
Assisted Installer implements support for it in Oracle Cloud Infrastructure (MGMT-16167) to support their bare metal standard "shapes".
This feature extends this support to make it generic and supported in the Agent-Based Installer, the Assisted Installer and in ACM/MCE.
Support iSCSI boot in bare metal nodes, including platform baremetal and platform "none".
Assisted installer can boot and install OpenShift on nodes with iSCSI disks.
Agent-Based Installer can boot and install OpenShift on nodes with iSCSI disks.
MCE/ACM can boot and install OpenShift on nodes with iSCSI disks.
The installation can be done on clusters with platform baremetal and clusters with platform "none".
Support booting from iSCSI using ABI starting OCP 4.16.
The following PRs are the gaps between release-4.17 branch and master that are needed to make the integration work on 4.17.
https://github.com/openshift/assisted-service/pull/6665
https://github.com/openshift/assisted-service/pull/6603
https://github.com/openshift/assisted-service/pull/6661
The feature has to be backported to 4.16 as well. TBD - list all the PRs that have to be backported.
Instructions to test the AI feature with local env - https://docs.google.com/document/d/1RnRhJN-fgofnVSBTA6mIKcK2_UW7ihbZDLGAVHSdpzc/edit#heading=h.bf4zg53460gu
Add new systemd services ( already available in Assisted service) into ABI to enable iSCSI boot
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Prerequisite work: goals completed in OCPSTRAT-1122.
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase 1, incorporating the assets from different repositories to simplify asset management.
Phases 1 & 2 cover implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps
Once the new CAPI manifests generator tool is ready, we want to make use of it directly from the CAPI Provider repositories so we can avoid storing the generated configuration centrally and can independently apply it based on the running platform.
As an OpenShift developer, I want to implement an MCO ctrcfg runtime controller that watches ImagePolicy resources. The controller will update the sigstore verification file that crio's --signature-policy-dir uses for namespaced policies.
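A rough sketch of a namespaced ImagePolicy the controller would watch; field names follow the TechPreview sigstore API as best understood and should be treated as approximate, and the scope and key are placeholders:
apiVersion: config.openshift.io/v1alpha1
kind: ImagePolicy
metadata:
  name: example-policy
  namespace: my-namespace
spec:
  scopes:
    - quay.io/example/app                      # images this policy applies to
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: <base64-encoded cosign public key>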
The goals of this feature are:
Given Microsoft's constraints on IPv4 usage, there is a pressing need to optimize IP allocation and management within Azure-hosted environments.
Interoperability Considerations
There are currently multiple ingress strategies we support for hosted cluster service endpoints (kas, nodePort, router...).
In a context of uncertainty about which use cases would be most critical to support, we initially exposed this in a flexible API that allows choosing potentially any combination of ingress strategies and endpoints.
ARO has internal restrictions on IPv4 usage. Because of this, to simplify the above, and to be more cost effective in terms of infra, we want to have a common shared ingress solution for the whole fleet of hosted clusters.
As a management cluster owner I want to make sure the shared ingress is resilient to cluster failures
Currently the SharedIngress controller waits for a HostedCluster to exist before creating the Service/LoadBalancer of the shared-ingress.
The controller should create the Service/LoadBalancer even
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Our goal is to be able to deploy baremetal clusters using Cluster API in Openshift.
Metal3, our upstream community, already provides a CAPI provider, and our aim is to bring it downstream.
We will collaborate with the Cluster Infrastructure team on points of integration as needed.
Scope questions
Firmware (BIOS) updates and attribute configuration from OpenShift are key in O-RAN clusters. While we can do this on day 1, customers also need to set firmware attributes on hosts that have already been deployed and are part of a cluster.
This feature adds the capability of updating firmware attributes and updating the firmware image for hosts in deployed clusters.
As part of demoing our integration with hardware vendors, we need to show the ability to reconfigure already provisioned hosts: modify their BIOS settings and, in the future, do firmware upgrades. The initial demo will be concentrated on BIOS settings. The demo is expected to be based on 4.15 and to use unmerged patches since 4.15 is closed for feature development. The path to productization will be determined as an outcome of the demo.
The assumed end result is an ability to run firmware upgrades and update BIOS settings for hosts that are already provisioned without fully deprovisioning them. The hosts will still be rebooted, so some external orchestrator (a human or ZTP) will need to drain the nodes first.
1. Pre-installation:
2. Installation:
3. Update:
4. Uninstallation/Deletion:
5. Disconnected Environments for High-Security Workloads:
6. [Tech Preview] Signature Validation for Secure Workflows:
All the expected user outcomes and the acceptance criteria in the engineering epics are covered.
OLM: Gateway to the OpenShift Ecosystem
Operator Lifecycle Manager (OLM) has been a game-changer for OpenShift Container Platform (OCP) 4. Since its launch in 2019, OLM has fostered a rich ecosystem, expanding from a curated set of 25 operators to over 100 officially supported Red Hat operators and hundreds more from certified ISVs and the community.
OLM empowers users to manage diverse technologies with ease, including ACM, ACS, Quay, GitOps, Pipelines, Service Mesh, Serverless, and Virtualization. It has also facilitated the introduction of groundbreaking operators for entirely new workloads, like Nvidia GPU, PTP, Windows Machine Config, SR-IOV networking, and more. Today, a staggering 91% of our connected customers leverage OLM's capabilities.
OLM v0: A Stepping Stone
While OLM v0 has been instrumental, it has limitations. The API design, not fully GitOps-friendly or entirely declarative, presents a steeper learning curve due to its complexity. Furthermore, OLM v0 was designed with the assumption of namespace-scoped CRDs (Custom Resource Definitions), allowing for independent operator installations and parallel versions within a single cluster. However, this functionality never materialized in core Kubernetes, and OLM v0's attempt to simulate it has introduced limitations and bugs.
The Operator Framework Team: Building the Future
The Operator Framework team is the cornerstone of the OpenShift ecosystem. They build and manage OLM, the Operator SDK, operator catalog formats, and tooling (opm, file-based catalogs). Their work directly impacts how operators are developed, packaged, delivered, and managed by users and SRE teams on OpenShift clusters.
A Streamlined Future with OLM v1
The Operator Framework team has undergone significant restructuring to focus on the next generation of OLM – OLM v1. This transition includes moving the Operator SDK to a feature-complete state with ongoing maintenance for compatibility with the latest Kubernetes and controller-runtime libraries. This strategic shift allows the team to dedicate resources to completely revamping OLM's API and management concepts for catalog content delivery.
Leveraging learnings and customer feedback since OCP 4's inception, OLM v1 is designed to be a major overhaul, and it will be shipped as a Generally Available (GA) feature in OpenShift 4.17.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
1. Pre-installation:
2. Installation:
3. Update:
4. Uninstallation/Deletion:
1. Pre-installation:
2. Installation:
3. Update:
4. Uninstallation/Deletion:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Typically, any non-deployment resource managed by cluster-olm-operator would be handled by a StaticResourceController (usage ref). Unfortunately, the StaticResourceController only knows how to handle specific types, as seen by the usage of the ApplyDirectly function in the StaticResourceController.Sync method. Due to the ApplyDirectly function only handling a set of known resources, the ClusterCatalog resource would likely not be handled the same as other static manifests currently managed by cluster-olm-operator.
In order to enable cluster-olm-operator to properly manage ClusterCatalog resources, it is proposed that we implement a custom factory.Controller that knows how to appropriately apply and manage ClusterCatalog resources such that:
The openshift/library-go project has a lot of packages that will likely make this implementation pretty straightforward. The custom controller implementation will likely also require implementation of some pre-condition logic that ensures the ClusterCatalog API is available on the cluster before attempting to use it.
Downstream change to add a kustomize overlay for a hostPath volume mount of /etc/containers.
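A minimal sketch of what such an overlay patch could add to the deployment; the deployment and container names are assumed for illustration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller            # hypothetical deployment patched by the overlay
spec:
  template:
    spec:
      containers:
        - name: manager
          volumeMounts:
            - name: etc-containers
              mountPath: /etc/containers
              readOnly: true
      volumes:
        - name: etc-containers
          hostPath:
            path: /etc/containers      # host policy directory made visible to the container
            type: Directory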
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Create a test that makes sure that the pre-defined, default, cluster catalogs are defined and are in a good state.
Create a test that builds upon the catalogd happy path by creating a manifest image, then updating the ClusterCatalog to reference that image, and then creating a ClusterExtension to deploy the manifests.
The status of the ClusterExtension should then be checked.
The manifests do not need to create a deployment, in fact it would be better if the manifest included simpler resources such as a configmap or secret.
This will create the initial openshift/origin tests. These will consist of tests that ensure, while in tech preview, that the ClusterExtension and ClusterCatalog APIs are present. This includes creating an OWNERS file that will make approving/reviewing future PRs easier.
Test 1:
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
    olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.17"}]'
Note the value needs to be equal to the cluster version this is being tested on.
Test 2
Same as test 1 but with two bundles. Message should have names in alphabetical order.
Test 3
Apply a bundle without the annotation. Upgradeable should be True.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
https://docs.google.com/document/d/18m-OG0PN8-jjjgGT33WNujzmj_1B2Tqoqd-bVKX4CkE/edit?usp=sharing
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Refactor cluster-olm-operator to use v1 of the OLM openshift/api/operator API
A/C:
- cluster-olm-operator now uses OLM v1
- OLM resource manifest updated to use v1
- CI is green
OpenShift offers "capabilities" to allow users to select which components to include in the cluster at install time.
It was decided the capability name should be: OperatorLifecycleManagerV1 [ref
A/C:
- ClusterVersion resource updated with OLM v1 capability
- cluster-olm-operator manifests updated with the capability.openshift.io/name=OperatorLifecycleManagerV1 annotation (see the annotated manifest sketch below)
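For illustration, a manifest shipped by cluster-olm-operator would carry the capability annotation roughly like this; the resource kind and name are chosen arbitrarily for the sketch:
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cluster-olm-operator
  annotations:
    capability.openshift.io/name: OperatorLifecycleManagerV1   # ties the manifest to the OLM v1 capability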
Promote OLM API in the OpenShift API from v1alpha1 to v1 (see https://github.com/openshift/api/blob/master/operator/v1alpha1/types_olm.go#L1)
A/C:
- openshift/api/operator/v1alpha1 OLM promoted to v1
- openshift/api/operator/v1alpha1 OLM removed
As someone troubleshooting an OLMv1 issue with a cluster, I'd like to be able to see the state of cluster-olm-operator and the OLM resource, so that I can have all the information I need to fix the issue.
A/C:
- must-gather contains cluster-olm-operator namespace and contained resources
- must-gather contains OLM cluster scoped resource
- if cluster-olm-operator fails before updating its ClusterOperator, I'd still want the cluster-olm-operator namespace, its resources, and the cluster-scoped OLM resource to be in the must-gather
Networking Definition of Planned
Epic Template descriptions and documentation
Additional information on each of the above items can be found here: Networking Definition of Planned
...
1.
...
1. …
1. …
Goal:
Update team owned repositories to Kubernetes v1.31
?? is the 1.31 freeze
?? is the 1.31 GA
Problem:<please update links for 1.31>
The following repository must be rebased onto the latest version of Kubernetes:
The following repositories should be rebased onto the latest version of Kubernetes:
Entirely remove dependencies on k/k repository inside oc.
Why is this important:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4561
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As an MCO developer, I want to pick up the openshift/kubernetes updates for the 1.31 k8s rebase so that the MCO tracks the same k8s version as the rest of the OpenShift cluster.
As part of our continuous improvement efforts, we need to update our Dockerfile to utilize the new multi-base images provided in OpenShift 4.18. The current Dockerfile is based on RHEL 8 and RHEL 9 builder images from OpenShift 4.17, and we want to ensure our builds are aligned with the latest supported images, for multiple architectures.
Updating the RHEL 9 builder image to
registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.22-builder-multi-openshift-4.18
Updating the RHEL 8 builder image to
registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.22-builder-multi-openshift-4.18
Updating the base image to
registry.ci.openshift.org/ocp-multi/4.18-art-latest-multi:machine-config-operator
or specifying a different tag if we don't want to only do the MCO
Ensuring all references and dependencies in the Dockerfile are compatible with these new images.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
Epic Goal*
Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.18 cannot be released without Kubernetes 1.31
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
Retro: Kube 1.31 Rebase Retrospective Timeline (OCP 4.18)
Retro recording: https://drive.google.com/file/d/1htU-AglTJjd-VgFfwE3z_dH5tKXT1Tes/view?usp=drive_web
Description of problem:
Given 2 images with different names, but same layers, "oc image mirror" will only mirror 1 of them. For example:
$ cat images.txt
quay.io/openshift/community-e2e-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
quay.io/openshift/community-e2e-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
$ oc image mirror -f images.txt
quay.io/
  bertinatto/test-images
    manifests:
      sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 -> e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
  stats: shared=0 unique=0 size=0B
phase 0:
  quay.io bertinatto/test-images blobs=0 mounts=0 manifests=1 shared=0
info: Planning completed in 2.6s
sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
info: Mirroring completed in 240ms (0B/s)
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Only one of the images was mirrored.
Expected results:
Both images should be mirrored.
Additional info:
This PR https://github.com/openshift/origin/pull/29141 loosens the check to ignore the warning message in the output in order to unblock https://github.com/openshift/oc/pull/1877. Once the required PRs are merged, we should revert back to `o.Equal` again. This issue is created to track this work.
Following the recent changes in the CRD schema validation (introduced in https://github.com/kubernetes-sigs/controller-tools/pull/944), our tooling has identified several CRD violations in our APIs:
TechPreview clusters are unable to bootstrap because kube-apiserver fails to start with the following error:
E0827 20:29:22.653501 1 run.go:72] "command failed" err="group version resource.k8s.io/v1alpha2 that has not been registered"
This happens because, in Kubernetes 1.31, the group version resource.k8s.io/v1alpha2 was removed and replaced with resource.k8s.io/v1alpha3. This is part of the DynamicResourceAllocation feature, which is currently TechPreview.
After discussing this with the team, we decided that the best approach is to modify the cluster-kube-apiserver-operator to start the kube-apiserver with the correct group version based on the Kubernetes version being used.
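Conceptually, the operator would render the kube-apiserver arguments with the group version that matches the embedded Kubernetes level; this fragment is an illustrative sketch of the rendered config shape, not the operator's exact output:
apiServerArguments:
  runtime-config:
    - resource.k8s.io/v1alpha3=true     # with Kubernetes 1.31; older levels would use v1alpha2
  feature-gates:
    - DynamicResourceAllocation=true    # only set on TechPreview clusters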
As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new command `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command output attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We utilize MCO annotations to determine whether a node is degraded or unavailable, and we only source the Reason annotation to populate the insight. Many common cases are not covered by this, especially the unavailable ones: nodes can be cordoned, have a condition like DiskPressure, be in the process of termination, etc. We are not sure whether our code or something like the MCO should provide this, but we captured it as a card for now.
An update is in progress for 28m42s: Working towards 4.14.1: 700 of 859 done (81% complete), waiting on network
= Control Plane =
...
Completion: 91%
1. Inconsistent info: CVO message says "700 of 859 done (81% complete)" but control plane section says "Completion: 91%"
2. Unclear measure of completion: the CVO message counts manifests applied, while the control plane section says "Completion: 91%", which counts upgraded COs. Neither message states what it counts. Manifest count is an internal implementation detail which users likely do not understand. COs are less so, but we should be clearer about what the completion means.
3. We could take advantage of this line and communicate progress with more details
We'll only remove the CVO message once the rest of the output functionally covers it, so the inconsistency stays until OTA-1154. Otherwise:
= Control Plane =
...
Completion: 91% (30 operators upgraded, 1 upgrading, 2 waiting)
Upgraded operators are COs that have updated their version, no matter their conditions.
Upgrading operators are COs that have not updated their version and are Progressing=True.
Waiting operators are COs that have not updated their version and are Progressing=False.
During an upgrade, once control plane is successfully updated, status items related to that part of the upgrade cease to be relevant, and therefore we can either hide them entirely, or we can show a simplified version of them. The relevant sections are Control plane and Control plane nodes.
As an OTA engineer,
I would like to make sure the node in a single-node cluster is handled correctly in the upgrade-status command.
Context:
According to the discussion with the MCO team,
the node is in the master MCP but not in the worker MCP.
This card is to make sure that the node is displayed that way too. My feeling is that the current code probably does the job already. In that case, we should add test coverage for this case to avoid regression in the future.
AC:
Address performance and scale issues in Whereabouts IPAM CNI
Whereabouts is becoming increasingly popular for use on workloads that operate at scale. Whereabouts was originally built as a convenience function for a handful of IPs; however, more and more customers want to use Whereabouts in scale situations.
Notably, this applies to telco and AI/ML scenarios. Some AI/ML scenarios launch a large number of pods that need to use secondary networks for related traffic.
Upstream collaboration outline
This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.
To test:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.
As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.
As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up
(MCO-770, MCO-578, MCO-574 )
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.
Maybe:
Entitlements: MCO-1097, MCO-1099
Not Likely:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.
Currently, we are using bare Pod objects for our image builds. While this works, it makes adding retry logic and other features much more difficult, since we would have to implement that logic ourselves. Instead, we should use Kubernetes Job objects.
Jobs have built-in mechanisms for retrying, exponential backoff, concurrency controls, etc. This frees us from having to implement complicated retry logic for build failures beyond our control such as pod evictions, etc.
Done When:
The Insights Operator syncs the customer's Simple Content Access certificate to the etc-pki-entitlement secret in the openshift-config-managed namespace every 8 hours. Currently, the user is expected to clone this secret into the MCO namespace, prior to initiating a build if they require this cert during the build process. We'd like this step automated so that user does not have to do this manual step.
Whenever a must-gather is collected, it includes all of the objects at the time of the must-gather creation. Right now, must-gathers do not include MachineOSConfigs and MachineOSBuilds, which would be useful to have for support and debugging purposes.
Done When:
Currently, it is not possible for cluster admins to revert from a pool that is opted into on-cluster builds and layered MachineConfig updates. See https://issues.redhat.com/browse/OCPBUGS-16201 for details around what happens.
It is worth mentioning that this is mostly an issue for UPI (user provided infrastructure) / bare metal users of OpenShift. For IPI cases on AWS / GCP / Azure / et al., one can simply delete the node and the machine, which will cause the Machine API to provision a fresh node to replace it, e.g.:
#!/bin/bash
node_name="$1"
node_name="${node_name/node\//}"
machine_id="$(oc get "node/$node_name" -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}')"
machine_id="${machine_id/openshift-machine-api\//}"
oc delete --wait=false "machine/$machine_id" -n openshift-machine-api
oc delete --wait=false "node/$node_name"
Done When
Description of problem:
When we create a MOSC to enable OCL in a pool, and then we delete the MOSC resource to revert it, the MOSB and CMs are garbage collected, but we need to wait a long and unpredictable time until the nodes are updated with the new config.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                                                 AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-10-15-080246-ci-ln-0gsqflb-latest   True        False         8h      Cluster version is 4.18.0-0.test-2024-10-15-080246-ci-ln-0gsqflb-latest
How reproducible:
Always
Steps to Reproduce:
1. Create a MOSC to enable OCL in the worker pool
2. Wait until the new OCL image is applied to all worker nodes
3. Remove the MOSC resource created in step 1
Actual results:
MOSB and CMs are cleaned up, but the nodes are not updated. After a random amount of time (somewhere around 10-20 minutes), the nodes are updated.
Expected results:
There should be no long pause between the deletion of the MOSC resource and the beginning of the nodes update process.
Additional info:
As a workaround, if we add any label to the worker pool to force a sync operation the worker nodes start updating immediately.
Description of problem:
When OCL is configured in a cluster using a proxy configuration, OCL is not using the proxy to build the image.
Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.8   True        False         5h14m   Cluster version is 4.16.0-rc.8
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster that uses a proxy and cannot access the internet except through this proxy. We can do it by using this flexy-install template, for example: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/5724d9c157d51f175069c5bf09be1872173d0167/functionality-testing/aos-4_16/ipi-on-aws/versioned-installer-customer_vpc-http_proxy-multiblockdevices-fips-ovn-ipsec-ci private-templates/functionality-testing/aos-4_16/ipi-on-aws/versioned-installer-customer_vpc-http_proxy-multiblockdevices-fips-ovn-ipsec-ci
2. Enable OCL in a machineconfigpool by creating a MOSC resource
Actual results:
The build pod will not use the proxy to build the image and it will fail with a log similar to this one:
time="2024-06-25T13:38:19Z" level=debug msg="GET https://quay.io/v1/_ping"
time="2024-06-25T13:38:49Z" level=debug msg="Ping https://quay.io/v1/_ping err Get \"https://quay.io/v1/_ping\": dial tcp 44.216.66.253:443: i/o timeout (&url.Error{Op:\"Get\", URL:\"https://quay.io/v1/_ping\", Err:(*net.OpError)(0xc000220d20)})"
time="2024-06-25T13:38:49Z" level=debug msg="Accessing \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883\" failed: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.216.66.253:443: i/o timeout"
time="2024-06-25T13:38:49Z" level=debug msg="Error pulling candidate quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.216.66.253:443: i/o timeout"
Error: creating build container: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 44.216.66.253:443: i/o timeout
time="2024-06-25T13:38:49Z" level=debug msg="shutting down the store"
time="2024-06-25T13:38:49Z" level=debug msg="exit status 125"
Expected results:
The build should be able to access the necessary resources by using the configured proxy
Additional info:
When verifying this ticket, we need to pay special attention to https proxies using their own user-ca certificate. We can use this flexy-install template: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/5724d9c157d51f175069c5bf09be1872173d0167/functionality-testing/aos-4_16/ipi-on-osp/versioned-installer-https_proxy private-templates/functionality-testing/aos-4_16/ipi-on-osp/versioned-installer-https_proxy
In this kind of cluster it is not enough to use the proxy to build the image; we also need to use the /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt file to be able to reach the yum repositories, since rpm-ostree will complain about an intermediate certificate (the one of the https proxy) being self-signed.
To test it we can use a custom Containerfile including something similar to:
RUN cd /etc/yum.repos.d/ && curl -LO https://pkgs.tailscale.com/stable/fedora/tailscale.repo && \
    rpm-ostree install tailscale && rpm-ostree cleanup -m && \
    systemctl enable tailscaled && \
    ostree container commit
BuildController is responsible for a lot of things. Unfortunately, it is very difficult to determine where and how BuildController does its job, which makes it more difficult to extend and modify as well as test.
Instead, it may be more useful to think of BuildController as the thing that converts MachineOSBuilds into build pods, jobs, et. al. Similar to how we have a subcontroller for dealing with build pods, we should have another subcontroller whose job is to produce MachineOSBuilds.
Done When:
Description of problem:
When OCL is enabled and we configure several MOSC resources for several MCPs, the MCD pods are restarted every few seconds. They should only be restarted once per MOSC; instead, they are continuously restarted.
Version-Release number of selected component (if applicable):
IPI on AWS version 4.17.0-0.test-2024-10-02-080234-ci-ln-2c0xsqb-latest
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview 2. Create 5 custom MCPs 3. Create one MOSC resource for each new MCP
Actual results:
MCD pods will be restarted every few seconds
$ oc get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
kube-rbac-proxy-crio-ip-10-0-31-199.us-east-2.compute.internal    1/1     Running   4          4h51m
kube-rbac-proxy-crio-ip-10-0-31-37.us-east-2.compute.internal     1/1     Running   4          4h43m
kube-rbac-proxy-crio-ip-10-0-38-189.us-east-2.compute.internal    1/1     Running   4          4h51m
kube-rbac-proxy-crio-ip-10-0-54-127.us-east-2.compute.internal    1/1     Running   3          4h43m
kube-rbac-proxy-crio-ip-10-0-69-126.us-east-2.compute.internal    1/1     Running   4          4h51m
machine-config-controller-d6bdf7d85-2wb22                         2/2     Running   0          113m
machine-config-daemon-d7t4d                                       2/2     Running   0          6s
machine-config-daemon-f7vv2                                       2/2     Running   0          12s
machine-config-daemon-h8t8z                                       2/2     Running   0          8s
machine-config-daemon-q9fhr                                       2/2     Running   0          10s
machine-config-daemon-xvff2                                       2/2     Running   0          4s
machine-config-operator-56cdd7f8fd-wlsdd                          2/2     Running   0          105m
machine-config-server-klggk                                       1/1     Running   1          4h48m
machine-config-server-pmx2n                                       1/1     Running   1          4h48m
machine-config-server-vwxjx                                       1/1     Running   1          4h48m
machine-os-builder-7fb58586bc-sq9rj                               1/1     Running   0          50m
Expected results:
MCD pods should only be restarted once for every MOSC
Additional info:
As an OpenShift cluster admin, I would like to try out on-cluster layering (OCL) to better understand how it works, how to set it up, and how to use it. To that end, a quick-start guide for what I need to do to get started as well as a troubleshooting guide would be indispensable.
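As a starting point for such a quick-start, opting a pool into OCL is driven by a MachineOSConfig resource; the sketch below is approximate, field names may differ between API versions, and the push spec and secret name are placeholders:
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker-ocl
spec:
  machineConfigPool:
    name: worker                                                      # pool being opted into on-cluster layering
  buildInputs:
    renderedImagePushspec: registry.example.com/ocl/worker-os:latest  # assumed registry/push spec
    renderedImagePushSecret:
      name: push-secret                                               # assumed secret name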
Done When:
Within BuildController, there is a lot of code concerned with creating all of the ephemeral objects for performing a build, converting secrets from one form to another, cleaning up after the build is completed, etc. Unfortunately, because of how BuildController is currently written, this code has become a bit unwieldy and difficult to modify and test. In addition, it is very difficult to reason about what is actually happening. Therefore, it should be broken up and refactored into separate modules within pkg/controller/build.
By doing this, we can have very high test granularity as well as tighter assertions for the places where it is needed the most while simultaneously allowing looser and more flexible testing for BuildController itself.
Done When:
The etcd backup API was delivered behind a feature gate in 4.14. This feature is to complete the work that allows any OCP customer to benefit from the automatic etcd backup capability.
The feature introduces automated backups of the etcd database and cluster resources in OpenShift clusters, eliminating the need for user-supplied configuration. This feature ensures that backups are taken and stored on each master node from the day of cluster installation, enhancing disaster recovery capabilities.
The current method of backing up etcd and cluster resources relies on user-configured CronJobs, which can be cumbersome and prone to errors. This new feature addresses the following key issues:
Complete work to auto-provision internal PVCs when using the local PVC backup option (right now, the user needs to create the PVC before enabling the service).
Out of Scope
The feature does not include saving cluster backups to remote cloud storage (e.g., S3 Bucket), automating cluster restoration, or providing automated backups for non-self-hosted architectures like Hypershift. These could be future enhancements (see OCPSTRAT-464)
Epic Goal*
Provide automated backups of etcd saved locally on the cluster on Day 1 with no additional config from the user.
Why is this important? (mandatory)
The current etcd automated backups feature requires some configuration on the user's part to save backups to a user specified PersistentVolume.
See: https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L46
Before the feature can be shipped as GA, we would require the capability to save backups automatically by default without any configuration. This would help all customers have an improved disaster recovery experience by always having a somewhat recent backup.
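For contrast, the current TechPreview flow needs a user-supplied config; a minimal sketch of the config.openshift.io Backup resource as it exists today (field names per the linked v1alpha1 API, values illustrative):
apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: default
spec:
  etcd:
    schedule: "0 */2 * * *"          # every two hours
    timeZone: UTC
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 5
    pvcName: etcd-backup-pvc          # today the user must pre-create this PVC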
Scenarios (mandatory)
Implementation details:
One issue we need to figure out during the design of this feature is how the current API might change as it is inherently tied to the configuration of the PVC name.
See:
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L99
and
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/operator/v1alpha1/types_etcdbackup.go#L44
Additionally we would need to figure out how the etcd-operator knows about the available space on local storage of the host so it can prune and spread backups accordingly.
Dependencies (internal and external) (mandatory)
Depends on changes to the etcd-operator and the tech preview APIs
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Upon installing a tech-preview cluster backups must be saved locally and their status and path must be visible to the user e.g on the operator.openshift.io/v1 Etcd cluster object.
An e2e test to verify that the backups are being saved locally with some default retention policy.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a developer, I want to add etcd-backup-server container within a separate deployment away from the etcd static pod.
As a developer, I want to add an e2e test for the etcd-backup-server sidecar container
As a developer, I want to add etcd backup pruning logic within the etcd-backup-server sidecar container
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
OpenShift is planning to ship all payload and layered product images signed consistently via cosign with OpenShift 4.17. oc-mirror should be able to leverage this to provide a seamless signature verification experience in an offline environment by automatically making all required signature artifacts available in the offline registry.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Overview
This task is really to ensure oc-mirror v2 has backward compatibility with what v1 was doing regarding signatures
Goal
Ensure the correct configmaps are generated and stored in a folder so that the user can deploy the related artifact/s to the cluster as in v1
As a user deploying OpenShift on bare metal I want the installer to use the NTP servers that I specify at install time.
When the Ironic pre-provisioning image containing IPA is running, there is no way to sync the clocks to a custom NTP server. This causes issues with certificates - IPA generates a certificate for itself to be valid starting 1 hour in the past (see OCPBUGSM-21571), so if the hardware clock is more than 1 hour ahead of the real time then the certificate will be rejected by Ironic.
A new field is required in install-config.yaml where the user can specify additional NTP servers that can then be used to set up a chrony config in the IPA ISO. (Potentially this could also be used to automatically generate the MachineConfig manifests to add the same config to the cluster.)
See initial discussion here: OCPBUGS-22957
When the Ironic pre-provisioning image containing IPA is running, there is no way to sync the clocks to a custom NTP server. This causes issues with certificates - IPA generates a certificate for itself to be valid starting 1 hour in the past (see OCPBUGSM-21571), so if the hardware clock is more than 1 hour ahead of the real time then the certificate will be rejected by Ironic.
A new field is required in install-config.yaml where the user can specify additional NTP servers that can then be used to set up a chrony config in the IPA ISO. (Potentially this could also be used to automatically generate the MachineConfig manifests to add the same config to the cluster.)
See initial discussion here: OCPBUGS-22957
Create an ICC patch that will read the new env variable for additional NTP servers and use it to create a chrony ignition file.
Create a CBO patch to add a field for additional NTP servers that will be passed to image customization.
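A minimal sketch of the ICC-side handling described above, assuming the additional NTP servers arrive as a comma-separated environment variable (the variable name ADDITIONAL_NTP_SERVERS and the iburst option are illustrative assumptions, not the actual implementation):

```go
// Minimal sketch: turn a comma-separated list of NTP servers from an
// environment variable into a chrony.conf snippet that could then be embedded
// into the IPA ignition. The variable name is hypothetical.
package main

import (
	"fmt"
	"os"
	"strings"
)

func chronyConf(servers []string) string {
	var b strings.Builder
	for _, s := range servers {
		fmt.Fprintf(&b, "server %s iburst\n", strings.TrimSpace(s))
	}
	return b.String()
}

func main() {
	raw := os.Getenv("ADDITIONAL_NTP_SERVERS") // hypothetical env variable
	if raw == "" {
		return // nothing to do, keep the default chrony config
	}
	conf := chronyConf(strings.Split(raw, ","))
	fmt.Print(conf) // in ICC this content would be written into an ignition storage file
}
```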
Feature description
oc-mirror v2 focuses on major enhancements that make oc-mirror faster and more robust, introduce caching, and address more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:
Check if it is possible to delete operators using the delete command when the previous command was mirror to mirror. Probably it won't work because in mirror to mirror the cache is not updated.
It is necessary to find a solution for this scenario.
oc-mirror should account for users who are relying on oc-mirror v1 in production and accommodate an easy migration:
The way of tagging images for releases, operators and additional images is different between v1 and v2. So it is necessary to have some kind of migration feature in order to enable customers to migrate from one version to the other.
Use cases:
The solution is still to be discussed.
Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.
Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.
We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus of serving Hosted Control Planes from the RHOSO management cluster.
As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.
As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
TBD
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
In OSASINFRA-3483, we modified openshift/cluster-storage-operator to integrate support for kustomize and provide the infrastructure to generate two sets of assets: one for standalone deployment, and one for hypershift deployment. In this story, we will track actually adding support for the latter.
In OSASINFRA-3610, we merged the openshift/csi-driver-manila-operator repository into openshift/csi-operator and modified it to take advantage of the new generator framework provided therein. Now, we want to build on this, adding Hypershift-specific assets and tweaking whatever else is needed.
In OSASINFRA-3608, we merged the openshift/openstack-cinder-csi-driver-operator repository into openshift/csi-operator and modified it to take advantage of the new generator framework provided therein. Now, we want to build on this, adding Hypershift-specific assets and tweaking whatever else is needed.
In OSASINFRA-3483, we modified openshift/cluster-storage-operator to integrate support for kustomize and provide the infrastructure to generate two sets of assets: one for standalone deployment, and one for hypershift deployment. In this story, we will track actually adding support for the latter.
We want to prepare cluster-storage-operator for eventual Hypershift integration. To this end, we need to migrate the assets and references to same to integrate kustomize. This will likely look similar to https://github.com/openshift/cluster-storage-operator/pull/318 once done (albeit, without the Hypershift work).
This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.
We don't need to create another service for Ingress, so we can save a FIP.
Matthew Booth is worried about the feature we added to pre-create a FIP and assign it to the Service object for router-default. This is indeed racy and could be problematic if another controller took over that field as well; it would create infinite loops and the result wouldn't be great for customers.
The idea is to remove that feature now and eventually add it back later when it's safer (e.g. feature added to the Ingress operator?). It's worth noting that core kubernetes has deprecated the loadBalancerIP field in the Service object, and it now works with annotations. Maybe we need to investigate that path.
Right now, our pods are SingleReplica because having multiple replicas requires more than one zone for nodes, which translates to availability zones (AZs) in OpenStack. We need to figure that out.
We should not have to explicitly configure the location of the clouds.yaml file, since there is a list of well-known places where these can be found. We should also be able to configure the cloud used from the chosen clouds.yaml.
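A minimal sketch of resolving clouds.yaml from well-known locations; the search order shown (working directory, ~/.config/openstack, /etc/openstack) is the conventional one, but the exact precedence we adopt is still to be confirmed:

```go
// Minimal sketch of resolving clouds.yaml from the conventional search path
// instead of requiring explicit configuration. The exact precedence used in
// the final implementation may differ; this only illustrates the idea.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func findCloudsYAML() (string, bool) {
	home, _ := os.UserHomeDir()
	candidates := []string{
		"clouds.yaml", // current working directory
		filepath.Join(home, ".config", "openstack", "clouds.yaml"),
		"/etc/openstack/clouds.yaml",
	}
	for _, c := range candidates {
		if _, err := os.Stat(c); err == nil {
			return c, true
		}
	}
	return "", false
}

func main() {
	if path, ok := findCloudsYAML(); ok {
		fmt.Println("using clouds.yaml at", path)
	} else {
		fmt.Println("no clouds.yaml found in well-known locations")
	}
}
```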
Being able to connect the node pools to additional networks, like we support already on standalone clusters.
This task will be necessary for some use cases, like using Manila CSI on a storage network, or running NFV workload on a SRIOV provider network or also running ipv6 dual stack workloads on a provider network.
I see at least 2 options:
One thing we need to solve as well is the fact that when a Node has > 1 port, kubelet won't necessarily listen on the primary interface. We need to address that too; and it seems CPO has an option to define the primary network name: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/using-openstack-cloud-controller-manager.md#networking
If we don't solve that, the nodepool (worker) won't join the cluster since Kubelet might listen on the wrong interface.
When the management cluster runs on AWS, make sure we update the DNS record for *.apps, so ingress can work out of the box.
HyperShift should be able to deploy the minimum useful OpenShift cluster on OpenStack. This is the minimum requirement to be able to test it. It is not sufficient for GA.
Stop using the openshift/installer-aro repo during installation of ARO cluster. installer-aro is a fork of openshift/installer with carried patches. Currently it is vendored into openshift/installer-aro-wrapper in place of the upstream installer.
Maintaining this fork requires considerable resources from the ARO team, and results in delays of offering new OCP releases through ARO. Removing the fork will eliminate the work involved in keeping it up to date from this process.
https://docs.google.com/document/d/1xBdl2rrVv0EX5qwhYhEQiCLb86r5Df6q0AZT27fhlf8/edit?usp=sharing
It appears that the only work required to complete this is to move the additional assets that installer-aro adds for the purpose of adding data to the ignition files. These changes can be directly added to the ignition after it is generated by the wrapper. This is the same thing that would be accomplished by OCPSTRAT-732, but that ticket involves adding a Hive API to do this in a generic way.
The OCP Installer team will contribute code changes to installer-aro-wrapper necessary to eliminate the fork. The ARO team will review and test changes.
The fork repo is no longer vendored in installer-aro-wrapper.
Add results here once the Initiative is started. Recommend discussions & updates once per quarter in bullets.
Currently the Azure client can only be mocked in unit tests of the pkg/asset/installconfig/azure package. Using the mockable interface consistently and adding a public interface to set it up will allow other packages to write unit tests for code involving the Azure client.
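A minimal sketch of the pattern, with an illustrative (not the installer's actual) interface, showing how another package's unit test could inject a fake Azure client:

```go
// Minimal sketch: consume the Azure client through a small interface so that
// packages outside pkg/asset/installconfig/azure can substitute a fake in unit
// tests. The interface and method names here are illustrative assumptions,
// not the installer's real API.
package example

import (
	"context"
	"fmt"
)

// ResourceGroupsClient is a hypothetical narrow interface over the Azure SDK calls we need.
type ResourceGroupsClient interface {
	GroupExists(ctx context.Context, name string) (bool, error)
}

// Validator depends on the interface rather than the concrete SDK client.
type Validator struct {
	Client ResourceGroupsClient
}

func (v *Validator) ValidateGroup(ctx context.Context, name string) error {
	ok, err := v.Client.GroupExists(ctx, name)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("resource group %q not found", name)
	}
	return nil
}

// fakeClient is what another package's unit test would provide instead of the real SDK.
type fakeClient struct{ exists bool }

func (f *fakeClient) GroupExists(ctx context.Context, name string) (bool, error) {
	return f.exists, nil
}
```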
We deprecated "DeploymentConfig" in-favor of "Deployment" in OCP 4.14
Now in 4.18 we want to make "Deployment" the default out of the box, which means customers will get Deployment when they install OCP 4.18.
DeploymentConfig will still be available in 4.18 as a non-default option for users who still want to use it.
FYI: "DeploymentConfig" is a tier 1 API in OpenShift and cannot be removed from the 4.x product.
Please Review this FAQ : https://docs.google.com/document/d/1OnIrGReZKpc5kzdTgqJvZYWYha4orrGMVjfP1fUpljY/edit#heading=h.oranye5nwtsy
Epic Goal*
WRKLDS-695 was implemented to put DeploymentConfig (DC) behind a capability in 4.14. To prepare customers for migration to Deployments, the capability was enabled by default. After three releases we need to reconsider whether disabling the capability by default is feasible.
More about capabilities in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capability-sets.
Why is this important? (mandatory)
Disabling a capability by default makes an OCP installation lighter. Fewer components running by default reduces the security risk/vulnerability surface.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
None. The DC capability can be enabled if needed.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Before the DCs can be disabled by default all the relevant e2e relying on DCs need to be migrated to Deployments to maintain the same testing coverage.
This feature enables users of Hosted Control Planes (HCP) on bare metal to provision spoke clusters from ACM at scale, supporting hundreds to low thousands of clusters per hub cluster. It will use ACM's multi-tenancy to prevent interference across clusters. The implementation assumes the presence of workers in hosted clusters (either bare metal or KubeVirt).
We have a customer requirement to allow for massive scale & TCO reduction via Multiple ACM Hubs on a single OCP Cluster - Kubevirt Version
When using OpenShift in a mixed, multi-architecture environment, some key details or checks are not always available. With this feature we will take a first pass at improving the UI/UX for customers as adoption of this configuration continues at pace.
The UI/UX experience should be improved when used in a mixed-architecture OCP cluster
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Y |
Classic (standalone cluster) | Y |
Hosted control planes | Y |
Multi node, Compact (three node), or Single node (SNO), or all | Y |
Connected / Restricted Network | Y |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All architectures |
Operator compatibility | n/a |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | OpenShift Console |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Add support for the GCP N4 Machine Series to be used as Control Plane and Compute Nodes when deploying OpenShift on Google Cloud
As a user, I want to deploy OpenShift on Google Cloud using the N4 Machine Series for the Control Plane and Compute Nodes so I can take advantage of these new Machine types
OpenShift can be deployed in Google Cloud using the new N4 Machine Series for the Control Plane and Compute Nodes
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Google has made the N4 Machine Series available on their cloud offering. These Machine Series use "hyperdisk-balanced" disks for the boot device, which are not currently supported
The documentation will be updated adding the new disk type that needs to be supported as part of this enablement. Also the N4 Machine Series will be added as tested Machine types for Google Cloud when deploying OpenShift
As a oc-mirror user, I would like mirrored operator catalogs to reflect the mirrored operators only, so that, after I mirror my catalog I can check that it contains the filtered operators using:
$ oc-mirror list operators --catalog mirror.syangsao.net:8443/ocp4/redhat/redhat-operator-index:v4.12
In oc-mirror v2 (and in v1 after bug fix OCPBUGS-31536), oc-mirror doesn't rebuild catalogs.
As a oc-mirror user, I would like mirrored operator catalogs to reflect the mirrored operators only, so that, after I mirror my catalog I can check that it contains the filtered operators using:
oc-mirror list operators --catalog mirror.syangsao.net:8443/ocp4/redhat/redhat-operator-index:v4.12
In oc-mirror v2 (and in v1 after bug fix OCPBUGS-31536), oc-mirror doesn't rebuild catalogs.
This user story is to cover all the scenarios that were not covered by CLID-230
Currently buildah can cause problems due to the unshare call.
Some problems were found, for example:
image is not a manifest list
and the only way out was to rm -fr $HOME/.local/share/containers/storage
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1812240892
In order to keep the single-responsibility principle, the rebuild of the catalog should happen outside the collector phase.
Each filtered catalog should have its own folder, named by the digest of its contents, and inside this folder the following items should be present:
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1806084982
Since o.Opts is already passed to imagebuilder.NewBuilder(), passing o.Opts.SrcImage.TlsVerify and o.Opts.DestImage.TlsVerify is not needed as additional arguments.
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1812461948
Ideally, ImageBuilderInterface would be the interface for building any kind of image. Since RebuildCatalogs is very specific to catalog images, it would be better to have a separate interface only for that, or to reuse BuildAndPush.
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1806145892
Keeping the container file used to filter the catalog in the working-dir can help in troubleshooting.
Maybe look at adding spinners here, indicating which catalog is currently being processed...
This implies that we generate a new declarative config containing only a portion of the original declarative config.
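A minimal sketch of that kind of filtering, treating the declarative config as a stream of JSON blobs keyed by package name; the field handling is an assumption for illustration and is not the oc-mirror implementation:

```go
// Minimal sketch of producing a reduced declarative config that keeps only the
// selected packages. The field names ("package", "name") follow olm
// declarative config conventions, but this is an illustration only.
package main

import (
	"encoding/json"
	"io"
	"os"
)

func filterDC(in io.Reader, out io.Writer, keep map[string]bool) error {
	dec := json.NewDecoder(in)
	enc := json.NewEncoder(out)
	for {
		var blob map[string]interface{}
		if err := dec.Decode(&blob); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		pkg, _ := blob["package"].(string)
		if pkg == "" {
			pkg, _ = blob["name"].(string) // olm.package blobs carry the name directly
		}
		if keep[pkg] {
			if err := enc.Encode(blob); err != nil {
				return err
			}
		}
	}
}

func main() {
	// Example: keep only one operator package from the catalog on stdin.
	_ = filterDC(os.Stdin, os.Stdout, map[string]bool{"aws-load-balancer-operator": true})
}
```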
Acceptance criteria:
This story is about creating an image that contains opm, the declarative config (and optionally the cache)
Multiple solutions here:
Acceptance criteria:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code by moving all CSI driver operators into a single repo. Having a common repo across drivers will ease the maintenance burden.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | yes |
Classic (standalone cluster) | yes |
Hosted control planes | all |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
N/A includes all the CSI operators Red Hat manages as part of OCP
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
This effort started with the CSI operators that we included for HCP; we want to align all CSI operators to use the same approach in order to limit maintenance efforts.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Not customer facing, this should not introduce any regression.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
No doc needed
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
N/A, it's purely tech debt / internal
Epic Goal*
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
Why is this important? (mandatory)
Maintaining separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.
Scenarios (mandatory)
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Dependencies (internal and external) (mandatory)
None, this can be done just by the storage team and independently on other operators / features.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Implement the following step from the enhancement
Implement the following step of the enhancement
Implement one of the post migration steps
Epic Goal*
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
Why is this important? (mandatory)
Maintaining separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.
Scenarios (mandatory)
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Note: we do not plan to do any changes for HyperShift. The EFS CSI driver will still fully run in the guest cluster, including its control plane.
Dependencies (internal and external) (mandatory)
None, this can be done just by the storage team and independently on other operators / features.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Epic Goal*
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
Why is this important? (mandatory)
Maintaining separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.
Scenarios (mandatory)
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Dependencies (internal and external) (mandatory)
None, this can be done just by the storage team and independently on other operators / features.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Following the step described in the enhancement, we should do the following:
Once this is done, we can work towards rewriting the operator to take advantage of the new generator tooling used for existing migrated operators.
In OSASINFRA-3609 we moved the existing Cinder CSI Driver Operator from openshift/openstack-cinder-csi-driver-operator to openshift/csi-operator, adding the contents of the former in a legacy/openstack-cinder-csi-driver-operator directory in the latter. Now, we need to rework or adapt this migrated code to integrate it fully into csi-operator.
Following the step described in the enhancement, we should do the following:
Once this work is complete, we can investigate adding HyperShift support to this driver. That work will be tracked and addressed via a separate epic.
Intel VROC (Virtual RAID on CPU) is a nontraditional RAID option that can offer some management and potential performance improvements compared to traditional hardware RAID. RAID devices can be set up from firmware or via remote management tools and present as MD devices.
Initial support was delivered in OpenShift 4.16. This feature is to enhance that support by:
Any technologies not already supported by the RHEL kernel.
https://www.intel.com/content/www/us/en/software/virtual-raid-on-cpu-vroc.html
Interoperability Considerations
Allow users of Intel VROC hardware to deploy OpenShift to it via the Assisted Installer.
https://www.intel.com/content/www/us/en/software/virtual-raid-on-cpu-vroc.html
Currently the support only exists with UPI deployments. The Assisted Installer blocks it.
Assisted Installer can deploy to hardware using the Intel VROC.
Yes
Intel VROC support exists in OpenShift, just not in the Assisted Installer; this epic seeks to add it.
We support Intel VROC with OpenShift UPI but Assisted Installer blocks it. Please see https://issues.redhat.com/browse/SUPPORTEX-22763 for full details of testing and results.
Customers using Intel VROC with OpenShift will want to use Assisted Installer for their deployments. As do we.
TBC
Assisted installer is part of NPSS so this will benefit Telco customers using NPSS with Intel VROC.
Brings Assisted installer into alignment with the rest of the product.
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
Description copied from attached feature card: https://issues.redhat.com/browse/OCPSTRAT-1521
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
create an e2e test that confirms that metrics collection in the MCD works and that it collects unsupported package installations using rpm-ostree
Implement the logic in the MCO Daemon to collect the defined metrics and send them to Prometheus. For the Prometheus side of things, this will involve some manipulation in `metrics.go`.
Acceptance Criteria:
1. The MCO daemon should collect package installation data (defined from the spike MCO-1275) during its normal operation.
2. The daemon should report this data to Prometheus at a specified time interval (defined from spike MCO-1277).
3. Include error handling for scenarios where the rpm-ostree command fails or returns unexpected results.
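A minimal sketch of the kind of collection logic described above; the metric name is hypothetical and the rpm-ostree JSON field names are assumptions to verify against the version shipped in RHCOS:

```go
// Minimal sketch (not the MCO's actual metrics.go): count locally layered
// packages reported by `rpm-ostree status --json` and expose the count as a
// Prometheus gauge. The metric name is hypothetical, and the JSON field names
// ("deployments", "booted", "requested-packages") are assumptions to verify.
package metrics

import (
	"encoding/json"
	"os/exec"

	"github.com/prometheus/client_golang/prometheus"
)

var layeredPackages = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "mcd_local_layered_packages", // hypothetical metric name
	Help: "Locally layered rpm-ostree packages on the booted deployment.",
})

type rpmOstreeStatus struct {
	Deployments []struct {
		Booted            bool     `json:"booted"`
		RequestedPackages []string `json:"requested-packages"`
	} `json:"deployments"`
}

func updateLayeredPackageMetric() error {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		return err // acceptance criterion 3: surface rpm-ostree failures instead of guessing
	}
	var status rpmOstreeStatus
	if err := json.Unmarshal(out, &status); err != nil {
		return err
	}
	for _, d := range status.Deployments {
		if d.Booted {
			layeredPackages.Set(float64(len(d.RequestedPackages)))
		}
	}
	return nil
}

func init() {
	prometheus.MustRegister(layeredPackages)
}
```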
TBD
Implement authorization to secure API access for different user personas/actors in the agent-based installer.
User Personas:
This is
The agent-based installer APIs have implemented basic security measures through authentication, as covered in AGENT-145.
To further enhance security, it is crucial to implement user persona/actor-based authorization, allowing for differentiated access control, such as read-only or read-write permissions, based on the user's role.
The goal of this implementation is to provide a more robust and secure API framework, ensuring that users can only perform actions appropriate to their role.
As a developer working on the Assisted Service, I want to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a wait-for and monitor-add-nodes user, I want to be able to:
So that I can achieve:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user with userAuth, agentAuth, and watcherAuth persona (wait-for and monitor-add-nodes):
So that I can achieve:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
Support in the IPI installer for OpenShift on vSphere to create the OpenShift node VMs with multiple NICs and subnets.
This is necessary when users want to have dedicated network links in the node VMs for storage or database traffic, for example, in addition to the service network link that we create now
Requirements
Users can specify multiple NICs for the OpenShift VMs that will be created for the OpenShift cluster nodes with different subnets.
Support in the IPI installer for OpenShift on vSphere to create the OpenShift node VMs with multiple NICs and subnets.
This is necessary when users want to have dedicated network links in the node VMs for storage or database traffic, for example, in addition to the service network link that we create now
Requirements
Users can specify multiple NICs for the OpenShift VMs that will be created for the OpenShift cluster nodes with different subnets.
Description:
The machine config operator needs to be bumped to pick up the API change:
I0819 17:50:00.396986 1 machineconfig.go:87] ControllerConfig not found, creating new one
E0819 17:50:00.400599 1 machineconfig.go:90] Failed to create ControllerConfig: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Acceptance Criteria:
Description:
The infrastructure spec validation needs to be updated to change the network count restriction to 10 (https://configmax.esp.vmware.com/guest?vmwareproduct=vSphere&release=vSphere%208.0&categories=1-0).
When multiple NICs are enabled (does the installer allow this?), bootstrapping fails with:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1673] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Acceptance Criteria:
issue created by splat-bot
USER STORY:
As an OpenShift provisioner, I want to provision a cluster in which nodes have multiple network adapters so that I can implement the desired network topology.
DESCRIPTION:
Customers have a need to provision nodes with multiple adapters in day 0. capv supports the ability to specify multiple adapters in its clone spec. The installer should be augmented to support additional NICs.
Required:
Nice to have:
...
ACCEPTANCE CRITERIA:
ENGINEERING DETAILS:
The machine API is failing to render compute nodes when multiple NICs are configured:
Unable to apply 4.17.0-0.ci.test-2024-08-15-193100-ci-ln-igm0nhk-latest: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Description:
Bump machine-api to pick up changes in openshift/api#2002.
Acceptance Criteria:
issue created by splat-bot
Improve the cluster expansion with the agent workflow added in OpenShift 4.16 (TP) and OpenShift 4.17 (GA) with:
Improve the user experience and functionality of the commands to add nodes to clusters using the image creation functionality.
Currently dev-scripts supports the add-nodes workflow only via the ISO. We should be able to select the mode used to add a node via an explicit config variable, so that the PXE approach can also be used.
Improve the output shown for the monitor command, especially in the case of multiple nodes, so that it is more readable.
Note
A possible approach could be to change the monitoring logic into a polling loop, where nodes are grouped by "stages". A stage represents how far a node has progressed through the add workflow (the stages have not yet been defined).
Run integration tests for presubmit jobs in the installer repo
This page https://github.com/openshift/installer/blob/master/docs/user/agent/agent-services.md needs to be updated, to reflect the new services available in case of add nodes workflow vs install workflow
The add-nodes-image command may also generate PXE artifacts (instead of the ISO). This will require an additional command flag (and review the command name)
(also evaluate the possibility of using a sub-command instead)
Currently the oc node-image create command looks for the kube-system/cluster-config-v1 resource to infer some of the required elements for generating the ISO.
The main issue is that the kube-system/cluster-config-v1 resource may be stale, since it contains information used when the cluster was installed, and that may have changed during the lifetime of the cluster.
tech note about the replacement
Field | Source |
---|---|
APIDNSName | oc get infrastructure cluster -o=jsonpath='{.status.apiServerURL}' |
ImageDigestSource | oc get imagedigestmirrorsets image-digest-mirror -o=jsonpath='{.spec.imageDigestMirrors}' |
ImageContentSources | oc get imagecontentsourcepolicy |
ClusterName | Derived from APIDNSName (api.<cluster name>.<base domain>) |
SSHKey | oc get machineconfig 99-worker-ssh -o jsonpath='{.spec.config.passwd.users[0].sshAuthorizedKeys}' |
FIPS | oc get machineconfig 99-worker-ssh -o jsonpath='{.spec.fips}' |
(see also Zane Bitter comment in https://issues.redhat.com/browse/OCPBUGS-38802)
Currently the oc node-image create command does not report any relevant information that could help the user understand where each element was retrieved from (for example, the SSH key), thus making it more difficult to troubleshoot an eventual issue.
For this reason, it could be useful for the node-joiner tool to produce a proper JSON file reporting all the details about the relevant resources fetched for generating the image. The oc command should be able to expose them when required (i.e. via a command flag).
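One possible shape for such a report file, sketched as Go types; the struct and field names are hypothetical and only illustrate the provenance data the node-joiner could emit for oc to surface:

```go
// Hypothetical report structure for the node-joiner provenance output.
// Field names are illustrative, not an agreed-upon schema.
package report

// ResolvedInput records one value used to build the image and where it came from.
type ResolvedInput struct {
	Name   string `json:"name"`   // e.g. "sshKey"
	Value  string `json:"value"`  // resolved value (possibly redacted)
	Source string `json:"source"` // e.g. "machineconfig/99-worker-ssh"
}

// Report is the JSON document the node-joiner would write and oc would read back.
type Report struct {
	ClusterID      string          `json:"clusterId,omitempty"`
	ResolvedInputs []ResolvedInput `json:"resolvedInputs"`
	Warnings       []string        `json:"warnings,omitempty"`
}
```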
Currently the error reporting of the oc node-image create command is pretty rough, as it prints in the console the log traces captured from the node-joiner pod standard output. Even though this could help the user understand the problem, many unnecessary technical details are exposed, making the overall experience cumbersome.
For this reason, the node-joiner tool should generate a proper JSON file with the outcome of the action, including any error messages encountered.
The oc command should fetch this JSON output and report it in the console, instead of showing the node-joiner pod logs output.
Also provide a flag to report the full pod logs, for troubleshooting purposes.
Manage backward compatibility with older versions of node-joiner that do not support the enhanced output.
Support adding nodes using PXE files instead of ISO.
Questions
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
A set of capabilities need to be added to the Hypershift Operator that will enable AWS Shared-VPC deployment for ROSA w/ HCP.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Build capabilities into HyperShift Operator to enable AWS Shared-VPC deployment for ROSA w/ HCP.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Antoni Segura Puimedon Please help with providing what Hypershift will need on the OCPSTRAT side.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | (perhaps) both |
Classic (standalone cluster) | |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 and Arm |
Operator compatibility | |
Backport needed (list applicable versions) | 4.14+ |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no (this is an advanced feature not being exposed via web-UI elements) |
Other (please specify) | ROSA w/ HCP |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Currently the same SG is used for both workers and VPC endpoint. Create a separate SG for the VPC endpoint and only open the ports necessary on each.
"Shared VPCs" are a unique AWS infrastructure design: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html
See prior work/explanations/etc here: https://issues.redhat.com/browse/SDE-1239
Summary is that in a Shared VPC environment, a VPC is created in Account A and shared to Account B. The owner of Account B wants to create a ROSA cluster, however Account B does not have permissions to create a private hosted zone in the Shared VPC. So they have to ask Account A to create the private hosted zone and link it to the Shared VPC. OpenShift then needs to be able to accept the ID of that private hosted zone for usage instead of creating the private hosted zone itself.
QE should have some environments or testing scripts available to test the Shared VPC scenario
The AWS endpoint controller in the CPO currently uses the control plane operator role to create the private link endpoint for the hosted cluster as well as the corresponding dns records in the hypershift.local hosted zone. If a role is created to allow it to create that vpc endpoint in the vpc owner's account, the controller would have to explicitly assume the role so it can create the vpc endpoint, and potentially a separate role for populating dns records in the hypershift.local zone.
The users would need to create a custom policy to enable this
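A minimal sketch of that explicit role assumption using aws-sdk-go v1; the role ARNs are placeholders and the real wiring lives in the CPO's AWS endpoint controller:

```go
// Minimal sketch: build AWS clients that assume dedicated roles so the
// controller can create the VPC endpoint in the VPC owner's account and
// manage records in the hypershift.local hosted zone. Role ARNs are placeholders.
package awsendpoint

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/route53"
)

func newSharedVPCClients(vpcEndpointRoleARN, dnsRoleARN string) (*ec2.EC2, *route53.Route53) {
	sess := session.Must(session.NewSession())

	// Assume the role that permits creating the VPC endpoint in the VPC owner's account.
	ec2Client := ec2.New(sess, &aws.Config{
		Credentials: stscreds.NewCredentials(sess, vpcEndpointRoleARN),
	})

	// Potentially a separate role for populating the hypershift.local DNS records.
	route53Client := route53.New(sess, &aws.Config{
		Credentials: stscreds.NewCredentials(sess, dnsRoleARN),
	})

	return ec2Client, route53Client
}
```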
Add the necessary API fields to support a Shared VPC infrastructure, and enable development/testing of Shared VPC support by adding the Shared VPC capability to the hypershift CLI.
The e2e tests that were introduced in U/S OVN-K repo should be ported and added to D/S.
Console enhancements based on customer RFEs that improve customer user experience.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
As a user, I want to access the Import from Git and Container image form from the admin perspective as well.
Provide Import from Git and Container image options that redirect the users to the respective forms.
Customers would like to be able to start individual CronJobs manually via a button in the OpenShift web console, without having to use the oc CLI.
To start a Job from a CronJob using CLI, following command is being used:
$ oc create job a-cronjob --from=cronjob/a-cronjob
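For reference, a sketch of what the console-side action could do with client-go, mirroring the CLI behaviour above; the job naming scheme and labels handling are illustrative assumptions:

```go
// Minimal sketch: create a Job from a CronJob's job template, which is what
// `oc create job --from=cronjob/...` does under the hood. Names are placeholders.
package cronjobs

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func triggerCronJob(ctx context.Context, c kubernetes.Interface, ns, name string) (*batchv1.Job, error) {
	cj, err := c.BatchV1().CronJobs(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			// Give the manually triggered Job a unique, recognizable name.
			Name:        fmt.Sprintf("%s-manual-%d", name, time.Now().Unix()),
			Namespace:   ns,
			Labels:      cj.Spec.JobTemplate.ObjectMeta.Labels,
			Annotations: cj.Spec.JobTemplate.ObjectMeta.Annotations,
		},
		Spec: cj.Spec.JobTemplate.Spec,
	}
	return c.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{})
}
```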
AC:
Created from https://issues.redhat.com/browse/RFE-6131
As a cluster admin I want to set a cluster wide setting for hiding the "Getting started resources" banner from Overview, for all the console users.
AC:
As a user who is visually impaired, or a user who is out in the sun, when I switch the theme in the console to Light mode, then try to edit text files (e.g., the YAML configuration for a pod) using the web console, I want the editor to be in light theme.
Allow users to create an RHCOS image to be used for bootstrapping new clusters.
The IPI installer is currently uploading the RHCOS image to all AOS Clusters. In environments where each cluster is on a different subnet this uses unnecessary bandwidth and takes a long time on low bandwidth networks.
The goal is to use a pre-existing VM images in Prism Central to bootstrap the cluster
Add support for the GCP C4/C4A Machine Series to be used for Control Plane and Compute Nodes when deploying OpenShift on Google Cloud
As a user, I want to deploy OpenShift on Google Cloud using C4/C4A Machine Series for the Control Plane and Compute Node so I can take advantage of these new Machine types
OpenShift can be deployed in Google Cloud using the new C4/C4A Machine Series for the Control Plane and Compute Nodes starting in OpenShift 4.17.z
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Google has made C4/C4A Machine Series available on their cloud offering.
The documentation will be updated to add the new disk type that needs to be supported as part of this enablement. The C4/C4A Machine Series will also be added to the list of tested machine types for Google Cloud when deploying OpenShift.
1. Add C4 and C4A instances to list of tested instances in docs.
2. Document that not all zones can be used for installation of these machine types. The installer has no way to know whether these instances can actually be created in a given zone, so to install successfully, specify supported zones in the control plane and compute machine pools in the install config (see the sketch below).
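For illustration, a hedged install-config.yaml fragment for the zone pinning described in point 2; the machine type names and zones are examples only:
controlPlane:
  platform:
    gcp:
      type: c4-standard-8
      zones:
      - us-central1-a
      - us-central1-b
compute:
- name: worker
  platform:
    gcp:
      type: c4-standard-4
      zones:
      - us-central1-a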
The transition from runc to crun is part of OpenShift’s broader strategy for improved performance and security. In OpenShift clusters with hosted control planes, retaining the original runtime during upgrades was considered complex and unnecessary, given the success of crun in tests and the lack of proof for significant risk. This decision aligns with OpenShift’s default container runtime upgrade and simplifies long-term support.
Deployment Configurations | Specific Needs |
---|---|
Self-managed, managed, or both | Both |
Classic (standalone cluster) | N/A |
Hosted control planes | Yes |
Multi-node, Compact (three-node), SNO | All |
Connected / Restricted Network | N/A |
Architectures (x86_64, ARM, IBM Power, IBM Z) | All |
Backport needed | None |
UI Needs | No additional UI needs. OCM may require an acknowledgment for runtime change. |
Scenario 1:
A user upgrading from OpenShift 4.17 to 4.18 in a HyperShift environment has NodePools running runc. After the upgrade, the NodePools automatically switch to crun without user intervention, providing consistency across all clusters.
Scenario 2:
A user concerned about performance with crun in 4.18 can create a new NodePool to test workloads with crun while keeping existing NodePools running runc. This allows for gradual migration, but default behavior aligns with the crun upgrade.
Scenario 2 needs to be well documented as a best practice.
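A rough sketch of Scenario 2 with the hypershift CLI; the flag names are assumptions and may differ by version, and the runtime used by the existing pools is left to the usual NodePool configuration mechanisms:
$ hypershift create nodepool aws \
    --cluster-name my-hosted-cluster \
    --name crun-canary \
    --node-count 2
Move test workloads onto the new pool (for example with node selectors), compare behavior under crun, then scale down or delete the canary pool once satisfied.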
Based on this conversation, we should make sure we document the following:
As a customer I would like to know how the runtime change from runc to crun could affect me, for that we will need to
Description of criteria:
As a customer I want to upgrade my HostedCluster from 4.17 to 4.18, so I can verify:
If any of the points above fails, we need to fill a bug in order to solve it and put it under same Epic as this user story.
Description of criteria:
We aim to continue establishing a comprehensive testing strategy for Hosted Control Planes (HCP) that aligns with Red Hat’s support requirements and ensures customer satisfaction. This involves testing across various permutations, including providers, lifecycle, upgrades, and version compatibility. The testing must span management clusters, hubs, MCE, control planes, and nodepools, while coordinating across multiple QE teams to avoid duplication and inefficiencies. We aim to sustain an evolving testing matrix to meet product demands, especially as new versions and extended OCP lifecycles are introduced.
See: https://docs.google.com/spreadsheets/d/1j8TjMfyCfEt8OzTgvrAG3tuC6WMweBh5ElzWu6oAvUw/edit?gid=0#gid=0
The HCP architecture introduces decoupled control planes and worker nodes, significantly increasing the number of testing permutations. Ensuring these scenarios are tested is crucial to maintaining product quality and customer satisfaction, and to staying compliant as an OpenShift form factor.
This was attempted once before
https://github.com/openshift/release/pull/47599
Then reverted
https://github.com/openshift/release/pull/48326
ROSA HCP production currently runs the HO from main against 4.14 and 4.15 HCs; however, we do not test these combinations together in presubmit testing, which increases the chance of an escape.
OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.
In phase 1 provided tech preview for GCP.
In phase 2, GCP support goes to GA and AWS goes to TP.
In phase 3, AWS support goes to GA and vsphere goes TP.
This epic will encompass work involved to GA the boot image update feature for the AWS platform.
This work will involve bumping the API in the MCO repo, capturing the new feature gate changes.
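For illustration, a sketch of the cluster-side opt-in based on the tech-preview GCP shape of the API; the exact field names and defaults for the AWS GA may differ once the feature gate changes land:
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  managedBootImages:
    machineManagers:
    - resource: machinesets
      apiGroup: machine.openshift.io
      selection:
        mode: All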
Per GA requirements, we are required to add five tests to openshift/origin. This story will encompass part of that work.
Per GA requirements, we are required to add tests to openshift/origin. This story will encompass that work.
This work will involve updating the feature gate in the openshift/api.
Example: https://github.com/openshift/api/pull/1975
This will be blocked by MCO-1304. Once we land those tests, they will need some soaking time as indicated by the GA requirements.
To introduce tests for new permissions required as pre-submit tests on PRs so that PR authors can see whenever their changes affect the minimum required permissions
Currently, the process is that QE installs with the documented minimum permissions, which starts failing whenever something new unknowingly requires additional permissions.
That test runs once a week. When it fails, QE reviews the failures and files bugs; the Installer team then adds the new permissions to the file in the installer repo that tracks the required permissions.
The issue is that it takes some time to get a permissions change implemented by AWS, so the late discovery of a need can become a release blocker.
Early test new minimum permissions required to deploy OCP on AWS so ROSA can be informed before any feature that alters the minimum permissions requirements gets released.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
This is an internal-only feature and should not require any user-facing documentation
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Review, refine and harden the CAPI-based Installer implementation introduced in 4.16
From the implementation of the CAPI-based Installer started with OpenShift 4.16 there is some technical debt that needs to be reviewed and addressed to refine and harden this new installation architecture.
Review existing implementation, refine as required and harden as possible to remove all the existing technical debt
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
There should not be any user-facing documentation required for this work
We need a place to add tasks that are not feature oriented.
The agent installer does not require the infra-env id to be present in the claim to perform the authentication.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Once a cloud provider uses CAPI by default, the feature gate it used becomes tech debt.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.
Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.
Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.
This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.
Goal
Refactor and modularize controllers and other components to improve maintainability, scalability, and ease of use.
Move any NTO-related logic from the NodePool controller into a single reconcile() func implemented in nto.go.
As a dev I want to understand at a glance which conditions are relevant for the NodePool.
As a dev I want to have the ability to add/collapse conditions easily.
As a dev I want any condition expectations to be unit testable.
Abstract away in a single place all the logic related to token and userdata secrets consuming the output of https://issues.redhat.com/browse/HOSTEDCP-1678
This should result in a single abstraction, i.e. "Token", that exposes a thin library, e.g. Reconcile(), and hides all the details of the token/userdata secrets lifecycle.
As a dev I want to easily add and understand which inputs result in triggering a NodePool upgrade.
There are many scattered things that trigger a NodePool rolling upgrade on change.
For code sustainability it would be good to have a common abstraction that discovers all of them based on an input and returns the authoritative hash for any targeted config version at a point in time.
Related https://github.com/openshift/hypershift/pull/4057
https://github.com/openshift/hypershift/pull/3969#discussion_r1587198191
Following up on abstracting pieces into cohesive units, CAPI is the next logical choice since there is a lot of reconciliation business logic for it in the NodePool controller.
Goals:
All capi related logic is driven by a single abstraction/struct.
Almost full unit test coverage
Deeper refactoring of the concrete implementation logic is left out of scope for gradual, test-driven follow-ups.
As a (user persona), I want to be able to:
https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. We need to refactor every component to use this abstraction.
Description of criteria:
All ControlPlane Components are refactored:
Example PR to refactor HCCO: https://github.com/openshift/hypershift/pull/4860
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
Provide a PR with an OAPI standard refactor
Example PR to refactor HCCO: https://github.com/openshift/hypershift/pull/4860
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
As a (user persona), I want to be able to:
Context:
If you have ever had to add or modify a component of the control plane operator, the need for this becomes very obvious. It should only be possible to add component manifests through a gated interface.
Right now adding a new component requires copy/pasting hundreds of lines of boilerplate and there is plenty of room for side effects. A dev needs to manually remember to set the right config, like AutomountServiceAccountToken false, topology opinions...
We should refactor support/config and all the consumers in the CPO to enforce component creation through audited and common signatures/interfaces.
Adding a new component is only possible through these higher abstractions.
Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Focus on the general modernization of the codebase, addressing technical debt, and ensuring that the platform is easy to maintain and extend.
DoD:
Delete conversion webhook https://github.com/openshift/hypershift/pull/2267
This needs to be backward compatible for IBM.
Review IBM PRs:
https://github.com/openshift/hypershift/pull/1939
As a user of HyperShift, I want:
so that I can achieve
Description of criteria:
N/A
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a dev I want the base code to be easier to read, maintain and test
If devs don't have a healthy dev environment, the project won't go far and the business won't make $$.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Nicole Thoen has already started crafting a technical debt impeding PF6 migrations document, which contains a list of identified tech-debt items, deprecated components, etc.
Locations
frontend/public/components/
search-filter-dropdown.tsx (note: Steve has a branch that's converted this) [merged]
frontend/public/components/monitoring/
kebab-dropdown.tsx – code duplicated at https://github.com/openshift/monitoring-plugin/blob/main/web/src/components/kebab-dropdown.tsx and that version will be updated in https://issues.redhat.com/browse/OU-257 as the console version is eventually going away
ListPageCreate.tsx – addressed in https://issues.redhat.com//browse/CONSOLE-4118
alerting.tsx – code duplicated at https://github.com/openshift/monitoring-plugin/blob/main/web/src/components/alerting.tsx and that version should be updated in https://issues.redhat.com/browse/OU-561 as the console version is eventually going away
AC: Go through the mentioned files and swap the usage of DropdownDeprecated and KebabToggleDeprecated with PF components, based on their semantics (either Dropdown or Select components).
Note:
DropdownDeprecated and KebabToggleDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
Part of the PF6 adoption should be replacing TableDeprecated with the Table component
Location:
AC:
Locations
frontend/packages/console-shared/src/components/
GettingStartedGrid.tsx (has KebabToggleDeprecated)
Note
DropdownDeprecated is replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of DropdownDeprecated and KebabToggleDeprecated with PF components, based on their semantics (either Dropdown or Select components).
NodeLogs.tsx (two) [merged]
PerspectiveDropdown.tsx (??? Can not locate this dropdown in the UI. Reached out to Christoph but didn't hear back.)
UserPreferenceDropdownField.tsx [merged]
ClusterConfigurationDropdownField.tsx (??? Can not locate this dropdown in the UI) Dead code
PerspectiveConfiguration.tsx (options have descriptions) [merged]
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
multiselectdropdown.tsx (multiple typeahead with placeholder and noResultsFoundText)
Note
SelectDeprecated and SelectOptionDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
multiselectdropdown.tsx (multiple typeahead with placeholder and noResultsFoundText) only used in packages/local-storage-operator moved to https://issues.redhat.com/browse/CONSOLE-4227
UtilizationDurationDropdown.tsx (checkbox select, plain toggle, with placeholder text)
SelectInputField.tsx (uses most Select props) moved to https://issues.redhat.com/browse/ODC-7655
QueryBrowser.tsx (Currently using DropdownDeprecated, should be using a Select)
Note
SelectDeprecated and SelectOptionDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
AC:
PatternFly demo using Dropdown and Menu components
https://www.patternfly.org/components/menus/application-launcher/
operator-channel-version-select.tsx (Two)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
Replace DropdownDeprecated
Replace SelectDeprecated
Acceptance Criteria
Note:
DropdownDeprecated and KebabToggleDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
resource-dropdown.tsx (checkbox, options have tooltips, grouped options, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)
filter-toolbar.tsx (grouped, checkbox select)
monitoring/dashboards/index.tsx (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead) covered by https://issues.redhat.com/browse/ODC-7655
silence-form.tsx (Currently using DropdownDeprecated, should be using a Select)
timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655
poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655
Note
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of Deprecated components with PF components, based on their semantics (either Dropdown or Select components).
Locations
frontend/packages/console-app/src/components/
NavHeader.tsx [merged]
PDBForm.tsx (This should be a <Select>) [merged]
Acceptance Criteria:
DropdownDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
Nicole Thoen has already started crafting a technical debt impeding PF6 migrations document, which contains a list of identified tech-debt items, deprecated components, etc.
Locations
frontend/packages/pipelines-plugin/src/components/
PipelineQuickSearchVersionDropdown.tsx (Currently using DropdownDeprecated, should be using a Select)
PipelineMetricsTimeRangeDropdown.tsx (Currently using DropdownDeprecated, should be using a Select)
Note
DropdownDeprecated are replaced with latest Select components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
SecureRouteFields.tsx (Two)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
KindFilterDropdown.tsx (checkbox select with custom content - not options)
FilterDropdown.tsx (checkbox, grouped, switch component in select menu)
NameLabelFilterDropdown.tsx (Should be a Select component; Currently using DropdownDeprecated)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
TelemetryConfiguration.tsx (options have descriptions)
TelemetryUserPreferenceDropdown.tsx (options have descriptions)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
Locations
frontend/packages/topology/MoveConnectionModal.tsx
Note:
DropdownDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select or Dropdown components.
Part of the PF6 adoption should be replacing TableDeprecated with the Table component
Location:
AC:
monitoring/dashboards/index.tsx (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)
timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select)
poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select)
SelectInputField.tsx (uses most Select props)
`FilterSelect`, `VariableDropdown`, `TimespanDropdown`, and `IntervalDropdown` are the components that need to be updated; frontend/packages/dev-console/src/components/monitoring/MonitoringPage.tsx is the only valid instance usage of `MonitoringDashboardsPage`, as web/src/components/alerting.tsx is orphaned.
Note
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of Deprecated components with PF components, based on their semantics (either Dropdown or Select components).
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.
VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular use is changing QoS values. The parameters that can be changed depend on the driver used.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Productise VolumeAttributesClass as TP in anticipation for GA. Customer can start testing VolumeAttributesClass.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | N/A core storage |
Backport needed (list applicable versions) | None |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD for TP |
Other (please specify) | n/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.
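A minimal sketch of the upstream beta API; the driver name and parameter keys are illustrative and depend on what the CSI driver supports:
apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: csi.example.com
parameters:
  iops: "8000"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeAttributesClassName: gold
The PVC's volumeAttributesClassName can be changed after creation, which is what triggers the driver-side modification of the existing volume.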
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
UI for TP
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
There have been some limitations and complaints about the fact that PVC attributes are sealed after creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Customers should not use it in production at the moment.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document VolumeAttributesClass creation and how to update a PVC. Mention any limitations. Mention that it is Tech Preview with no upgrade support. Add driver support information if needed.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Check which drivers support it for which parameters.
Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.
There are a number of features or use cases supported by metal IPI that currently do not work in the agent-based installer (mostly due to being prevented by validations).
In phased approach, we first need to close all the identified gaps in ABI (this feature).
In a second phase, we would introduce in the IPI flow the ABI technology, once its on par with the IPI feature-set.
Close the gaps identified in Baremetal IPI/ABI Feature Gap Analysis
Given that IPI (starting with 4.10) supports nmstate config, the overall configuration seems very similar, apart from the fact that it is spread across different files.
Given: a configuration that works for the IPI method
When: I do an agent-based installation with the same configuration
Then: it works (with the exception that ISOs are provided manually)
Description of problem:
Currently the AdditionalTrustBundlePolicy is not being used, and setting it to a value other than "Proxyonly" generates a warning message:
Warning AdditionalTrustBundlePolicy: always is ignored
There are certain configurations where it is necessary to set this value; see more discussion in https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1727793787922199
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. In install-config.yaml set AdditionalTrustBundlePolicy to Always
2. Note the warning message that is output.
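For reference, a minimal install-config.yaml fragment for step 1 (certificate contents elided):
additionalTrustBundlePolicy: Always
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  ...
  -----END CERTIFICATE-----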
Actual results:
AdditionalTrustBundlePolicy is unused.
Expected results:
AdditionalTrustBundlePolicy is used in cluster installation.
Additional info:
As we gain hosted control planes customers, that bring in more diverse network topologies, we should evaluate relevant configurations and topologies and provide a more thorough coverage in CI and promotion testing
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Cut down proxy issues in managed and self-managed hosted control planes
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported Hosted Control Planes node topologies |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | N/A |
Backport needed (list applicable versions) | Coverage over all supported releases |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) |
There's been a few significant customer bugs related to proxy configurations with Hosted Control Planes
Will increase reliability for customers, preventing regressions
Documentation improvements that better detail the flow of communication and supported configurations
E2E should probably cover both ROSA/HCP and ARO/HCP
As a (user persona), I want to be able to:
so that I can achieve
This requires/does not require a design proposal.
This requires/does not require a feature gate.
A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:
This feature seeks to provide mechanisms that put an upper time boundary on delivering such fixes, matching the current HyperShift Operator <24h expectation.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed (ROSA and ARO) |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported ROSA/HCP topologies |
Connected / Restricted Network | All supported ROSA/HCP topologies |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported ROSA/HCP topologies |
Operator compatibility | CPO and Operators depending on it |
Backport needed (list applicable versions) | TBD |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | No |
Discussed previously during incident calls. Design discussion document
SOP needs to be defined for:
Acceptance criteria:
The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments.
The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but was also useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.
Key enhancements include observability and blocked traffic across paths if IPsec encryption is not functioning properly.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default. This encryption must scale to the largest of deployments.
Questions to be addressed:
Description of problem:
In 4.14, libreswan runs as a containerized process inside the pod. SOS reports and must-gathers are not collecting libreswan logs and xfrm information from the nodes, which makes the debugging process heavier. This should be fixed by working with the sos-report team OR by changing our must-gather scripts in 4.14 alone. From 4.15, libreswan is a systemd process running on the host, so the swan logs are gathered in the sos-report. For 4.14, especially during escalations, gathering individual node data over and over is becoming painful for IPsec. We need to ensure all the data required to debug IPsec is collected in one place.
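For reference, the kind of per-node data described above can be gathered manually with oc debug; the exact gather scripts to change are up to the must-gather/sos-report owners:
$ oc debug node/<node> -- chroot /host ip xfrm state
$ oc debug node/<node> -- chroot /host ip xfrm policy
$ oc debug node/<node> -- chroot /host journalctl -u ipsec --no-pager   # 4.15+, where libreswan runs on the host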
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
As an OpenShift Administrator, I need to ensure that I rotate signing keys for self-managed Openshift Azure Entra Workload ID enabled clusters to comply with PCI-DSS v4 (see #8 on life cycle management) and NIST (see PCI “Tokenization Product Security Guidelines”) rules.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
When creating a self-managed OpenShift cluster on Azure using Azure Entra Workload ID, a dedicated OIDC endpoint is created. This endpoint exposes a document located at .well-known/openid-configuration, which contains the key jwks_uri, which in turn points to the JSON Web Key Sets.
Regular key rotations are an important part of PCI-DSS v4 and NIST rules. To ensure PCI-DSS V4 requirements, a mechanism is needed to seamlessly rotate signing keys. Currently, we can only have one signing/private key present in the OpenShift cluster; however, JWKS supports multiple public keys.
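For context, a quick way to inspect the issuer document and the key IDs it currently publishes (jq assumed available):
$ ISSUER="$(oc get authentication cluster -o jsonpath='{.spec.serviceAccountIssuer}')"
$ curl -s "${ISSUER}/.well-known/openid-configuration" | jq -r .jwks_uri
$ curl -s "$(curl -s "${ISSUER}/.well-known/openid-configuration" | jq -r .jwks_uri)" | jq -r '.keys[].kid'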
This feature will be split into 2 phases:
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64, ARM (aarch64) |
Operator compatibility | |
Backport needed (list applicable versions) | TBD (Affects OpenShift 4.14+) |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Related references
Additional references
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As an OpenShift Administrator, I need to ensure that I rotate signing keys for self-managed short-term credentials enabled clusters (Openshift Azure Entra Workload ID, GCP Workload Identity, AWS STS) to comply with PCI-DSS v4 (see #8 on life cycle management) and NIST (see PCI “Tokenization Product Security Guidelines”) rules.
Add documentation to the cloud-credential-repo for how to rotate the cluster bound-service-account-signing-key to include adding the new key to the Microsoft Azure Workload Identity issuer file. The process should meet the following requirements:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
link back to OCPSTRAT-1644 somehow
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
Failed CI jobs:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-multi-nightly-4.18-cpou-upgrade-from-4.15-aws-ipi-mini-perm-arm-f14/1842004955238502400
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-arm64-nightly-4.18-cpou-upgrade-from-4.15-azure-ipi-fullyprivate-proxy-f14/1841942041722884096
The 4.15-to-4.18 upgrade failed at the 4.17-to-4.18 stage: the authentication operator is degraded and unavailable due to APIServerDeployment_PreconditionNotFulfilled.
$ omc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-arm64-2024-10-03-172957   True        True          1h44m   Unable to apply 4.18.0-0.nightly-arm64-2024-10-03-125849: the cluster operator authentication is not available
$ omc get co authentication
NAME             VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.18.0-0.nightly-arm64-2024-10-03-125849   False       False         True       8h
$ omc get co authentication -ojson | jq '.status.conditions[]'
{
  "lastTransitionTime": "2024-10-04T04:22:39Z",
  "message": "APIServerDeploymentDegraded: waiting for .status.latestAvailableRevision to be available\nAPIServerDeploymentDegraded: ",
  "reason": "APIServerDeployment_PreconditionNotFulfilled",
  "status": "True",
  "type": "Degraded"
}
{
  "lastTransitionTime": "2024-10-04T03:54:13Z",
  "message": "AuthenticatorCertKeyProgressing: All is well",
  "reason": "AsExpected",
  "status": "False",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2024-10-04T03:52:34Z",
  "reason": "APIServerDeployment_PreconditionNotFulfilled",
  "status": "False",
  "type": "Available"
}
{
  "lastTransitionTime": "2024-10-03T21:32:31Z",
  "message": "All is well",
  "reason": "AsExpected",
  "status": "True",
  "type": "Upgradeable"
}
{
  "lastTransitionTime": "2024-10-04T00:04:57Z",
  "reason": "NoData",
  "status": "Unknown",
  "type": "EvaluationConditionsDetected"
}
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-arm64-2024-10-03-125849 4.18.0-0.nightly-multi-2024-10-03-193054
How reproducible:
always
Steps to Reproduce:
1. Upgrade from 4.15 to 4.16, then to 4.17, and then to 4.18.
Actual results:
upgrade stuck on authentication operator
Expected results:
upgrade succeed
Additional info:
The issue is found in a control plane only update jobs(with paused worker pool), but it's not cpou specified because it can be reproduced in a normal chain upgrade from 4.15 to 4.18 upgrade.
Add OpenStackLoadBalancerParameters and add an option for setting the load-balancer IP address for only those platforms where it can be implemented.
As a user of on-prem OpenShift, I need to manage DNS for my OpenShift cluster manually. I can already specify an IP address for the API server, but I cannot do this for Ingress. This means that I have to:
I would like to simplify this workflow to:
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Bump openshift/api in cluster-ingress-operator and use the new floatingIP field on platform OpenStack.
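A sketch of what consuming the field could look like on an IngressController, assuming it lands under providerParameters.openstack as proposed; the address is an example:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: External
      providerParameters:
        type: OpenStack
        openstack:
          floatingIP: 192.0.2.10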
If config drive is available on the machine use it instead of metadata.
To overcome the OVN metadata issue, we are adding an additional IPv4 network so metadata can be reached over IPv4 instead of IPv6 and we got a working installation. Now, let's try with config-drive, so we avoid specifying an IPv4 network and get the VMs to be IPv6 only.
With metadata support over IPv6 being included in OSP, we should update MCO to use the IPv6 address on single-stack IPv6 installs.
As a customer of self managed OpenShift or an SRE managing a fleet of OpenShift clusters I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a cli-status command and status-API which can be used by cluster-admin to monitor the progress. status command/API should also contain data to alert users about potential issues which can make the updates problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new command `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command output attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
One piece of information that we lost compared to the oc adm upgrade command is which ClusterOperators are being updated right now. Previously, we presented CVO's Progressing=True message, which says:
waiting on cloud-credential, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, insights, kube-storage-version-migrator, machine-approver, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager, operator-lifecycle-manager, service-ca, storage
The oc adm upgrade status output presents counts of updated/updating/pending operators, but does not say which ones are in which state. We should show this information somehow.
This is what we did for this card (for QE to verify):
- In the control plane section, we add a line of "Updating" to display the names of Cluster Operators that are being updated.
The following is an example.
= Control Plane =
Assessment:      Progressing
Target Version:  4.14.1 (from 4.14.0)
Updating:        machine-config
Completion:      97% (32 operators updated, 1 updating, 0 waiting)
Duration:        14m (Est. Time Remaining: <10m)
Operator Health: 32 Healthy, 1 Unavailable

Updating Cluster Operators
NAME             SINCE   REASON   MESSAGE
machine-config   1m10s   -        Working towards 4.14.1
The current format of the worker status line is consistent with the original format of the operator status line. However, the operator status line is being reworked and simplified as part of OTA-1155. The goal of this task is to make the worker status line simpler and somewhat consistent with that newly modified operator status line.
The current worker status line (see the “Worker Status: ...” line):
= Worker Pool =
Worker Pool:   worker
Assessment:    Degraded
Completion:    39%
Worker Status: 59 Total, 46 Available, 5 Progressing, 36 Outdated, 12 Draining, 0 Excluded, 7 Degraded
The exact new format is not defined and is for the assignee to create.
A relevant Slack discussion: https://redhat-internal.slack.com/archives/CEGKQ43CP/p1706727395851369
The main goal of this task is to turn the current worker status line, for example:
Worker Status: 4 Total, 4 Available, 0 Progressing, 3 Outdated, 0 Draining, 0 Excluded, 0 Degraded
into something grouped roughly along these lines:
Worker Status: <Available and Updated>, <Available and Outdated> [of which X are paused], <Unavailable but Progressing (Progressing and thus Unavailable)>, <Unavailable AND NOT Progressing>
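As one illustration only (the struct and wording below are placeholders, not the final format, which is for the assignee to define), the regrouped line could be rendered along these lines in Go:

package main

import "fmt"

// poolCounts regroups the per-node counts along the proposed axes; all names are illustrative.
type poolCounts struct {
	availableUpdated    int
	availableOutdated   int
	pausedOutdated      int
	unavailableUpdating int
	unavailableStuck    int
}

func workerStatusLine(c poolCounts) string {
	return fmt.Sprintf("Worker Status: %d Available and Updated, %d Available and Outdated (%d paused), %d Updating and therefore Unavailable, %d Unavailable and not Progressing",
		c.availableUpdated, c.availableOutdated, c.pausedOutdated, c.unavailableUpdating, c.unavailableStuck)
}

func main() {
	fmt.Println(workerStatusLine(poolCounts{availableUpdated: 1, availableOutdated: 3}))
}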
Definition of Done:
On the call to discuss the oc adm upgrade status roadmap to a server-side implementation (notes), we agreed on a basic architectural direction and we can start moving in that direction:
Let's start building this controller; we can have the controller perform the functionality currently present in the client and just expose it through an API. I am not sure how to deal with the fact that we won't have the API until it merges into o/api, which will not be soon. Maybe we can implement the controller over a temporary fork of o/api and rely on manually inserting the CRD into the cluster when we test the functionality? Not sure.
We need to avoid committing to implementation details and investing effort into things that may change though.
This card only brings a skeleton of the desired functionality to the DevPreviewNoUpgrade feature set. Its purpose is mainly to enable further development by putting the necessary bits in place so that we can start developing more functionality. There's not much point in automating testing of any of the functionality in this card, but it should be useful to start getting familiar with how the new controller is deployed and what its concepts are.
For seeing the new controller in action:
1. Launch a cluster that includes both the code and manifests. As of Nov 11, #1107 is not yet merged so you need to use launch 4.18,openshift/cluster-version-operator#1107 aws,no-spot
2. Enable the DevPreviewNoUpgrade feature set. CVO will restart and will deploy all functionality gated by this feature set, including the USC. It can take a bit of time, ~10-15m should be enough though.
3. Eventually, you should be able to see the new openshift-update-status-controller Namespace created in the cluster
4. You should be able to see an update-status-controller Deployment in that namespace
5. That Deployment should have one replica running and ready. It should not crashloop or anything like that. You can inspect its logs for obvious failures. At this point, its log should, near its end, say something like "the ConfigMap does not exist so doing nothing"
6. Create the ConfigMap that mimics the future API (make sure to create it in the openshift-update-status-controller namespace): oc create configmap -n openshift-update-status-controller status-api-cm-prototype
7. The controller should immediately-ish insert a usc-cv-version key into the ConfigMap. Its content is a YAML-serialized ClusterVersion status insight (see design doc). As of OTA-1269 the content is not that important, but the (1) reference to the CV (2) versions field should be correct.
8. The status insight should have a condition of Updating type. It should be False at this time (the cluster is not updating).
9. Start upgrading the cluster (it's a cluster-bot cluster with an ephemeral 4.18 version, so you'll need to use --to-image=pullspec and probably force it)
10. While updating, you should be able to observe the controller activity in the log (it logs some diffs), but also the content of the status insight in the ConfigMap changing. The versions field should change appropriately (and startedAt too), and the Updating condition should become True.
11. Eventually the update should finish and the Updating condition should flip to False again.
Some of these will turn into automated testcases, but it does not make sense to implement that automation while we're using the ConfigMap instead of the API.
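Until the real API lands, a throwaway Go check like the following can stand in for such a testcase; the namespace, ConfigMap name, key, and condition type come from the steps above, while the insight struct shape is an assumption based on the design doc:

package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

// cvInsight mirrors only the fields of the prototype ClusterVersion status insight
// that this check cares about; the real shape is defined in the design doc.
type cvInsight struct {
	Conditions []metav1.Condition `json:"conditions"`
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	cm, err := client.CoreV1().ConfigMaps("openshift-update-status-controller").
		Get(context.TODO(), "status-api-cm-prototype", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	var insight cvInsight
	if err := yaml.Unmarshal([]byte(cm.Data["usc-cv-version"]), &insight); err != nil {
		panic(err)
	}
	// While an update is running the Updating condition should be True, otherwise False.
	updating := meta.FindStatusCondition(insight.Conditions, "Updating")
	fmt.Printf("Updating condition: %+v\n", updating)
}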
Spun out of https://issues.redhat.com/browse/MCO-668
This aims to capture the work required to rotate the MCS-ignition CA + cert.
Original description copied from MCO-668:
Today in OCP there is a TLS certificate generated by the installer, which is called "root-ca" but is really "the MCS CA".
A key derived from this is injected into the pointer Ignition configuration under the "security.tls.certificateAuthorities" section, and this is how the client verifies it's talking to the expected server.
If this key expires (and by default the CA has a 10 year lifetime), newly scaled up nodes will fail in Ignition (and fail to join the cluster).
The MCO should take over management of this cert, and the corresponding user-data secret field, to implement rotation.
Reading:
- There is a section in the customer-facing documentation that touches on this and needs updating for clarification: https://docs.openshift.com/container-platform/4.13/security/certificate_types_descriptions/machine-config-operator-certificates.html
- There's a pending PR to openshift/api: https://github.com/openshift/api/pull/1484/files
- Also see old (related) bug: https://issues.redhat.com/browse/OCPBUGS-9890
- This is also separate from https://issues.redhat.com/browse/MCO-499, which describes the management of kubelet certs
We currently write the rootCA to disk via this template: https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/root-ca.yaml
Nothing that we know of currently uses this file, and since it is templated via a MachineConfig, any update to the configmap (root-ca in the kube-system namespace) used to generate this template will cause an MC roll-out. We will be updating this configmap as part of cert rotation in MCO-643, so we'd like to prevent unnecessary roll-outs by removing this template.
The machinesets in the machine-api namespace reference a user-data secret (one per pool, and it can be customized) which stores the initial Ignition stub configuration pointing to the MCS, plus the TLS cert. Today this doesn't get updated after creation.
The MCO now has the ability to manage some fields of the machineset object as part of the managed bootimage work. We should extend that to also sync the updated user-data secrets for the Ignition TLS cert.
The MCC should be able to parse both install-time-generated machinesets and user-created ones, so as not to break compatibility. One way users use this today is a custom secret + machineset to set Ignition fields the MCO doesn't manage, for example to partition disks for different device types for nodes in the same pool. Extra care should be taken not to break this use case.
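As a rough, illustrative Go sketch of the secret-patching step (not the MCO's actual code; it assumes the standard Ignition v3 pointer-config layout, and a real implementation would preserve any fields it does not understand rather than overwriting the whole security section):

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// patchIgnitionCA replaces the certificateAuthorities entry in an Ignition stub
// config with a freshly rotated MCS CA.
func patchIgnitionCA(userData, newCA []byte) ([]byte, error) {
	var cfg map[string]interface{}
	if err := json.Unmarshal(userData, &cfg); err != nil {
		return nil, err
	}
	ign, ok := cfg["ignition"].(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("user-data is not an Ignition config")
	}
	source := "data:text/plain;charset=utf-8;base64," + base64.StdEncoding.EncodeToString(newCA)
	// Simplification: overwrite the security section wholesale.
	ign["security"] = map[string]interface{}{
		"tls": map[string]interface{}{
			"certificateAuthorities": []interface{}{
				map[string]interface{}{"source": source},
			},
		},
	}
	return json.Marshal(cfg)
}

func main() {
	stub := []byte(`{"ignition":{"version":"3.2.0","security":{"tls":{"certificateAuthorities":[{"source":"data:..."}]}}}}`)
	out, err := patchIgnitionCA(stub, []byte("-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----"))
	fmt.Println(string(out), err)
}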
This feature introduces a new command oc adm upgrade recommend in Tech Preview that improves how cluster administrators evaluate and select version upgrades.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | standalone |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
Add docs for recommend command
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Changes in the web console/GUI and the oc CLI, where we will change the number of update recommendations users see.
No console changes were made in 4.18, but we may follow up with those changes later if the tech-preview oc adm upgrade recommend is well received.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Doesn't have to be recommend, but a new subcommand so that we can rip out the oc adm upgrade output about "how is your cluster currently doing" (Failing=True, mid-update, etc.). The new subcommand would just be focused on "I haven't decided which update I want to move to next; help me pick", including the "I am thinking about 4.y.z, but I'm not completely sure yet; anything I should be aware of for that target?".
Definition of Done:
For this initial ticket, we can just preserve all the current next-hop output, and tuck it behind a feature-gate environment variable, so we can make future pivots in follow-up tickets.
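A minimal sketch of such a gate (the environment variable name here is an assumption for illustration, not necessarily the one oc will use):

package main

import (
	"fmt"
	"os"
)

// recommendEnabled reports whether the hypothetical feature-gate variable is set,
// so the default oc behaviour stays unchanged unless the user opts in.
func recommendEnabled() bool {
	return os.Getenv("OC_ENABLE_CMD_UPGRADE_RECOMMEND") == "true"
}

func main() {
	if !recommendEnabled() {
		fmt.Println("oc adm upgrade recommend stays hidden; set the gate variable to try it")
		return
	}
	fmt.Println("recommend subcommand registered")
}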
Conditional update UXes today are built around the assumption that when an update is conditional, it's a Red Hat issue, and some future release will fix the bug, and an update will become recommended. On this assumption, UXes like oc adm upgrade and the web-console both mention the existence of supported-but-not-recommended update targets, but don't push the associated messages in front of the cluster administrator.
But there are also update issues like exposure to Kubernetes API removals, where we will never restore the APIs, and we want the admin to take action (and maybe eventually accept the risk as part of the update). Do we want to adjust our update-risk UXes to be more open about discussing risks? For example, we could expose the message for the tip-most Recommended!=True update, or something like that, so the cluster admin could read the message and decide for themselves whether it is a "wait for newer releases" thing or a "fix something in my current cluster state" thing. I think this would reduce the current confusion about "which updates is Upgradeable=False blocking?" (OCPBUGS-9013) and similar.
Some customers will want an older release than OTA-1272's longest hops. --show-outdated-version might flood them with many old releases. This card is about giving them an option, maybe --version=4.17.99, that will show them context about that specific release without distracting them with opinions about other releases.
We currently show all the recommended updates in decreasing order, and --include-not-recommended additionally shows all the updates-with-assessed-risks in decreasing order. But sometimes users want to update to the longest hop, even if there are known risks. Or they want to read about the assessed risks, in case there's something they can do to their cluster to mitigate a currently-assessed risk before kicking off the update. This ticket is about adjusting oc's output to order roughly by release freshness. For example, for a 4.y cluster in a 4.(y+1) channel:
Because users are more likely to care about 4.(y+1).tip, even if it has assessed risks, than they are to care about 4.y.reallyOld, even if it doesn't have assessed risks.
Show some number of these by default, and then use --show-outdated-versions or similar to see all the results.
See Scott here and me in OTA-902 both pitching something in this space.
Blocked on OTA-1271, because that will give us a fresh, tech-preview subcommand, where we can iterate without worrying about breaking existing users, until we're happy enough to GA the new approach.
For example, on 4.12.16 in fast-4.13, oc adm upgrade will currently show between 23 and 91 recommended updates (depending on your exposure to declared update risks):
cincinnati-graph-data$ hack/show-edges.py --cincinnati https://api.openshift.com/api/upgrades_info/graph fast-4.13 | grep '^4[.]12[.]16 ->' | wc -l
23
cincinnati-graph-data$ hack/show-edges.py --cincinnati https://api.openshift.com/api/upgrades_info/graph fast-4.13 | grep '^4[.]12[.]16 ' | wc -l
91
but showing folks 4.12.16-to-4.12.17 is not worth the line it takes, because 4.12.17 is so old, and customers would be much better served by 4.12.63 or 4.12.64, which address many bugs that 4.12.17 was exposed to. With this ticket, oc adm upgrade recommend would show something like:
Recommended updates:

  VERSION   IMAGE
  4.12.64   quay.io/openshift-release-dev/ocp-release@sha256:1263000000000000000000000000000000000000000000000000000000000000
  4.12.63   quay.io/openshift-release-dev/ocp-release@sha256:1262000000000000000000000000000000000000000000000000000000000000

Updates with known issues:

  Version: 4.13.49
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1349111111111111111111111111111111111111111111111111111111111111
  Recommended: False
  Reason: ARODNSWrongBootSequence
  Message: Disconnected ARO clusters or clusters with a UDR 0.0.0.0/0 route definition that are blocking the ARO ACR and quay, are not be able to add or replace nodes after an upgrade https://access.redhat.com/solutions/7074686

There are 21 more recommended updates and 67 more updates with known issues. Use --show-outdated-versions to see all older updates.
Goal:
Provide a Technical Preview of Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.
Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.
The pluggable nature of the Gateway API implementation enables support for additional, optional third-party ingress technologies.
At its core, OpenShift's implementation of Gateway API will be based on the existing Cluster Ingress Operator and OpenShift Service Mesh (OSSM). The Ingress Operator will manage the Gateway API CRDs (gatewayclasses, gateways, httproutes), install and configure OSSM, and configure DNS records for gateways. OSSM will manage the Istio and Envoy deployments for gateways and configure them based on the associated httproutes. Although OSSM in its normal configuration does support service mesh, the Ingress Operator will configure OSSM without service mesh features enabled; for example, using Gateway API will not require the use of sidecar proxies. Istio will be configured specifically to support Gateway API for cluster ingress. See the gateway-api-with-cluster-ingress-operator enhancement proposal for more details.
Additional information on each of the above items can be found here: Networking Definition of Planned
This feature is the place holder for all epics related to technical debt associated with node team
Add a status condition to the nodes.config object with a deprecation message to warn against the use of cgroup v1 mode in case the system is using cgroup v1.
Slack discussion: https://redhat-internal.slack.com/archives/GK6BJJ1J5/p1719508346407769
API change to include the conditions for the status field of nodes.config object.
The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.
BYO Identity will help facilitate CLI-only workflows and the capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD), similar to upstream Kubernetes.
Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customers/users can then configure their IdPs to support the OIDC protocols and workflows they desire, such as the client credentials flow.
The OpenShift OAuth server is still available as the default option, with the ability to plug in the external OIDC provider as a Day-2 configuration.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal
The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.
Why is this important? (mandatory)
OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC), however it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Done - Checklist (mandatory)
The test will serve as a development aid to test functionality as it gets added; the test will be extended/adapted as new features are implemented. This test will live behind the "ExternalOIDC" feature gate.
Goals of the baseline test:
Update OpenShift router to recognize a new annotation key "haproxy.router.openshift.io/ip_allowlist" in addition to the old "haproxy.router.openshift.io/ip_whitelist" annotation key. Continue to allow the old annotation key for now, but use the new one if it is present.
In a future release, we may remove the old annotation key, after allowing ample time for route owners to migrate to the new one. (We may also consider replacing the annotation with a formal API field.)
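A sketch of the intended lookup order (the helper is illustrative; the router's actual plumbing differs):

package main

import "fmt"

const (
	allowlistAnnotation = "haproxy.router.openshift.io/ip_allowlist"
	whitelistAnnotation = "haproxy.router.openshift.io/ip_whitelist"
)

// allowedSourceRanges prefers the new allowlist annotation and falls back to the
// deprecated whitelist annotation so existing routes keep working.
func allowedSourceRanges(annotations map[string]string) string {
	if v, ok := annotations[allowlistAnnotation]; ok {
		return v
	}
	return annotations[whitelistAnnotation]
}

func main() {
	fmt.Println(allowedSourceRanges(map[string]string{
		whitelistAnnotation: "10.0.0.0/8",
	}))
}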
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in the workload being rejected once we start enforcing the "restricted" profile), we must pin the required SCC for all workloads in platform namespaces (openshift-*, kube-*, default).
Each workload should pin the least-privileged SCC that satisfies it, except for workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should pin an SCC for tracking purposes).
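The pinning itself is done with the openshift.io/required-scc annotation on the workload's pod template; a minimal Go sketch (the deployment and SCC name are examples, not a specific component):

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// pinRequiredSCC records which SCC a platform workload expects, so a
// higher-priority custom SCC cannot mutate it unexpectedly.
func pinRequiredSCC(d *appsv1.Deployment, scc string) {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	d.Spec.Template.Annotations["openshift.io/required-scc"] = scc
}

func main() {
	d := &appsv1.Deployment{}
	pinRequiredSCC(d, "restricted-v2")
	fmt.Println(d.Spec.Template.Annotations)
}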
The following tables track progress.
| 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 | 82 | 82 |
fix needed | 68 | 68 | 68 | 68 | 68 | 68 |
fixed | 39 | 39 | 35 | 32 | 39 | 1 |
remaining | 29 | 29 | 33 | 36 | 29 | 67 |
~ remaining non-runlevel | 8 | 8 | 12 | 15 | 8 | 46 |
~ remaining runlevel (low-prio) | 21 | 21 | 21 | 21 | 21 | 21 |
~ untested | 2 | 2 | 2 | 2 | 82 | 82 |
# | namespace | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |||
2 | openshift-apiserver-operator | #573 | #581 | ||||
3 | openshift-authentication | #656 | #675 | ||||
4 | openshift-authentication-operator | #656 | #675 | ||||
5 | openshift-catalogd | #50 | #58 | ||||
6 | openshift-cloud-credential-operator | #681 | #736 | ||||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |||
8 | openshift-cluster-csi-drivers | #6 #118 | #524 #131 #306 #265 #75 | #170 #459 | #484 | ||
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||||
10 | openshift-cluster-olm-operator | #54 | n/a | n/a | |||
11 | openshift-cluster-samples-operator | #535 | #548 | ||||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |||
13 | openshift-cluster-version | #1038 | #1068 | ||||
14 | openshift-config-operator | #410 | #420 | ||||
15 | openshift-console | #871 | #908 | #924 | |||
16 | openshift-console-operator | #871 | #908 | #924 | |||
17 | openshift-controller-manager | #336 | #361 | ||||
18 | openshift-controller-manager-operator | #336 | #361 | ||||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 | ||
20 | openshift-image-registry | #1008 | #1067 | ||||
21 | openshift-ingress | #1032 | |||||
22 | openshift-ingress-canary | #1031 | |||||
23 | openshift-ingress-operator | #1031 | |||||
24 | openshift-insights | #1033 | #1041 | #1049 | #915 | #967 | |
25 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 | ||
26 | openshift-kube-storage-version-migrator | #107 | #112 | ||||
27 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||||
28 | openshift-machine-api | #1308 #1317 | #1311 | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 | |
29 | openshift-machine-config-operator | #4636 | #4219 | #4384 | #4393 | ||
30 | openshift-manila-csi-driver | #234 | #235 | #236 | |||
31 | openshift-marketplace | #578 | #561 | #570 | |||
32 | openshift-metallb-system | #238 | #240 | #241 | |||
33 | openshift-monitoring | #2298 #366 | #2498 | #2335 | #2420 | ||
34 | openshift-network-console | #2545 | |||||
35 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |||
36 | openshift-network-node-identity | #2282 | #2490 | #2496 | |||
37 | openshift-nutanix-infra | #4504 | #4539 | #4540 | |||
38 | openshift-oauth-apiserver | #656 | #675 | ||||
39 | openshift-openstack-infra | #4504 | #4539 | #4540 | |||
40 | openshift-operator-controller | #100 | #120 | ||||
41 | openshift-operator-lifecycle-manager | #703 | #828 | ||||
42 | openshift-route-controller-manager | #336 | #361 | ||||
43 | openshift-service-ca | #235 | #243 | ||||
44 | openshift-service-ca-operator | #235 | #243 | ||||
45 | openshift-sriov-network-operator | #995 | #999 | #1003 | |||
46 | openshift-user-workload-monitoring | #2335 | #2420 | ||||
47 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 | ||
48 | (runlevel) kube-system | ||||||
49 | (runlevel) openshift-cloud-controller-manager | ||||||
50 | (runlevel) openshift-cloud-controller-manager-operator | ||||||
51 | (runlevel) openshift-cluster-api | ||||||
52 | (runlevel) openshift-cluster-machine-approver | ||||||
53 | (runlevel) openshift-dns | ||||||
54 | (runlevel) openshift-dns-operator | ||||||
55 | (runlevel) openshift-etcd | ||||||
56 | (runlevel) openshift-etcd-operator | ||||||
57 | (runlevel) openshift-kube-apiserver | ||||||
58 | (runlevel) openshift-kube-apiserver-operator | ||||||
59 | (runlevel) openshift-kube-controller-manager | ||||||
60 | (runlevel) openshift-kube-controller-manager-operator | ||||||
61 | (runlevel) openshift-kube-proxy | ||||||
62 | (runlevel) openshift-kube-scheduler | ||||||
63 | (runlevel) openshift-kube-scheduler-operator | ||||||
64 | (runlevel) openshift-multus | ||||||
65 | (runlevel) openshift-network-operator | ||||||
66 | (runlevel) openshift-ovn-kubernetes | ||||||
67 | (runlevel) openshift-sdn | ||||||
68 | (runlevel) openshift-storage |
We should be able to correlate flows with network policies:
PoC doc: https://docs.google.com/document/d/14Y3YYFxuOs3o-Lkipf-d7ZZp5gpbk6-01ZT_fTraCu8/edit
There are two possible approaches in terms of implementation:
The PoC describes the former; however, it is probably most interesting to aim for the latter. (95% of the PoC is valid in both cases, i.e. all the "low level" parts: OVS, OVN.) The latter involves more work in FLP.
We need to do a lot of R&D and fix some known issues (e.g., see linked BZs).
R&D is targeted at 4.16, with productization of this feature in 4.17.
Goal
To make the current implementation of the HAProxy config manager the default configuration.
Objectives
https://issues.redhat.com/browse/NE-1788 describes 3 gaps in the implementation of DAC:
Additional gaps were discovered along the way:
This story aims at fixing those gaps.
The goal of this user story is to combine the code from the smoke test user story and results from the spike into an implementation PR.
Since multiple gaps were discovered, a feature gate will be needed to ensure the stability of OCP before the feature can be enabled by default.
Initiative: Improve etcd disaster recovery experience (part3)
With OCPBU-252 and OCPBU-254 we created the foundations for an enhanced recovery procedure in the case of full control-plane loss. This requires researching total control-plane failure scenarios for clusters deployed using the various deployment methodologies.
Epic Goal*
Improve the disaster recovery experience by providing automation for the steps to recover from an etcd quorum loss scenario.
Determining the exact format of the automation (bash script, Ansible playbook, CLI) is part of this epic, but ideally it would be something the admin can initiate on the recovery host that then walks through the disaster recovery steps given the necessary inputs (e.g. backup and static pod files, SSH access to the recovery and non-recovery hosts, etc.).
Why is this important? (mandatory)
There are a large number of manual steps in the currently documented disaster recovery workflow, which customers and support staff have flagged as too cumbersome and error prone.
https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
Providing more automation would improve that experience and also let the etcd team better support and test the disaster recovery workflow.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
(TBD based on the delivery vehicle for the automation):
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
After running the quorum restore script we want to bring the other members back into the cluster automatically.
Currently the init container in
is guarding that case by checking whether the member is part of the cluster already and has an empty datadir.
We need to adjust this check by testing whether the cluster id of the currently configured member and the current datadir refer to the same cluster.
When we detect a mismatch, we can assume the cluster was recovered by quorum restore and we can attempt to move the folder to automatically make the member join the cluster again.
We need to add an e2e test to our disaster recovery suite in order to exercise that the quorum can be restored automatically.
While we're at it, we can also disable the experimental rev bumping introduced with:
https://github.com/openshift/origin/pull/28073
Several steps cover the shutdown of the etcd static pod. We can provide a script to execute, which you can simply run over SSH:
> ssh core@node disable-etcd.sh
That script should move the static pod manifest into a different folder and wait for the containers to shut down.
Currently we have the bump guarded by an env variable:
and a hardcoded bump of 1 billion revisions in:
https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/restore-pod.yaml#L64-L68
With this story we should remove the feature flag and enable the bumping by default. The bump amount should come from the file created in ETCD-652 plus some slack percentage. If the file doesn't exist we assume the default value of a billion again.
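A sketch of that computation in Go (the file name and slack percentage are assumptions; ETCD-652 defines the actual file):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const defaultBump = 1_000_000_000 // fall back to the current hardcoded ~1bn revisions

// bumpAmount reads the latest observed raft index written by the sidecar (see
// ETCD-652) and adds a slack percentage; if the file is missing we keep the default.
func bumpAmount(path string, slackPercent int64) int64 {
	data, err := os.ReadFile(path)
	if err != nil {
		return defaultBump
	}
	v, err := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return defaultBump
	}
	return v + v*slackPercent/100
}

func main() {
	fmt.Println(bumpAmount("/var/lib/etcd/max-raft-index", 10))
}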
With the downstream carry merged in ETCD-696, we need to implement the flag in the CEO.
Based on --force-new-cluster, we need to add a quorum restore script that will only do that, without any inputs.
To enable resource version bumps on restore, we would need to know how far into the future (in terms of revisions) we need to bump.
We can get this information by requesting endpoint status on each member and using the maximum of all RaftIndex fields as the result. Alternatively by finding the current leader and getting its endpoint status directly.
Even though this is not an expensive operation, it should be polled at a sensible interval, e.g. once every 30s.
The result should be written as a text file to the hostPath /var/lib/etcd, which is already mounted on all relevant pods. An additional etcd sidecar container is probably the most sensible choice to run this code.
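A sketch of that sidecar loop using the etcd clientv3 API (the output file name, the endpoint, and the interval are assumptions):

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// maxRaftIndex asks every member for its endpoint status and returns the highest
// RaftIndex seen, which is the ceiling we would need to bump past on restore.
func maxRaftIndex(ctx context.Context, cli *clientv3.Client, endpoints []string) (uint64, error) {
	var max uint64
	for _, ep := range endpoints {
		resp, err := cli.Status(ctx, ep)
		if err != nil {
			return 0, err
		}
		if resp.RaftIndex > max {
			max = resp.RaftIndex
		}
	}
	return max, nil
}

func main() {
	endpoints := []string{"https://10.0.0.4:2379"} // assumed; the sidecar would read the real member list
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	for range time.Tick(30 * time.Second) {
		idx, err := maxRaftIndex(context.Background(), cli, endpoints)
		if err != nil {
			continue // transient errors are fine; we poll again
		}
		// /var/lib/etcd is already mounted on the relevant pods.
		_ = os.WriteFile("/var/lib/etcd/max-raft-index", []byte(fmt.Sprintf("%d", idx)), 0o600)
	}
}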
Currently the readiness probe (of the guard pod) will constantly fail because the restore pod containers do not have the readyZ sidecar container.
Example error message:
> Oct 16 13:42:52 ci-ln-s2hivzb-72292-6r8kj-master-2 kubenswrapper[2624]: I1016 13:42:52.512331 2624 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-etcd/etcd-guard-ci-ln-s2hivzb-72292-6r8kj-master-2" podUID="2baa50c6-b5cd-463e-9b35-165570e94b76" containerName="guard" probeResult="failure" output="Get \"https://10.0.0.4:9980/readyz\": dial tcp 10.0.0.4:9980: connect: connection refused"
AC:
To be broken into one feature epic and a spike:
The MCO today has multiple layers of errors. There are, generally speaking, 4 locations where an error message can appear, from highest to lowest:
The error propagation is generally not 1-to-1. The operator status will generally capture the pool status, but the full error from the Controller/Daemon does not fully bubble up to the pool/operator, and journal logs with errors generally don't get bubbled up at all. This is very confusing for customers/admins working with the MCO without a full understanding of the MCO's internal mechanics:
Using “unexpected on-disk state” as an example, this can be caused by any amount of the following:
Etc. etc.
Since error use cases are wide and varied, there are many improvements we can perform for each individual error state. This epic aims to propose targeted improvements to error messaging and propagation specifically. The goals being:
With a side objective of observability, including reporting all the way to the operator status items such as:
Approaches can include:
Description:
The MCC sends a drain alert when a node drain doesn't succeed within the drain timeout period (currently 1 hour). This is to make sure that the admin takes appropriate action, if required, by looking at the MCC pod logs. The alert contains information on where to look for the logs.
Example alert looks like:
Drain failed on Node <node_name>, updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller
It is possible that the admin may not be able to determine the exact action to take after looking at the MCC pod logs. Adding a runbook (https://github.com/openshift/runbooks) can help the admin troubleshoot and take appropriate action.
Acceptance Criteria:
Phase 2 Goal:
for Phase-1, incorporating the assets from different repositories to simplify asset management.
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.
To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.
This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.
We expect every openshift cluster that relies on Cluster API to have an infrastructure cluster and a cluster object.
These resources should exist for the lifetime of the cluster and should not be able to be removed.
We must ensure that infracluster objects from supported platforms cannot be deleted once created.
Changes to go into the cluster-capi-operator.
Implement Migration core for MAPI to CAPI for AWS
When customers use CAPI, there must be no negative effect from switching over to using CAPI: migration of Machine resources should be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
MAPA has support for users to configure the Network DeviceIndex.
According to AWS, the primary network interface must use the value 0.
It appears that CAPA already forces this (it only supports creating one primary network interface) or assigns these values automatically if you are supplying your own network interfaces.
Therefore, it is likely that we do not need to support this value (MAPA only supports a single network interface), but we must be certain.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
We want to build out a sync controller for both Machine and MachineSet resources.
This card is about bootstrapping the basics of the controllers, with more implementation to follow once we have the base structure.
For this card, we are expecting to create 2 controllers, one for Machines, one for MachineSets.
The MachineSet controller should watch MachineSets from both MachineAPI and ClusterAPI in the respective namespaces that we care about. It should also be able to watch the referenced infrastructure templates from the CAPI MachineSets.
For the Machine controller, it should watch both types of Machines in MachineAPI and ClusterAPI in their respective namespaces. It should also be able to watch for InfrastructureMachines for the CAPI Machines in the openshift-cluster-api namespace.
If changes to any of the above resources occur, the controllers should trigger a reconcile which will fetch both the Machine API and Cluster API versions of the resources, and then split the reconcile depending on which version is authoritative.
Deletion logic will be handled by a separate card, but will need a fork point in the main reconcile that accounts for if either of the resources have been deleted, once they have been fetched from the cache.
Note, if a MachineSet exists only in CAPI, the controller can exit and ignore the reconcile request.
If a Machine only exists in CAPI, but is owned by another object (MachineSet for now) that is then mirrored into MAPI, the Machine needs to be reconciled so that we can produce the MAPI mirror of the Machine.
We have now merged a design for the MAPI to CAPI library, but, have not been extensively testing it up to now.
There are a large number of fields that currently cannot be converted, and we should ensure each of these is tested.
Fuzz testing should be used to create round trip testing and pick up issues in conversion.
Fuzz tests auto-generate data for fields, letting us ensure that combinations of fields are converted appropriately; they also pick up when new fields are introduced into the APIs, by ensuring that every field is correctly round-tripped.
We would like to set up a pattern for fuzz testing that can be used across various providers as we implement new provider conversions.
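A sketch of the intended round-trip pattern for AWS (the conversion helpers are stubs standing in for the real library functions and must be wired up before this test is meaningful):

package conversion

import (
	"testing"

	fuzz "github.com/google/gofuzz"
	"k8s.io/apimachinery/pkg/api/equality"

	mapiv1 "github.com/openshift/api/machine/v1beta1"
)

// Stubs standing in for the real conversion library; replace before running.
func convertMAPIToCAPI(in *mapiv1.AWSMachineProviderConfig) (interface{}, error) {
	panic("replace with the real MAPI -> CAPI conversion")
}

func convertCAPIToMAPI(in interface{}) (*mapiv1.AWSMachineProviderConfig, error) {
	panic("replace with the real CAPI -> MAPI conversion")
}

// TestAWSProviderSpecRoundTrip fuzzes a Machine API AWS providerSpec, converts it to
// the CAPI representation and back, and verifies nothing was lost along the way.
func TestAWSProviderSpecRoundTrip(t *testing.T) {
	fuzzer := fuzz.New().NilChance(0.2).NumElements(0, 3)
	for i := 0; i < 1000; i++ {
		in := mapiv1.AWSMachineProviderConfig{}
		fuzzer.Fuzz(&in)

		capi, err := convertMAPIToCAPI(&in)
		if err != nil {
			continue // unconvertible combinations are asserted on separately
		}
		out, err := convertCAPIToMAPI(capi)
		if err != nil {
			t.Fatalf("backward conversion failed: %v", err)
		}
		if !equality.Semantic.DeepEqual(&in, out) {
			t.Errorf("round trip mismatch:\nin:  %+v\nout: %+v", in, out)
		}
	}
}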
When the Machine and MachineSet MAPI resources are non-authoritative, the Machine and MachineSet controllers should observe this condition and should exit, pausing the reconciliation.
When they pause, they should acknowledge this pause by adding a paused condition to the status and ensuring it is set to true.
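A minimal sketch of acknowledging the pause (the condition type and reason strings are illustrative, and the MAPI status may use its own condition type rather than metav1.Condition):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// acknowledgePause records that the MAPI controller has stopped reconciling a
// non-authoritative resource.
func acknowledgePause(conditions *[]metav1.Condition, generation int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Paused",
		Status:             metav1.ConditionTrue,
		Reason:             "AuthoritativeAPIClusterAPI",
		Message:            "reconciliation is paused because Cluster API is authoritative",
		ObservedGeneration: generation,
	})
}

func main() {
	var conditions []metav1.Condition
	acknowledgePause(&conditions, 1)
	fmt.Printf("%+v\n", conditions)
}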
The core of the Machine API to Cluster API conversion will rely on a bi-directional conversion library that can convert providerSpecs from Machine API into InfraTemplates in Cluster API, and back again.
We should aim to have a platform agnostic interface such that the core logic of the migration mechanism need not care about platforms specific detail.
The library should also be able to return errors when conversion is not possible, which may occur when:
These errors should resemble the API validation errors from webhooks, for familiarity, using utils such as `field.NewPath` and the InvalidValue error types.
We expect this logic to be used in the core sync controllers, responsible for converting Machine API resources to Cluster API resources and vice versa.
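A sketch of what such an error could look like, built with the apimachinery field utilities (the field path and value here are made up for illustration):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation/field"
)

// unsupportedFieldError shows the intended shape of conversion errors: they mirror
// API validation errors from webhooks, so they read familiarly to users.
func unsupportedFieldError(value string) field.ErrorList {
	return field.ErrorList{
		field.Invalid(
			field.NewPath("spec", "providerSpec", "value", "unsupportedField"),
			value,
			"cannot be converted to Cluster API: no equivalent field exists",
		),
	}
}

func main() {
	fmt.Println(unsupportedFieldError("example").ToAggregate())
}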
DoD:
To be able to continue to operate MachineSets, we need a backwards conversion once the migration has occurred. We do not expect users to remove the MAPI MachineSets immediately, and the logic will be required for when we remove the MAPI controllers.
This covers the case where the CAPI MachineSet is authoritative or only a CAPI MachineSet exists.
Epic Goal
This is the epic tracking the work to collect a list of TLS artifacts (certificates, keys and CA bundles).
This list will contain a set of required and optional metadata. Examples of required metadata are ownership (name of the Jira component) and the ability to auto-regenerate the certificate after it has expired while offline. In most cases metadata can be set via annotations on the secret/configmap containing the TLS artifact.
Components not meeting the required metadata will fail CI - i.e. when a pull request makes a component create a new secret, the secret is expected to have all necessary metadata present to pass CI.
This PR will enforce it: "WIP API-1789: make TLS registry tests required".
In order to keep track of existing certs/CA bundles and ensure that they adhere to requirements we need to have a TLS artifact registry setup.
The registry would:
Ref: API-1622
To improve automation, governance, and security, AWS customers extensively use AWS Tags to track resources. Customers want the ability to change user tags on day 2 without having to recreate the cluster to add or modify one or more tags.
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
This feature will only apply to ROSA with Hosted Control Planes, and ROSA Classic / standalone is excluded.
Support reconciliation of tags on day-2 updates from the Infrastructure.status field (see the sketch after the acceptance criteria).
Acceptance criteria
1. Successful updates on tag information updates.
2. Conflict error handling.
3. Unit testcases.
4. e2e testcases.
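A sketch of the read side of that reconciliation (types are from openshift/api config/v1; the diff-and-tag calls against AWS are left to a hypothetical reconciler):

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// desiredTags flattens the day-2 user tags from Infrastructure.status; a reconciler
// would diff these against the tags currently on each AWS resource and apply only
// the changes, handling conflicts as per the acceptance criteria.
func desiredTags(infra *configv1.Infrastructure) map[string]string {
	tags := map[string]string{}
	if infra.Status.PlatformStatus == nil || infra.Status.PlatformStatus.AWS == nil {
		return tags
	}
	for _, t := range infra.Status.PlatformStatus.AWS.ResourceTags {
		tags[t.Key] = t.Value
	}
	return tags
}

func main() {
	infra := &configv1.Infrastructure{Status: configv1.InfrastructureStatus{
		PlatformStatus: &configv1.PlatformStatus{AWS: &configv1.AWSPlatformStatus{
			ResourceTags: []configv1.AWSResourceTag{{Key: "team", Value: "hypershift"}},
		}},
	}}
	fmt.Println(desiredTags(infra))
}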
Allow the user to create the Agent ISO image as a minimal ISO (sans rootfs).
This is supported for the external platform, added for OCI in 4.14. This adds support for the rest of the platforms supported by the agent-based installer.
All platforms supported by agent can install using a minimal ISO:
Currently the agent-based installer creates a full ISO for all platforms except OCI (External), for which a minimal ISO is created by default. Work is being done to support the minimal ISO for all platforms. In this case either a new command must be used to create the minimal ISO instead of the full ISO, or a flag must be added to the "agent create image" command.
UPDATE 9/30: Based on feedback from Zane (https://github.com/openshift/installer/pull/9056#discussion_r1777838533), the plan has changed to use a new field in agent-config.yaml to define that a minimal ISO should be generated, instead of either a new command or a flag on the existing command.
Currently minimal ISO support is only provided for the External platform (see https://issues.redhat.com//browse/AGENT-702). As part of the attached Epic, all platforms will now support minimal ISO. The checks that limit minimal ISO to External platform only should be removed.
With the addition of a new field in agent-config.yaml to create a minimal ISO that can be used on all platforms, an integration test should be added to test this support.
The integration test can check that the created ISO is below the size expected for a full ISO and also that any Ignition files are properly set for minimal ISO support.
Currently the internal documentation describes creating a minimal ISO only for the External platform. With the change to support the minimal ISO on all platforms, the documentation should be updated.
Migrate every occurrence of iptables in OpenShift to use nftables, instead.
Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)
iptables is going away in RHEL 10; we need to replace all remaining usage of iptables in OCP with nftables before then.
The gcp-routes and azure-routes scripts in MCO use iptables rules and need to be ported to use nftables.
Goal Summary
This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.
The Cluster Storage Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cluster Storage Operator needs to pass the Secret Provider Class to the azure-disk and azure-file CSI controllers so they can authenticate with a client certificate.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cluster Network Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cloud Ingress Operator would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
Epic Goal*
Support Managed Service Identity (MSI) authentication in Azure.
Why is this important? (mandatory)
This is a requirement to run storage controllers that require cloud access on Azure with hosted control plane topology.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
We discovered that the azure-disk and azure-file-csi-controllers are reusing CCM managed identity. Each of these three components should have their own managed identity and not reuse another's managed identity.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an AKS mgmt cluster
2. Create a HCP with MI
3. Observe azure-disk and azure-file controllers are reusing azure CCM MI
Actual results:
the azure-disk and azure-file-csi-controllers are reusing CCM managed identity
Expected results:
the azure-disk and azure-file-csi-controllers should each have their own managed identity
Additional info:
The Cluster Ingress Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cloud Ingress Operator would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The image registry can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The image registry would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
Today, Azure installation requires a manually created service principal, which involves relations, permission granting, credential setting, credential storage, credential rotation, credential clean-up, and service principal deletion. This is not only mundane and time-consuming but also less secure, risking access to resources by adversaries due to a lack of credential rotation.
Employ Azure managed credentials, which drastically reduce the required steps to just managed identity creation, permission granting, and resource deletion.
Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.
Operators running on the management side that need to access the Azure customer account will use MSI.
Operands running in the guest cluster should rely on workload identity.
This ticket is to solve the latter.
We need to implement workload identity support in our components that run on the spoke cluster.
Address any TODOs in the code related to this ticket.
https://redhat-external.slack.com/archives/C075PHEFZKQ/p1727710473581569
https://docs.google.com/document/d/1xFJSXi71bl-fpAJBr2MM1iFdUqeQnlcneAjlH8ogQxQ/edit#heading=h.8e4x3inip35u
If we decide to drop the MSI init and adapter and expose the certs in the management cluster directly via Azure Key Vault Secret Store CSI Driver pod volumes, this would remove complexity and avoid the need for highly permissive pods with network access.
Action items:
func azureCreds(options *azidentity.DefaultAzureCredentialOptions) (*azidentity.DefaultAzureCredential, error) {
	if certPath := os.Getenv("AZURE_CLIENT_CERTIFICATE_PATH"); certPath != "" {
		// Set up a watch on our config file; if it changes, we should exit -
		// (we don't have the ability to dynamically reload config changes).
		// watchForChanges and stopCh are defined elsewhere in the PoC.
		if err := watchForChanges(certPath, stopCh); err != nil {
			return nil, err
		}
	}
	return azidentity.NewDefaultAzureCredential(options)
}
Proof of Concept with Ingress as the example OpenShift component - https://github.com/openshift/hypershift/pull/4841/commits/35ac5fd3310b9199309e9e8a47ee661771ec71cf
AZ CLI command to create the key vault
# Create Management Azure Key Vault
az keyvault create \
  --name ${PREFIX} \
  --resource-group ${AKS_RG} \
  --location ${LOCATION} \
  --enable-rbac-authorization
AZ CLI command to create the managed identity for the key vault
## Create the managed identity for the Management Azure Key Vault
az identity create --name "${AZURE_KEY_VAULT_AUTHORIZED_USER}" --resource-group "${AKS_RG}"

AZURE_KEY_VAULT_AUTHORIZED_USER_ID=$(az identity show --name "${AZURE_KEY_VAULT_AUTHORIZED_USER}" --resource-group "${AKS_RG}" --query principalId --output tsv)

az role assignment create \
  --assignee-object-id "${AZURE_KEY_VAULT_AUTHORIZED_USER_ID}" \
  --role "Key Vault Secrets User" \
  --scope /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/"${AKS_RG}" \
  --assignee-principal-type ServicePrincipal
AZ CLI command that creates a Service Principal with a backing cert stored in the Azure Key Vault
az ad sp create-for-rbac \
  --name ingress \
  --role "Contributor" \
  --scopes /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${MANAGED_RG_NAME} \
  --create-cert \
  --cert ${CERTIFICATE_NAME} \
  --keyvault ${KEY_VAULT_NAME}
This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Before GAing Azure, let's make sure we do a final API review.
Before GAing Azure, the API needs to go through review.
https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/nodepool_types.go#L430
https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/hostedcluster_types.go#L877
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there's a number of regulatory (ITAR) and operational constraints customers face prohibiting the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
On the bootstrap node, keep NetworkManager generated resolv.conf updated with the nameserver pointing to the localhost.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This epic covers the scope of automation-related stories in ODC
Automation enhancements for ODC
Description of problem:
If knative operator is installed without creation of any of its instances tests will fail
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Install the knative operator without creating any one (or all three) of its instances
2. Run the knative e2e tests
Actual results:
Tests will fail saying: Error from server particular instance not found
Expected results:
Mechanism should be present to create missing instance
Additional info:
Description of problem:
Enabling the topology tests in CI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
After the addition of the CLI method of operator installation, the test doesn't necessarily require admin privileges. Currently the test adds the overhead of creating an admin session and page navigations, which are not required.
KN-05-TC05, KN-02-TC12, and SF-01-TC06 are flaking on CI due to variable resource creation times and some other unknown factors which need to be identified.
Improve onboarding experience for using Shipwright Builds in OpenShift Console
Enable users to create and use Shipwright Builds in OpenShift Console while requiring minimal expertise about Shipwright
Requirements | Notes | IS MVP |
---|---|---|
Enable creating Shipwright Builds using a form | | Yes |
Allow use of Shipwright Builds for image builds during import flows | | Yes |
Enable access to build strategies through navigation | | Yes |
TBD
TBD
Shipwright Builds UX in Console should provide a simple onboarding path for users in order to transition them from BuildConfigs to Shipwright Builds.
TBD
TBD
TBD
TBD
TBD
Creating Shipwright Builds through YAML is complex and requires Shipwright expertise, which makes it difficult for novice users to use Shipwright.
Provide a form for creating Shipwright Builds
To simplify adoption of Shipwright and ease onboarding
Create build
As a user, I want to create a Shipwright build using the form.
Event discovery allows for dynamic and interactive user experiences and event catalogs provide users with a structured way to discover available events within the system. Users can explore different event types, their descriptions, and associated metadata, making it easier to understand the capabilities and functionalities offered by the system.
By providing visibility into the available events and their characteristics, event catalogs help users understand how the system behaves and what events they can expect to occur as well as streamline the process of subscribing to and consuming events within the system.
EventType doc: https://knative.dev/docs/eventing/features/eventtype-auto-creation/#produce-events-to-the-broker
As a user, I want to see the catalogs for the Knative Events.
As a user, I want to subscribe to the Knative service using a form
Placeholder for small Epics not justifying a standalone Feature, in the context of technical debt and the ability to operate and troubleshoot. This Feature is not needed except during planning phases when we plan Features, until we enter Epic planning.
NO MORE ADDITION OF ANY EPIC post 4.18 planning - Meaning NOW. One Feature per Epic from now on!
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Targeted support is equivalent to that of SR-IOV kernel and MACVLAN; see https://issues.redhat.com/browse/CNF-1470 and https://issues.redhat.com/browse/CNF-5528
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Make the multinetwork-policy daemon manage networks of type `bond`.
This can be achieved by updating the argument `--network-plugins` in the cluster-network-operator:
To date our work within the telecommunications radio access network space has focused primarily on x86-based solutions. Industry trends around sustainability, and more specific discussions with partners and customers, indicate a desire to progress towards ARM-based solutions, with a view to production deployments in roughly a 2025 timeframe. This would mean being able to support one or more RAN partners' DU applications on ARM-based servers.
Depending on the source, 75-85% of service provider network power consumption is attributable to the RAN sites, with data centers making up the remainder. This means that in the face of increased downward pressure on both TCO and carbon footprint (the former for company performance reasons, the latter for regulatory reasons), it is an attractive place to make substantial improvements using economies of scale.
There are currently three main obvious thrusts to how to go about this:
This BU priority focuses on the third of these approaches.
Reference Documents:
Both the Node Tuning Operator and TuneD assume the Intel x86 architecture is used when a Performance Profile is applied. For example, they both configure Intel x86 specific kernel parameters (e.g. intel_pstate).
In order to support Telco RAN DU deployments on the ARM architecture, we will need a way to apply a performance profile to configure the server for low latency applications. This will include tuning common to both Intel/ARM and tuning specific to one of the architectures.
The purpose of this Epic:
The validator for the Huge Pages sizes in NTO needs to be updated to account for more valid options.
Currently it only allows the values "1G" and "2M" but we want to be able to use "512M" on ARM. We may also want to support other values (https://docs.kernel.org/6.3/arm64/hugetlbpage.html) and we probably also want to validate that the size selected is at least valid for the architecture being used.
The validation is performed here: https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.16/pkg/apis/performanceprofile/v2/performanceprofile_validation.go#L56
Original slack thread: https://redhat-internal.slack.com/archives/CQNBUEVM2/p1717011791766049
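For illustration, a minimal Go sketch of the kind of architecture-aware check being proposed is below. It is not the actual NTO validator; the function name and the per-architecture size lists are assumptions that would need to be confirmed against the kernel documentation for each architecture.

package validation

import (
	"fmt"
	"runtime"
)

// validHugepageSizes lists the hugepage sizes considered valid per
// architecture. The values are illustrative only; the real list should be
// derived from the kernel documentation for each architecture and page size.
var validHugepageSizes = map[string][]string{
	"amd64": {"2M", "1G"},
	"arm64": {"64K", "2M", "32M", "512M", "1G", "16G"},
}

// validateHugepageSize checks that the requested size is valid for the given
// architecture (defaulting to the build architecture when arch is empty).
func validateHugepageSize(size, arch string) error {
	if arch == "" {
		arch = runtime.GOARCH
	}
	allowed, ok := validHugepageSizes[arch]
	if !ok {
		return fmt.Errorf("unknown architecture %q", arch)
	}
	for _, s := range allowed {
		if s == size {
			return nil
		}
	}
	return fmt.Errorf("hugepage size %q is not valid for architecture %q (valid sizes: %v)", size, arch, allowed)
}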
This story will serve to collect minor upstream enhancements to NTO that do not directly belong to an objective story in the greater epic
Overview
An elevator pitch (value statement) that describes the parts of a Feature in a clear, concise way that will be addressed by this Epic
Acceptance Criteria
The list of requirements to be met to consider this Epic feature-complete
Done Criteria
References
Links to Gdocs, GitHub, and any other relevant information about this epic.
When setting Autorepair to enabled for a NodePool in OCM, the NodePool controller from HyperShift applies a default CAPI MHC, defined at https://github.com/openshift/hypershift/blob/4954df9582cd647243b42f87f4b10d4302e2b270/hypershift-operator/controllers/nodepool/capi.go#L673, which has a NodeStartupTimeout (from creation to joining the cluster) of 20 minutes.
Bare metal instances are known to be slower to boot (see OSD-13791), and so in classic we have defined 2 MHCs for worker nodes:
We should analyse together with the HyperShift team what is the best way forward to cover this use case.
Initial ideas to explore:
The behavior has been observed within XCMSTRAT-1039, but it is already present with bare metal instances and so can cause poor UX (machines are cycled until we are lucky enough to get a faster boot time).
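As one possible direction for the ideas above, the following is a rough Go sketch of a CAPI MachineHealthCheck with a longer NodeStartupTimeout for bare metal NodePools. The name, selector label, and timeout values are illustrative assumptions, not the values HyperShift would necessarily use.

package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// bareMetalMHC sketches a MachineHealthCheck with a longer NodeStartupTimeout
// than the default 20 minutes, to accommodate slow-booting bare metal instances.
func bareMetalMHC(clusterName string) *clusterv1.MachineHealthCheck {
	startupTimeout := metav1.Duration{Duration: 40 * time.Minute}
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{Name: clusterName + "-baremetal"},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: clusterName,
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{"cluster.x-k8s.io/cluster-name": clusterName},
			},
			// Give bare metal instances more time to boot and join the cluster.
			NodeStartupTimeout: &startupTimeout,
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 8 * time.Minute}},
			},
		},
	}
}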
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
2024-09-11: Committed to DevPreview, referencing downstream images using Helm charts from github.com/openshift/cluster-api-operator. This will coincide with the MCE release, but is not included in the MCE 2.7.0 bundle. Bundle inclusion will be MCE 2.8.0.
2024-09-10: Decide on versioning strategy with OCP Install CAPI and HyperShift
2024-08-12: Having a discussion in #forum-capi-ocp on delivery mechanism
2024-08-08: Community meeting discussion on delivery of ROSA-HCP & Sylva Cluster-api-provider-metal3
2024-08-22: F2F meetings, inception of this EPIC
Include minimum CAPI components in ACM for the supported ROSA-HCP.
MCE 2.7.0 enables ROSA-HCP provisioning support, along with OCP starting to use CAPI
I deploy MCE, and should be able to deploy a ROSA-HCP cluster, with correct credentials and information about the cluster.
ROSA-HCP Cluster API for AWS support in MCE 2.7.0
Portal Doc template that you can access from [The Playbook](
and ensure doc acceptance criteria is met.
The downstream capi-operator has a helm chart defined at [1].
We need to:
[1] https://github.com/openshift/cluster-api-operator/blob/main/index.yaml
Portal Doc template that you can access from [The Playbook](
and ensure doc acceptance criteria is met.
We need to create and publish the index.yaml for the ocp cluster-api-operator helm chart here https://github.com/openshift/cluster-api-operator/tree/main/openshift
the 4.17 release is published here https://github.com/stolostron/stolostron/releases/download/2.12/cluster-api-operator-4.17.tgz
We need to know which of the possible error codes reported in the imageregistry_storage_errors_total metric indicate abnormal operations, so that we can create alerts for the relevant metrics.
Current error codes are:
errCodeUnsupportedMethod = "UNSUPPORTED_METHOD"
errCodePathNotFound      = "PATH_NOT_FOUND"
errCodeInvalidPath       = "INVALID_PATH"
errCodeInvalidOffset     = "INVALID_OFFSET"
errCodeReadOnlyFS        = "READ_ONLY_FILESYSTEM"
errCodeFileTooLarge      = "FILE_TOO_LARGE"
errCodeDeviceOutOfSpace  = "DEVICE_OUT_OF_SPACE"
errCodeUnknown           = "UNKNOWN"
Source: openshift/image-registry/pkg/dockerregistry/server/metrics/errorcodes.go
Acceptance Criteria
ACCEPTANCE CRITERIA
ACCEPTANCE CRITERIA
After the https://issues.redhat.com/browse/MGMT-17867 fix, the multipath device includes the wwn hint. However, the path devices include this hint as well.
The current bmh_agent_controller code may choose any of the devices with the wwn hint as the root device hint.
The code has to be fixed so that, in case of multiple devices with the wwn hint, the multipath device is preferred.
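A minimal Go sketch of the intended selection logic is below; the disk structure and field names are hypothetical stand-ins for the actual bmh_agent_controller types.

package example

import "strings"

// disk is a simplified stand-in for the inventory disk structure used by the
// bmh_agent_controller; field names are illustrative.
type disk struct {
	Name      string // e.g. /dev/sda, /dev/dm-0
	WWN       string
	DriveType string // e.g. "HDD", "SSD", "Multipath"
}

// chooseRootDevice returns a disk matching the WWN hint, preferring a
// multipath device when several devices carry the same WWN.
func chooseRootDevice(disks []disk, wwnHint string) *disk {
	var fallback *disk
	for i := range disks {
		d := &disks[i]
		if !strings.EqualFold(d.WWN, wwnHint) {
			continue
		}
		if d.DriveType == "Multipath" {
			return d // prefer the multipath device over its path devices
		}
		if fallback == nil {
			fallback = d
		}
	}
	return fallback
}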
The external platform was created to allow cloud providers to supply their own integration components (cloud controller manager, etc.) without prior integration into openshift release artifacts. We need to support this new platform in assisted-installer in order to provide a user friendly way to enable such clusters, and to enable new-to-openshift cloud providers to quickly establish an installation process that is robust and will guide them toward success.
This epic is a follow-up of MGMT-15654 where the external platform API was implemented in Assisted-Installer.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Manage the effort for adding jobs for release-ocm-2.12 on assisted installer
https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng
The repositories for which we handle the cut-off are currently:
Merge order:
Note - a CI tool for CRUD operations on job configurations is in progress. We should try to use it for the next FF.
Update the BUNDLE_CHANNELS in the Makefile in assisted-service and run bundle generation.
OCP 4.17 GA should be in 1.10.2024
LSO has not been published to the 4.18 redhat-operators catalog, so it cannot be installed on OpenShift 4.18. Until this is resolved, we explicitly install the 4.17 catalog as redhat-operators-v4-17 and then subscribe to the LSO version from the 4.17 rather than the 4.18 catalog.
Convert the Cluster Configuration single-page form into a multi-step wizard. The goal is to avoid overwhelming the user with all of the information on a single page and to provide guidance through the configuration process.
Wireframes:
Phase1:
https://marvelapp.com/prototype/fjj6g57/screen/76442394
Future:
https://marvelapp.com/prototype/78g662d/screen/71444815
https://marvelapp.com/prototype/7ce7ib3/screen/73190117
Phase 1 wireframes: https://marvelapp.com/prototype/fjj6g57/screen/76442399
This requires UX investigation to handle the case when the base DNS is not set yet and the clusters list has several clusters with the same name.
The API for it is https://github.com/openshift/assisted-service/blob/2bbbcb60eea4ea5a782bde995bdec3dd7dfc1f62/swagger.yaml#L5636
Other assets
https://github.com/openshift/installer/blob/master/docs/user/customization.md
Example
Adding day-1 kernel arguments
Marvel
Description of the problem:
V2CreateClusterManifest should block empty manifests
How reproducible:
100%
Steps to reproduce:
1. POST V2CreateClusterManifest manifest with empty content
Actual results:
Succeeds. Then silently breaks bootkube much later.
Expected results:
API call should fail immediately
CMO creates a default Alertmanager configuration on cluster bootstrap. The configuration should have the following snippet when a cluster proxy is configured:
global:
  http_config:
    proxy_from_environment: true
1. Proposed title of this feature request
Prometheus generating disk activity every two hours causing storage backend issues.
2. What is the nature and description of the request?
We're seeing Prometheus doing some type of disk activity every two hours on the hour on all of our clusters. We'd like to change that default setting so that all clusters aren't hitting our storage at the same time. Need help in finding where to make that config change. I see a knowledgebase article which says this is by design, but we'd like to stagger these if possible. [1][2]
3. Why does the customer need this? (List the business requirements here)
It appears to be impacting their storage clusters. They use Netapp Trident NFS as their PVC backing which serves multiple clusters and the Prometheus-k8s pods use Netapp Trident NFS PVCs for their data. It appears that this 2 hour interval job occurs at the exact time in every cluster and their hope is stagger this in each cluster such as:
Those two hours for every cluster are midnight, 2:00AM, 4:00AM, etc... The question I've had is, can we change it so one cluster does midnight, 2:00AM, 4:00AM, etc... and another cluster does 12:15AM, 2:15AM, 4:15AM, etc... so they both aren't writing to storage at the same time? It's still a 2 hr default.
4. List any affected packages or components.
openshift-monitoring
[1] https://access.redhat.com/solutions/6960833
[2] https://prometheus.io/docs/prometheus/latest/storage/
Upstream issue: https://github.com/prometheus/prometheus/issues/8278
change proposal accepted at Prometheus dev summit: https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit#heading=h.4t8053ed1gi
In terms of risks:
Proposed title of this feature request
Ability to modify UWM Prometheus scrape interval
What is the nature and description of the request?
Customer would like to be able to modify the scrape interval in Prometheus user workload monitoring
Why does the customer need this? (List the business requirements)
Control metric frequency and thus remote write frequency for application monitoring metrics.
List any affected packages or components.
This needs to be done for both Prometheus and Thanos ruler.
The change only affects the UWM Prometheus.
Proposed title of this feature request
Collect accelerator metrics in OCP
What is the nature and description of the request?
With the rise of OpenShift AI, there's a need to collect metrics about accelerator cards (including but not limited to GPUs). It should require little to no configuration from customers, and we recommend deploying a custom text collector with node_exporter.
Why does the customer need this? (List the business requirements)
Display inventory data about accelerators in the OCP admin console (like we do for CPU, memory, ... in the Overview page).
Better understanding of which accelerators are used (Telemetry requirement).
List any affected packages or components.
node_exporter
CMO
Epic Goal
CPMSO support for Power VS was added via PR https://github.com/openshift/installer/pull/7226 and was left in an inactive state by default until testing was complete.
Now we have done enough testing and have the confidence to make it active.
Update the relevant packages in go.mod file of machine-api-provider-powervs repository.
Remove the dependency on blumix-go.
In the cloud-provider-powervs repository we can override the default IAM endpoint by setting iamEndpointOverride: https://github.com/Karthik-K-N/cloud-provider-powervs/blob/7237bad1549aa4f74e5fa1f3d26592605a3f4ca9/ibm/ibm.go#L109.
Make the necessary changes to support this, as sketched below.
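A minimal sketch of the intended endpoint selection is below; the function shape is an assumption, and the default endpoint shown is the public IBM Cloud IAM endpoint.

package example

// defaultIAMEndpoint is the public IBM Cloud IAM endpoint used when no
// override is configured.
const defaultIAMEndpoint = "https://iam.cloud.ibm.com"

// iamEndpoint returns iamEndpointOverride from the cloud provider config when
// set, otherwise the default public IAM endpoint.
func iamEndpoint(iamEndpointOverride string) string {
	if iamEndpointOverride != "" {
		return iamEndpointOverride
	}
	return defaultIAMEndpoint
}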
Epic Goal*
OCP storage components (operators + CSI drivers) should not use environment variables for cloud credentials. This is discouraged by the OCP hardening guide and reported by the compliance operator. Our customers have noticed it: https://issues.redhat.com/browse/OCPBUGS-7270
Why is this important? (mandatory)
We should honor our own recommendations.
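As a rough illustration of the direction, the sketch below prefers a credential mounted as a file (e.g. from a Secret volume) over an environment variable; the file path and variable names are placeholders, not those of any particular driver.

package example

import (
	"fmt"
	"os"
	"strings"
)

// loadCredential prefers a credential mounted as a file (for example from a
// Secret volume) over an environment variable. The goal is to eventually
// remove the environment variable fallback entirely.
func loadCredential(filePath, envVar string) (string, error) {
	if data, err := os.ReadFile(filePath); err == nil {
		return strings.TrimSpace(string(data)), nil
	}
	if v := os.Getenv(envVar); v != "" {
		return v, nil // legacy fallback, discouraged by the hardening guide
	}
	return "", fmt.Errorf("no credential found in %s or $%s", filePath, envVar)
}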
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
none
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
[AWS EBS CSI Driver] cannot provision EBS volumes successfully on CCO manual-mode private clusters
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
Always
Steps to Reproduce:
1. Install a private cluster with manual mode -> https://docs.openshift.com/container-platform/4.16/authentication/managing_cloud_provider_credentials/cco-short-term-creds.html#cco-short-term-creds-format-aws_cco-short-term-creds
2. Create one PVC and a pod that consumes the PVC.
Actual results:
In step 2 the pod,pvc stuck at Pending $ oc logs aws-ebs-csi-driver-controller-75cb7dd489-vvb5j -c csi-provisioner|grep new-pvc I0723 15:25:49.072662 1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started I0723 15:25:49.073701 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc" I0723 15:25:49.656889 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain I0723 15:25:50.657418 1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started I0723 15:25:50.658112 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc" I0723 15:25:51.182476 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain
Expected results:
In step 2 the PV should become Bound (volume provisioning succeeds) and the pod should be Running.
Additional info:
Epic Goal*
Our AWS EBS CSI driver operator is missing some nice-to-have functionality. This Epic is meant to track it, so we can finish it in an upcoming OCP release.
Why is this important? (mandatory)
In general, AWS EBS CSI driver controller should be a good citizen in HyperShift's hosted control plane. It should scale appropriately, report metrics and not use kubeadmin privileges in the guest cluster.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Our operators use the Unstructured client to read HostedControlPlane. HyperShift has published API types that don't require many dependencies, and we could import their types.go; a minimal sketch of the typed read follows.
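A minimal sketch of what the typed read could look like, assuming the published API package exposes the usual scheme registration helper (AddToScheme); error handling is abbreviated.

package example

import (
	"context"

	hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

// getHostedControlPlane reads a HostedControlPlane with a typed client
// instead of the Unstructured client used today.
func getHostedControlPlane(ctx context.Context, namespace, name string) (*hyperv1.HostedControlPlane, error) {
	scheme := runtime.NewScheme()
	if err := hyperv1.AddToScheme(scheme); err != nil {
		return nil, err
	}
	cfg, err := config.GetConfig()
	if err != nil {
		return nil, err
	}
	c, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return nil, err
	}
	hcp := &hyperv1.HostedControlPlane{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, hcp); err != nil {
		return nil, err
	}
	return hcp, nil
}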
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
We get too many false positive bugs like https://issues.redhat.com/browse/OCPBUGS-25333 from SAST scans, especially from the vendor directory. Add a .snyk file like https://github.com/openshift/oc/blob/master/.snyk to each repo to ignore them.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update OCP release number in OLM metadata manifests of:
OLM metadata of the operators are typically in /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56
We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)
EOL, do not upgrade:
The following operators were migrated to csi-operator, do not update these obsolete repos:
tools/library-bump.py and tools/bump-all may be useful. For 4.16, this was enough:
mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16"
4.17 perhaps needs an older prometheus:
../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17"
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
ABI uses the assisted installer + kube-api (or any other client that communicates with the service); all the building blocks related to day-2 installation exist in those components.
The assisted installer can create an installed cluster and use it to perform day-2 operations.
A doc that explains how it's done with kube-api
Parameters that are required from the user:
Actions required from the user
To keep a similar flow between day 1 and day 2, I suggest running the service on each node that the user is trying to add; it will create the cluster definition and start the installation, and after the first reboot it will pull the ignition from the day-1 cluster.
Add the ability for the node-joiner tool to create a config image (analogous to the one generated by openshift-install agent create config-image) with the configuration necessary to import a cluster and add a day-2 node, but no OS.
The config image is small enough that we could probably create it unconditionally and leave it up to the client to decide which one to download.
Deploy Hypershift Operator component to the MCs in the MSFT INT environment.
Acceptance criteria
We generate the HyperShift operator install manifests by running `hypershift install render`, catching STDOUT and storing the output. If non-critical errors occur during the generation step, the generated manifests are no longer processable.
Example: proxy autodiscovery for external-dns fails if no kubeconfig is given. This does not fail the generation task, but results in error messages intertwined with the rest of the generated manifests, making them unprocessable.
We will add a new config parameter to `hypershift install render` to render the manifests to a file instead of STDOUT; a rough sketch is below.
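The following Go sketch illustrates the intended behavior only; the parameter is hypothetical and the final flag name is still to be decided.

package example

import (
	"fmt"
	"io"
	"os"
)

// writeManifests renders the manifests to the given output file when one is
// provided, so that log and error messages on STDOUT cannot corrupt them;
// otherwise it keeps the current behavior of writing to STDOUT.
func writeManifests(renderedManifests []byte, outputFile string) error {
	var out io.Writer = os.Stdout
	if outputFile != "" {
		f, err := os.Create(outputFile)
		if err != nil {
			return fmt.Errorf("failed to create %s: %w", outputFile, err)
		}
		defer f.Close()
		out = f
	}
	_, err := out.Write(renderedManifests)
	return err
}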
Description
This epic covers the changes needed to the ARO RP for
ACCEPTANCE CRITERIA:
What is "done", and how do we measure it? You might need to duplicate this a few times.
NON GOALS:
Only fill this out for Product Management / customer-driven work. Otherwise, delete it.
BREADCRUMBS:
Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.
NOTES:
Need to determine if (in 4.14 azure workload identity functionality) we need to create secrets/secret manifests for each operator manually as part of the ARO cluster install, or if we can leverage credentialsrequests to do this automatically somehow. How will necessary secrets be created?
DESCRIPTION:
ACCEPTANCE CRITERIA:
NON GOALS:
BREADCRUMBS:
Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.
What
Merge upstream kube-rbac-proxy v0.18.1 into downstream.
Why
We need to update the deps to get rid of CVE issues.
Clean up GCPLabelsTags feature gate created for OCPSTRAT-768 feature. Feature was made available as TechPreview in 4.14 and GA in 4.17.
GCPLabelsTags feature gate validation checks should be removed in installer, operator and API.
The FeatureGate check added in the installer for userLabels and userTags should be removed, and the reference made in the install-config GCP schema should be removed.
Acceptance Criteria
The GCPLabelsTags feature gate check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.
The featureGate added in openshift/api should also be removed.
Acceptance Criteria
To ensure the NUMA Resources Operator can be deployed, managed, and utilized effectively within HyperShift hosted OpenShift clusters.
The NUMA resources operator enables NUMA-aware pod scheduling. As HyperShift gains popularity as a cost-effective and portable OpenShift form factor, it becomes important to ensure that the NUMA Resources Operator, like other operators, is fully functional in this environment. This will enable users to leverage NUMA aware pod scheduling, which is important for low-latency and high performance workloads like telco environments.
Deploying the NUMA Resources Operator on a HyperShift hosted OpenShift cluster.
Ensure the operands run correctly on a HyperShift hosted OpenShift cluster.
Pass the e2e test suite on Hypershift hosted OpenShift cluster
NROP needs access to the KubeletConfig so it can pass the TopologyManager policy to RTE.
This story implements: https://github.com/openshift/enhancements/blob/master/enhancements/hypershift/topology-aware-scheduling/topology-aware-scheduling.md
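A minimal Go sketch of extracting the TopologyManager policy from a rendered kubelet configuration is below; how NROP obtains the KubeletConfig on HyperShift (e.g. from a ConfigMap in the hosted control plane namespace) is an assumption left out of scope for this sketch.

package example

import (
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

// topologyManagerPolicy extracts the TopologyManager policy from a rendered
// kubelet configuration so it can be passed to RTE.
func topologyManagerPolicy(kubeletConfYAML []byte) (string, error) {
	kc := &kubeletconfigv1beta1.KubeletConfiguration{}
	if err := yaml.Unmarshal(kubeletConfYAML, kc); err != nil {
		return "", err
	}
	if kc.TopologyManagerPolicy == "" {
		return "none", nil // kubelet default when the field is unset
	}
	return kc.TopologyManagerPolicy, nil
}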
Previously, when integrating 1_performance_suite, we faced an issue with the test case "Number of CPU requests as multiple of SMT count allowed when HT enabled" failing because the test pod failed to be admitted (the API server couldn't find a node with the worker-cnf label).
We started to investigate this, as we couldn't find how and by whom the worker-cnf label was being added to the pod spec. Since we couldn't figure that out, the workaround we introduced was to reapply the worker-cnf label to the worker nodes after each tuning update.
Another thing we were curious about is why the node lost its labels after the performance profile application. We believe this relates to the nodepool rollingUpdate policy (upgradeType: Replace), which replaces the nodes when the tuning configuration changes.
This issue will track the following items:
1. An answer for how the worker-cnf label was added to the testpod.
2. Check with hypershift folks if we can change the nodepool rollingUpdate policy to Inplace, for our CI tests, and discuss the benefits/drawbacks.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
We should make the upstream test suite run with the single host-networked FRR daemonset, as we do with MetalLB.
This can boil down to running those tests as part of the same frr-k8s lane where we test MetalLB.
Address miscellaneous technical debt items in order to maintain code quality, maintainability, and improved user experience.
Role | Contact |
---|---|
PM | Peter Lauterbach |
Documentation Owner | TBD |
Delivery Owner | (See assignee) |
Quality Engineer | (See QA contact) |
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue | <link to GitHub Issue> |
DEV | Upstream code and tests merged | <link to meaningful PR or GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR or GitHub Issue> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
The PR https://github.com/openshift/origin/pull/25483 introduced a report which infers a storage driver's virtualization compatibility by post-processing the openshift-tests results. Unfortunately, this doesn't provide an accurate enough picture of CNV compatibility, and thus we now have and promote the kubevirt-storage-checks. To avoid sending mixed messages, revert this post-processor from openshift-tests.
This epic is to track any stories for hypershift kubevirt development that do not fit cleanly within a larger effort.
Here are some examples of tasks that this "catch all" epic can capture
Users need the ability to set labels on the HostedCluster in order to influence how MCE installs addons into that cluster.
In MCE, when a HostedCluster is created, MCE imports that cluster as a ManagedCluster. MCE has the ability to install addons into ManagedClusters by matching a managedCluster to an install strategy using label selectors. During the import process of importing a HostedCluster as a ManagedCluster, MCE now syncs the labels from the HostedCluster to the ManagedCluster.
This means by being able to set the labels on the HostedCluster, someone can now influence what addons are installed by MCE into that cluster.
Location:
PF component:
AC: Replace react-copy-to-clipboard with PatternFly ClipboardCopy component.
ContainerDropdown
frontend/packages/dev-console/src/components/health-checks/AddHealthChecks.tsx
frontend/public/components/environment.jsx
frontend/public/components/pod-logs.jsx
Move shared Type definitions for CreateSecret to "createsecret/type.ts" file
A.C.
- All CreateSecret components shared Type definitions are in "createsecret/type.ts" file
The SSHAuthSubform component needs to be refactored to address several tech debt issues:
* Rename to SSHAuthSecretForm
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
As part of the spike to determine outdated plugins, the monaco-editor dev dependency is out of date and needs to be updated.
Acceptance criteria:
Need to follow the steps in https://webpack.js.org/migrate/5/#upgrade-webpack-4-and-its-pluginsloaders in order to migrate to Webpack v5.
Acceptance criteria:
As a developer, I want to take advantage of the `status` prop that was introduced in PatternFly 5.3.0, so that I can use it for stories such as ODC-7655, which need it for form validation
AC:
Content-Security-Policy (CSP) header provides a defense-in-depth measure in client-side security, as a second layer of protection against Cross-site Scripting (XSS) and clickjacking attacks.
It is not yet implemented in the OpenShift web console; however, there are some other related security headers present in the OpenShift console that cover some aspects of CSP functionality:
This story follows up on spike https://issues.redhat.com/browse/CONSOLE-4170
The aim of this story is to add initial CSP implementation for Console web application that will use Content-Security-Policy-Report-Only HTTP header to report on CSP violations.
CSP violations should be handled directly by Console code via custom SecurityPolicyViolationEvent handler, which logs the relevant CSP violation data to browser console.
AC:
CSP violations caused by dynamic plugins should trigger a warning within the cluster dashboard / dynamic plugin status.
AC:
We should add a custom ConsolePlugin details page that shows additional plugin information as well as controls (e.g. enable/disable plugin) for consistency with ConsolePlugin list page.
AC:
CONSOLE-4265 introduced an additional ConsolePlugin CRD field for CSP configuration, so plugins can provide their own list of allowed sources. The console-operator needs to vendor these changes and also provide a way to configure the default CSP directives.
AC:
When serving Console HTML index page, we generate the policy that includes allowed (trustworthy) sources.
It may be necessary for some dynamic plugins to add new sources in order to avoid CSP violations at Console runtime.
AC:
Console HTML index template contains an inline script tag used to set up SERVER_FLAGS and visual theme config.
This inline script tag triggers a CSP violation at Console runtime (see attachment for details).
The proper way to address this error is to allow this script tag - either by generating a SHA hash representing its contents or by generating a cryptographically secure random token (nonce) for the script; a sketch of the hash-based option follows.
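A minimal Go sketch of the hash-based option is below; it is not the Console's actual implementation, and the directive shown is only an example.

package example

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// scriptHashSource returns a CSP source expression ('sha256-...') for an
// inline script body, which can be added to the script-src directive so the
// inline SERVER_FLAGS script no longer triggers a violation.
func scriptHashSource(scriptBody string) string {
	sum := sha256.Sum256([]byte(scriptBody))
	return fmt.Sprintf("'sha256-%s'", base64.StdEncoding.EncodeToString(sum[:]))
}

// buildScriptSrc shows an example directive value, e.g.
// Content-Security-Policy-Report-Only: script-src 'self' 'sha256-...'
func buildScriptSrc(scriptBody string) string {
	return "script-src 'self' " + scriptHashSource(scriptBody)
}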
AC:
As part of the AI we would like to supply/generate a manifest file that will install:
Add to assisted installer an option to install MTV operator post cluster installation
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
failed log
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.255 STEP: Building a namespace api object, basename dns @ 08/12/24 15:55:02.257 STEP: Waiting for a default service account to be provisioned in namespace @ 08/12/24 15:55:02.517 STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 08/12/24 15:55:02.581 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.646 Aug 12 15:55:03.941: INFO: configPath is now "/tmp/configfile2098808007" Aug 12 15:55:03.941: INFO: The user is now "e2e-test-dns-dualstack-9bgpm-user" Aug 12 15:55:03.941: INFO: Creating project "e2e-test-dns-dualstack-9bgpm" Aug 12 15:55:04.299: INFO: Waiting on permissions in project "e2e-test-dns-dualstack-9bgpm" ... Aug 12 15:55:04.632: INFO: Waiting for ServiceAccount "default" to be provisioned... Aug 12 15:55:04.788: INFO: Waiting for ServiceAccount "deployer" to be provisioned... Aug 12 15:55:04.972: INFO: Waiting for ServiceAccount "builder" to be provisioned... Aug 12 15:55:05.132: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned... Aug 12 15:55:05.213: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned... Aug 12 15:55:05.281: INFO: Waiting for RoleBinding "system:deployers" to be provisioned... Aug 12 15:55:05.641: INFO: Project "e2e-test-dns-dualstack-9bgpm" has been fully provisioned. STEP: creating a dual-stack service on a dual-stack cluster @ 08/12/24 15:55:05.775 STEP: Running these commands:for i in `seq 1 10`; do [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "172.31.255.230" ] && echo "test_endpoints@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "fd02::7321" ] && echo "test_endpoints_v6@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv4.v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "3.3.3.3 4.4.4.4" ] && echo "test_endpoints@ipv4.v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv6.v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "2001:4860:4860::3333 2001:4860:4860::4444" ] && echo "test_endpoints_v6@ipv6.v4v6.e2e-dns-2700.svc";sleep 1; done @ 08/12/24 15:55:05.935 STEP: creating a pod to probe DNS @ 08/12/24 15:55:05.935 STEP: submitting the pod to kubernetes @ 08/12/24 15:55:05.935 STEP: deleting the pod @ 08/12/24 16:00:06.034 [FAILED] in [It] - github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 STEP: Collecting events from namespace "e2e-test-dns-dualstack-9bgpm". @ 08/12/24 16:00:06.074 STEP: Found 0 events. @ 08/12/24 16:00:06.207 Aug 12 16:00:06.239: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.239: INFO: Aug 12 16:00:06.334: INFO: skipping dumping cluster info - cluster too large Aug 12 16:00:06.469: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-dns-dualstack-9bgpm-user}, err: <nil> Aug 12 16:00:06.506: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-dns-dualstack-9bgpm}, err: <nil> Aug 12 16:00:06.544: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~4QgFXAn8lyosshoHOjJeddr3MJbIL2DnCsoIvJVOGb4}, err: <nil> STEP: Destroying namespace "e2e-test-dns-dualstack-9bgpm" for this suite. 
@ 08/12/24 16:00:06.544 STEP: dump namespace information after failure @ 08/12/24 16:00:06.58 STEP: Collecting events from namespace "e2e-dns-2700". @ 08/12/24 16:00:06.58 STEP: Found 2 events. @ 08/12/24 16:00:06.615 Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: skip schedule deleting pod: e2e-dns-2700/dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30 Aug 12 16:00:06.648: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.648: INFO: Aug 12 16:00:06.743: INFO: skipping dumping cluster info - cluster too large STEP: Destroying namespace "e2e-dns-2700" for this suite. @ 08/12/24 16:00:06.743 • [FAILED] [304.528 seconds] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 [FAILED] Failed: timed out waiting for the condition In [It] at: github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 ------------------------------ Summarizing 1 Failure: [FAIL] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:251 Ran 1 of 1 Specs in 304.528 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped fail [github.com/openshift/origin/test/extended/dns/dns.go:251]: Failed: timed out waiting for the condition Ginkgo exit error 1: exit with code 1
failure reason
TODO
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As an ARO HCP user, I want to be able to:
so that I can remove
Description of criteria:
Detail about what is specifically not being delivered in the story
These are the CRs that need to be manually installed today:
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
oc apply -f https://raw.githubusercontent.com/openshift/api/master/route/v1/zz_generated.crd-manifests/routes-Default.crd.yaml
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
$ hypershift --help
{"level":"error","ts":"2024-11-05T09:26:54Z","logger":"controller-runtime.client.config","msg":"unable to load in-cluster config","error":"unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined", ...
ERROR Failed to get client {"error": "unable to get kubernetes config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable"}
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Can't run hypershift --help without a kubeconfig
Expected results:
Can run hypershift --help without a kubeconfig
Additional info:
Currently, if we don't specify the NSG ID or VNet ID, the CLI will create these for us in the managed RG. In prod ARO these will be in separate RGs, as they will be provided by the customer; we should reflect this in our env.
This will also make the AKS e2e simpler as the jobs won't have to create these resource groups for each cluster.
Steps to Reproduce:
1. Run any hypershift CLI command in an environment without a live cluster e.g. hypershift create cluster --help 2024-10-30T12:19:21+08:00 ERROR Failed to create default options {"error": "failed to retrieve feature-gate ConfigMap: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://ci-op-68zb-ci-op-68zbrc3h-2-53b8f5-qycsv9k7.hcp.northcentralus.azmk8s.io:443/api/v1\": dial tcp: lookup ci-op-68zb-ci-op-68zbrc3h-2-53b8f5-qycsv9k7.hcp.northcentralus.azmk8s.io: no such host"} github.com/openshift/hypershift/cmd/cluster/azure.NewCreateCommand /Users/fxie/Projects/hypershift/cmd/cluster/azure/create.go:480 github.com/openshift/hypershift/cmd/cluster.NewCreateCommands /Users/fxie/Projects/hypershift/cmd/cluster/cluster.go:36 github.com/openshift/hypershift/cmd/create.NewCommand /Users/fxie/Projects/hypershift/cmd/create/create.go:20 main.main /Users/fxie/Projects/hypershift/main.go:64 runtime.main /usr/local/go/src/runtime/proc.go:271 panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x10455ea8c] goroutine 1 [running]: github.com/spf13/cobra.(*Command).AddCommand(0x1400069db08, {0x14000d91a18, 0x1, 0x1}) /Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1311 +0xbc github.com/openshift/hypershift/cmd/cluster.NewCreateCommands() /Users/fxie/Projects/hypershift/cmd/cluster/cluster.go:36 +0x4c4 github.com/openshift/hypershift/cmd/create.NewCommand() /Users/fxie/Projects/hypershift/cmd/create/create.go:20 +0x11c main.main() /Users/fxie/Projects/hypershift/main.go:64 +0x368
Actual results:
panic: runtime error: invalid memory address or nil pointer dereference
This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.
If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.
This is a follow up story for: https://issues.redhat.com/browse/OCPBUGS-7836
The pivot command currently prints an error message and warns the user that it will be removed soon. We are planning to land this in 4.15.
This story will be complete when:
tracking here all the work that needs to be done to configure the ironic containers (ironic-image and ironic-agent-image) to be ready for OCP 4.19
this includes also CI configuration, tools and documentation updates
all the configuration bits need to happen at least one sprint BEFORE 4.19 branching (current target November 22 2024)
docs tasks can be completed after the configuration tasks
the CI tasks need to be completed RIGHT AFTER 4.19 branching happens
tag creation is now automated during OCP tags creation
builder creation has been automated
before moving forward with the 4.19 configuration, we need to be sure that the dependency versions in 4.18 are correctly aligned with the latest upper-constraints
The tools that we use to install Python libraries in containers move much faster than the corresponding packages built for the operating system.
In its latest version, sushy now uses pyproject.toml, specifying pbr and setuptools as "build requirements" and using pbr as the "build engine".
Because of this, due to PEP 517 and 518, pip will use an isolated environment to build the package, blocking the usage of system-installed packages as dependencies.
We need to either install pbr, setuptools and wheel from source, including them in the pip isolated build environment, or use pip's "--no-build-isolation" option to allow using system-installed build packages.
During 4.15, the OCP team is working on allowing booting from iscsi. Today that's disabled by the assisted installer. The goal is to enable that for ocp version >= 4.15.
iscsi boot is enabled for ocp version >= 4.15 both in the UI and the backend.
When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` karg during install to enable iSCSI booting.
yes
Description of the problem:
Since machine networks are computed at installation time in the case of UMN (in the right way), the no-iscsi-nic-belongs-to-machine-cidr validation should be skipped in this case.
We should also skip this validation in the case of day-2 and imported clusters, because those clusters are not created with all the network information that makes this validation work.
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
In order to successfully install OCP on an iSCSI boot volume, we need to make sure that the machine has 2 network interfaces:
This is required because on startup OVS/OVN will reconfigure the default interface (the network interface used for the default gateway). This behavior makes using the default interface impracticable for iSCSI traffic, because we lose the root volume and the node becomes unusable. See https://issues.redhat.com/browse/OCPBUGS-26071
In the scope of this issue we need to:
In case of CMN or SNO, the user should not be able to select the subnet used for the iSCSI traffic.
Historically, assisted-service has only allowed one mirror configuration that would be applied to all spoke clusters. This was done for assisted service to pull the images needed to install OCP on the spoke cluster. The mirror was then copied over to the spoke cluster.
Feature request: Allow each cluster to have its own mirror configuration
Use-case: This came out of the Sylva CAPI project, where they have a pull-through proxy that caches images from Docker. Each spoke cluster created might not have connectivity on the same network, so each will need a different mirror configuration.
The only way to do this right now is using an install config override for every cluster. https://github.com/openshift/assisted-service/blob/master/docs/user-guide/cloud-with-mirror.md
Add per-cluster support in AgentClusterInstall and update ImageDigestSource in the install config, the same as we are doing for the per-service mirror registry.
Nmstate is included in RHEL CoreOS 4.14+, providing `nmstate.service`, which applies the YAML files in the `/etc/nmstate/` folder. Currently, the assisted installer (and maybe other OCP install methods) uses `nmstatectl gc` to generate NetworkManager keyfiles.
Benefit of using `nmstate.service`:
1. No need to generate keyfiles anymore.
2. `nmstate.service` provides nmpolicy support; for example, the YAML below is an nmpolicy creating a bond whose ports are selected by their MAC addresses, without knowing the interface names.
capture:
  port1: interfaces.mac-address=="00:23:45:67:89:1B"
  port2: interfaces.mac-address=="00:23:45:67:89:1A"
desiredState:
  interfaces:
    - name: bond0
      type: bond
      state: up
      link-aggregation:
        mode: active-backup
        ports:
          - "{{ capture.port1.interfaces.0.name }}"
          - "{{ capture.port2.interfaces.0.name }}"
3. Follow-up day-1 and day-2 network configuration tools could look up `/etc/nmstate` to understand the network topology created on day 0.
4. Fallback support with verification. For example, we can have `00-fallback.yml` holding the fallback network setup and `01-install.yml` holding the user-defined network. Nmstate will apply them sequentially; if `01-install.yml` fails nmstate's verification check, nmstate will roll back to the `00-fallback.yml` state.
Please describe what this feature is going to do.
The installer uses nmstate.service without deploying NetworkManager keyfiles.
Please describe what conditions must be met in order to mark this feature as "done".
Document could mention:
If the answer is "yes", please make sure to check the corresponding option.
Not customer related
Not from architect
Gris Ge <fge@redhat.com>, maintainer of nmstate.
This is internal processing of network setup.
For ISOs that have the nmstate binary, use nmpolicy + nmstate.service instead of pre-generating the nmconnection files and the script.
This is a new task that must be included in all Tekton pipelines by November 1st.
We need to add this task to the following components:
Configure assisted-service to use the image that is built by Konflux.
The epic should contain tasks that ease the process of handling security issues.
Description of the problem:
Dependabot can't merge PRs as it doesn't tidy and vendor other modules.
for example - https://github.com/openshift/assisted-service/pull/6595
It seems like the reason is that dependabot only updates one module at a time: if a package is bumped in module A and module B requires module A, then dependabot should bump this package in module B as well, which is currently not happening.
We want to make sure dependabot is bumping all required versions across all branches/repositories.
How reproducible:
Almost every PR
Actual results:
Failing jobs on dependabot PRs
Expected results:
Dependabot bumping dependencies successfully
The Assisted Installer should support backup/restore and disaster recovery scenarios, either using OADP (OpenShift API for Data Protection) for ACM (Advanced Cluster Management), or, using ZTP (Zero Touch Provisioning) flows. I.e. the assisted-service should be resilient in such scenarios which, for this context and effort, means that restored/moved spoke clusters should keep the same state and behave the same on the new hub cluster.
Provide resiliency in the assisted-service for safe backup/restore flows, allowing spoke clusters to be used without any restriction after DR scenarios or moving between hubs.
TBD
Document outlining issues and potential solutions: https://docs.google.com/document/d/1g77MDYOsULHoTWtjjpr7P5_9L4Fsurn0ZUJHCjQXC1Q/edit?usp=sharing
Backup and restore managed (hosted) clusters installed with hosted control planes with the agent platform (assisted-service).
Yes
During the Govtech spike [1]: backup and restore of HCP clusters from one ACM hub to a new ACM hub, it was discovered that the data currently saved for the first iteration [2] of restoring a host isn't enough.
After restoring, the NodePool, Machine, and AgentMachine still showed they were unready and that they were unable to adopt the Nodes. The Agents were completely missing their statuses, which is likely to have caused this.
We'll need to uncover all the issues and all that needs to be saved in order for the restore to complete successfully.
—
[1] HOSTEDCP-2052 - Slack thread
[2] MGMT-18635
Agents need to have their inventory and state restored in the status in order for the NodePool to completely re-adopt the Nodes on restore.
Discovered in https://redhat-internal.slack.com/archives/C07S20C4SHX/p1729874447167699?thread_ts=1729272921.555049&cid=C07S20C4SHX
Currently, we have a few issues with our OCM authorization:
assisted service rhsso auth type is aligned with OCM
Currently, deployment of assisted-installer using authentication mode "rhsso" doesn't work properly; we need to fix this type of deployment so we can test it.
When an Assisted Service SaaS user performs the creation of a new OpenShift cluster, provide the option to enable the Migration Kit for Virtualization (MTV) operator.
Description of the problem:
When creating an SNO cluster,
the UI blocks the user from selecting the MTV operator.
How reproducible:
Steps to reproduce:
1. Create an SNO cluster on 4.17.
2. Go to the operators page.
3.
Actual results:
The MTV operator is disabled and cannot be selected.
Expected results:
It should be selectable.
Allow users to do a basic OpenShift AI installation with one click on the "operators" page of the cluster creation wizard, similar to how the ODF or MCE operators can be installed.
This feature will be done when users can click the "OpenShift AI" check box on the operators page of the cluster creation wizard and end up with an installation that can be used for basic tasks.
Yes.
Feature origin (who asked for this feature?)
In order to complete the setup of some operators it is necessary to do things that can't be done by creating a simple manifest. For example, in order to complete the setup of ODF so that it can be used by OpenShift AI it is necessary to configure the default storage class, and that can't be done with a simple manifest.
One possible way to overcome that limitation is to create a simple manifest that contains a job, so that the job will execute the required operation. In the example above the job will run something like this:
oc annotate storageclass ocs-storagecluster-ceph-rbd storageclass.kubernetes.io/is-default-class=true
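For illustration, a minimal Job manifest that wraps this command might look like the following sketch (the image, namespace, service account, and RBAC are assumptions, not part of the actual implementation):
oc create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: set-default-storageclass
  namespace: openshift-storage
spec:
  template:
    spec:
      serviceAccountName: storageclass-annotator   # needs RBAC allowing it to patch StorageClasses
      restartPolicy: OnFailure
      containers:
      - name: annotate
        image: registry.redhat.io/openshift4/ose-cli:latest
        command:
        - /bin/sh
        - -c
        - oc annotate storageclass ocs-storagecluster-ceph-rbd storageclass.kubernetes.io/is-default-class=true --overwrite
EOF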
Doing that is already possible, but the problem is that the assisted installer will not wait for these jobs to complete before declaring that the cluster is ready. The intent of this ticket is to change the installer so that it will wait.
Add to assisted installer the infrastructure to install the OpenShift AI operator.
This has no link to a planning session, as this predates our Epic workflow definition.
Integrate CP test suite into Prow to display a 99% passing history for enabling this by default: https://github.com/openshift/api/pull/1815/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R9.
This epic is to track stories that are not completed in MON-3537
No need to fall back to prometheus-adapter.
Remove the MetricsServer feature gate, as the default has been switched to metrics-server itself and there is no longer an option to install the alternative prometheus-adapter.
The history of this epic starts with this PR which triggered a lengthy conversation around the workings of the image API with respect to importing imagestreams images as single vs manifestlisted. The imagestreams today by default have the `importMode` flag set to `Legacy` to avoid breaking behavior of existing clusters in the field. This makes sense for single arch clusters deployed with a single arch payload, but when users migrate to use the multi payload, more often than not, their intent is to add nodes of other architecture types. When this happens - it gives rise to problems when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality of existing users who just want to create an imagestream and use it with existing commands.
There was a discussion with David Eads and other staff engineers and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with/ upgraded to a multi payload. So a few things need to happen to achieve this:
Some open questions:
This change enables setting the import mode through the image config API, which is then synced to the apiserver's observed config, enabling the apiserver to set the import mode based on this value. The import mode in the observed config is also populated by default based on the payload type.
poc: https://github.com/Prashanth684/api/commit/c660fba709b71a884d0fc96dd007581a25d2d17a
Following MULTIARCH-4556, the importMode needs to be synced from the image config. If not present, it should be inferred from the CVO, which should provide status on whether the payload is multi or single.
For the apiserver operator to figure out the payload type and set the import mode defaults, the CVO needs to expose that value through the status field. This information is available today in the conditions list, but it's not pretty to extract it and infer the payload type as it is contained in the message string. The way to do it today is shown here. It would be better for CVO to expose it as a separate field which can be easily consumed by any controller and also be used for telemetry in the future.
Track improvements to IPI on Power VS made in the 4.18 release cycle.
Compare the list of Power VS zones that have PER enabled [0] with the list of zones we offer in the installer [1].
Add any regions that have not been added and check that any hardware type offered in that region is added to the list as well. For example, dal10 has s1022 but we do not expose that in the installer.
[0] https://cloud.ibm.com/docs/power-iaas?topic=power-iaas-per#dcs-per
When SNAT is disabled, the only way to reach the needed private endpoints in Power VS is through a Virtual Private Endpoint. Once these are created in the VPC you are connected to, you'll be able to reach the endpoints.
If we are provisioning a disconnected cluster, ensure the COS, DNS, and IAM VPEs are created.
Bump vendored Kubernetes packages (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, etc.) to v0.31.0 or newer version.
Keep vendored packages up to date.
Additional information on each of the above items can be found here: Networking Definition of Planned
1. Other vendored dependencies (such as openshift/api and controller-runtime) may also need to be updated to Kubernetes 1.31.
1. We tracked these bumps as bugs in the past. For example, for OpenShift 4.17 and Kubernetes 1.30: OCPBUGS-38079, OCPBUGS-38101, and OCPBUGS-38102.
None.
The openshift/cluster-ingress-operator repository vendors k8s.io/* v0.30.2. OpenShift 4.18 is based on Kubernetes 1.31.
4.18.
Always.
Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.18/go.mod.
The k8s.io/* packages are at v0.30.2.
The k8s.io/* packages are at v0.31.0 or newer.
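A minimal sketch of the bump, assuming the usual Go module workflow in the cluster-ingress-operator repository (matching bumps of other k8s.io/* and openshift/* dependencies may be needed, as noted above):
go get k8s.io/api@v0.31.0 k8s.io/apimachinery@v0.31.0 k8s.io/client-go@v0.31.0
go mod tidy
go mod vendor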
OCPCLOUD-2514 prevented feature gates from being used with the CCMs.
We have been asked not to remove the feature gates themselves until 4.18.
PR to track: https://github.com/openshift/api/pull/1780
We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.
None
ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17. In the console UI, we have a ClusterTask list page, and ClusterTasks are also listed in the Tasks quick search in the Pipeline builder form.
Remove ClusterTask and references from the console UI and use Tasks from `openshift-pipelines` namespace.
Resolver in Tekton https://tekton.dev/docs/pipelines/resolution-getting-started/
Task resolution: https://tekton.dev/docs/pipelines/cluster-resolver/#task-resolution
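As a sketch of the target pattern (the task name below is illustrative), a Pipeline or TaskRun would reference a Task from the `openshift-pipelines` namespace through the cluster resolver instead of a ClusterTask:
taskRef:
  resolver: cluster
  params:
  - name: kind
    value: task
  - name: name
    value: buildah
  - name: namespace
    value: openshift-pipelines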
Description of problem:
Locally, after setting the flags, we can see the Community tasks. After the change in the PR, ClusterTasks are removed and Community tasks can't be seen even after setting the flag.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Set window.SERVER_FLAGS.GOARCH and window.SERVER_FLAGS.GOOS.
2. Go to the pipeline builder page.
3.
Actual results:
You can't see any tasks
Expected results:
Community tasks should appear after setting the flag
Additional info:
ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17
We have to use Tasks from the `openshift-pipelines` namespace. This change will happen in the console-plugin repo (dynamic plugin). So in the console repository we have to remove all the ClusterTask dependencies if the Pipelines Operator is 1.17 or above.
Description of problem:
Add a flag to disallow the Pipeline edit URL in the console pipelines-plugin so that it does not conflict between the console and the Pipelines console-plugin.
Description of problem:
Add a disallowed flag to hide the pipelines-plugin pipeline builder route, the add action, and the catalog provider extension, as they are migrated to the Pipelines console-plugin, so that there is no duplicate action in the console.
Networking Definition of Planned
Epic Template descriptions and documentation
Simplify the resolv-prepender process to eliminate consistently problematic aspects. Most notably this will include reducing our reliance on the dispatcher script for proper configuration of /etc/resolv.conf by replacing it with a systemd watch on the /var/run/NetworkManager/resolv.conf file.
Over the past five years or so of on-prem networking, the resolv-prepender script has consistently been a problem. Most of these problems relate to the fact that it is triggered as a NetworkManager dispatcher script, which has proven unreliable, despite years of playing whack-a-mole with various bugs and misbehaviors. We believe there is a simpler, less bug-prone way to do this that will improve both the user experience and reduce the bug load from this particular area of on-prem networking.
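A minimal sketch of what such a watch could look like, assuming a systemd path unit (the unit names are hypothetical and this is not the final implementation):
cat > /etc/systemd/system/on-prem-resolv-prepender.path <<'EOF'
[Unit]
Description=Watch the NetworkManager-generated resolv.conf

[Path]
PathChanged=/var/run/NetworkManager/resolv.conf
Unit=on-prem-resolv-prepender.service

[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now on-prem-resolv-prepender.path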
Additional information on each of the above items can be found here: Networking Definition of Planned
As a followup to https://issues.redhat.com/browse/OPNET-568 we should remove the triggering of resolv-prepender from the dispatcher script. There is still some other functionality in the dispatcher script that we need to keep, but once we have the systemd watch it won't be necessary to trigger the service from the script at all.
This is being tracked separately because it is a more invasive change than just adding the watch, so we probably don't want to backport it.
Networking Definition of Planned
Epic Template descriptions and documentation
The goal is to create a set of E2E tests (in the o/origin repository) testing keepalived and haproxy.
Based on past experience, implementing this will be extremely helpful to various teams when debugging networking (and not only networking) issues.
The network stack is complex, and currently debugging keepalived relies mostly on parsing log lines.
Additional information on each of the above items can be found here: Networking Definition of Planned
We had an incident (outage) in the past where OSUS impacted other applications running in that multi-tenant environment along with itself. Refer to [1][2] for more details.
We initially created all Jira cards as part of OTA-552, but the epic grew very large, so some cards are moving to this epic. The associated Jira cards are created to improve the ability of OSUS to handle more requests without causing issues with other applications in a multi-tenant environment.
Update advice is append-only, with 4.y.z releases being added to channels regularly, and new update risks being declared occasionally. This makes caching a very safe behavior, and client-side caching in the CVO would reduce the disruption caused by OpenShift Update Service (OSUS) outages like OTA-1376.
A single failed update-service retrieval currently clears the cache in 4.18. The code is pretty old, so I expect this behavior goes back through 4.12, our oldest release that's not yet end-of-life.
Every time.
1. Run a happy cluster with update advice.
2. Break the update service, e.g. by using OTA-520 for a mock update service.
3. Wait a few minutes for the cluster to notice the breakage.
4. Check its update recommendations, with oc adm upgrade or the new-in-4.18 oc adm upgrade recommend.
No recommendations while the cluster is RetrievedUpdates=False.
Preserving the cached recommendations while the cluster is RetrievedUpdates=False, at least for 24 hours. I'm not committed to a particular time, but 24h is much longer than any OSUS outage we've ever had, and still not so long that we'd expect much in the way of recommendation changes if the service had remained healthy.
In order to provide customers the option to process alert data externally, we need to provide a way for the data to be downloaded from the OpenShift console. The monitoring plugin uses a Virtualized table from the dynamic plugin SDK. We should include the change in this table so it is available for others.
---
NOTE:
There is a duplicate issue in the OpenShift console board: https://issues.redhat.com//browse/CONSOLE-4185
This is because the console > CI/CD > prow configurations require that any PR in the openshift/console repo needs to have an associated Jira issue in the openshift console Jira board.
Given hostedcluster:hypershift_cluster_vcpus:max now exists, we need to use it to derive a vCPU-hours metric.
Related slack thread: https://redhat-internal.slack.com/archives/C0493H149DK/p1719329224733099?thread_ts=1719252265.181669&cid=C0493H149DK
Draft recording rule:
record: hostedcluster:hypershift_cluster_vcpus:vcpu_hours
expr: max by(_id)(count_over_time(hostedcluster:hypershift_cluster_vcpus:max[1h:5m])) / scalar(count_over_time(vector(1)[1h:5m]))
In order to simplify querying a rosa cluster's effective CPU hours, create a consolidated metric for rosa vcpu-hours
Related slack thread: https://redhat-internal.slack.com/archives/C0493H149DK/p1719329224733099?thread_ts=1719252265.181669&cid=C0493H149DK
Draft recording rule:
record: rosa:cluster:vcpu_hours
expr: (hostedcluster:hypershift_cluster_vcpus:vcpu_hours or on (_id) cluster:usage:workload:capacity_virtual_cpu_hours)
Covers all other tech debt stories targeted for 4.16
OKD updates the samples more or less independently from OCP. It would be good to add support for this in library-sync.sh so that OCP and OKD don't "step on each other's toes" when doing the updates.
library-sync.sh should accept a parameter, say --okd, that when set will update only the OKD samples (all of them, because we don't have unsupported samples in OKD), and when not set will update the supported OCP samples (see the sketch below).
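A usage sketch of the proposed flag (the flag name follows the suggestion above and is not implemented yet):
./library-sync.sh --okd    # update only the OKD samples
./library-sync.sh          # update the supported OCP samples (current behavior)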
We need to bump the Kubernetes version to the latest API version OCP is using.
This is what was done last time:
https://github.com/openshift/cluster-samples-operator/pull/409
Find latest stable version from here: https://github.com/kubernetes/api
This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
The samples need to be resynced for OCP 4.18. Pay attention to only update the OCP samples. OKD does this independently.
Note that the rails templates are currently out of sync with the upstream, so care needs to be taken to not mess those up and adopt the upstream version again.
ATTENTION: this card is blocked by SPLAT-1158 (implementing the workflow proposed in https://docs.providers.openshift.org/platform-external )
As a follow-up to SPLAT-1158 and SPLAT-1425, we should create a cluster with platform type "External" and workflows/steps/jobs that run on vSphere infrastructure using the regular OpenShift CI e2e workflow, using the provisioning steps proposed in docs.providers (https://docs.providers.openshift.org/platform-external).
There are currently a few platform "External" steps (install) that are associated with vSphere, but supposedly only the OPCT conformance workflow is using them (needs more investigation).
In the ci-operator, these should be used as a reference for building a new test that will deploy OpenShift on vSphere using platform "External" with and without CCM. This will be similar to the vSphere platform "None" (and platform "External" from SPLAT-1782).
Caveats:
Currently there is a workflow "upi-vsphere-platform-external-ccm", but it isn't used by any jobs. On the other hand, there are a few OPCT conformance workflows using the step "upi-vsphere-platform-external-ovn-pre" to install a cluster on vSphere using platform type External.
Recently, in SPLAT-1425, the regular e2e step gained support for the platform External type; we need to create a workflow consuming the default OCP CI e2e workflow to get signals using the same workflow as the other platforms, which engineers are familiar with (see the sketch below).
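An illustrative ci-operator test stanza for such a job might look like the sketch below (the test name and cluster_profile are assumptions; the workflow name is the existing one mentioned above):
tests:
- as: e2e-vsphere-platform-external-ccm
  steps:
    cluster_profile: vsphere-elastic
    workflow: upi-vsphere-platform-external-ccm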
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:
(1) Low customer interest of using Openshift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update OCP release number in OLM metadata manifests of:
OLM metadata of the operators are typically in /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56
We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes an update of the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and of the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets + go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
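A rough sketch of that bump (the checkout paths and the assets directory below are assumptions about the repository layouts):
# copy the upstream snapshot CRDs into the operator assets
cp external-snapshotter/client/config/crd/*.yaml cluster-csi-snapshot-controller-operator/assets/
# bump the snapshot client API in the operator repo
cd cluster-csi-snapshot-controller-operator
go get -u github.com/kubernetes-csi/external-snapshotter/client/v6
go mod tidy && go mod vendor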
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
Please wait until openshift/api, openshift/library-go, and openshift/client-go are updated to the newest Kubernetes release! There may be non-trivial changes in these libraries.
This includes (but is not limited to):
Operators:
(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)
EOL, do not upgrade:
The following operators were migrated to csi-operator, do not update these obsolete repos:
tools/library-bump.py and tools/bump-all may be useful. For 4.16, this was enough:
mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 \
    --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" \
    --commit-message "Bump all deps for 4.16"
4.17 perhaps needs an older prometheus:
../library-bump.py --debug --web <file with repo list> STOR-XXX \
    --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" \
    --commit-message "Bump all deps for 4.17"
4.18 special:
Add "spec.unhealthyEvictionPolicy: AlwaysAllow" to all PodDisruptionBudget objects of all our operators + operands. See WRKLDS-1490 for details
There has been a change in the library-go function called `WithReplicasHook`. See https://github.com/openshift/library-go/pull/1796.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
This epic is part of the 4.18 initiatives we discussed, it includes:
Once we have an MVP of openshift-tests-extension, migrate k8s-tests in openshift/kubernetes to use it.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
When IBM Cloud Infrastructure bugs/outages prevent proper cleanup of resources, they can prevent the deletion of the Resource Group during cluster destroy. The errors returned because of this are not always helpful and can be confusing.
Version-Release number of selected component (if applicable):
4.16 (and earlier)
How reproducible:
80% when IBM Cloud Infrastructure experiences issues
Steps to Reproduce:
1. When there is a known issue with IBM Cloud Infrastructure (COS, Block Storage, etc.), create an IPI cluster on IBM Cloud.
2. Destroy the cluster.
Actual results:
WARNING Failed to delete resource group us-east-block-test-2-d5ssx: Resource groups with active or pending reclamation instances can't be deleted. Use the CLI commands "ibmcloud resource service-instances --type all" and "ibmcloud resource reclamations" to check for remaining instances, then delete the instances and try again.
Expected results:
More descriptive details on the blocking resource service-instances (not always storage reclamation related). Potentially something helpful to provide to IBM Cloud Support for assistance.
Additional info:
IBM Cloud is working on a PR to help enhance the debug details when these kinds of errors occur. At this time, an ongoing issue, https://issues.redhat.com/browse/OCPBUGS-28870, is causing these failures, where this additional debug information can help identify the problem and guide IBM Cloud Support to resolve it. But this information does not resolve that bug (which is an Infrastructure bug).
Description of problem:
The created Node ISO is missing the architecture (<arch>) in its filename, which breaks consistency with other generated ISOs such as the Agent ISO.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Actual results:
Currently, the Node ISO is being created with the filename node.iso.
Expected results:
Node ISO should be created as node.<arch>.iso to maintain consistency.
Description of problem:
The network-status annotation includes multiple default:true entries for OVN's UDN
Version-Release number of selected component (if applicable):
4.17+
How reproducible:
Always
Steps to Reproduce:
1. Use UDN 2. View network-status annotation, see multiple default:true entries
Actual results:
multiple default:true entries
Expected results:
a single default:true entry
Description of problem:
On the route create page, the Hostname field has id "host", and the Service name field has id "toggle-host", which should be "toggle-service".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-13-193731
How reproducible:
Always
Steps to Reproduce:
1. Check the hostname and service name elements on the route creation page. 2. 3.
Actual results:
1. Service name field has id "toggle-host". screenshot: https://drive.google.com/file/d/1qkUhhzUPsfFw_o2Gj8XXr9QCISH3g1rK/view?usp=drive_link
Expected results:
1. The id should be "toggle-service".
Additional info:
Description of problem:
user is unable to switch to other projects successfully on network policies list page
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-27-051932
How reproducible:
Always
Steps to Reproduce:
1. As cluster-admin or a normal user, visit the network policies list page via Networking -> NetworkPolicies.
2. Open the project dropdown and choose a different project.
3.
Actual results:
2. The user is unable to switch to another project successfully.
Expected results:
2. The user should be able to switch projects any time the project is changed.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1730
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
During the build02 update from 4.14.0-ec.1 to ec.2 I have noticed the following:
$ b02 get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")'
{
  "lastTransitionTime": "2023-06-20T13:40:12Z",
  "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is updating versions\n* Could not update customresourcedefinition \"alertingrules.monitoring.openshift.io\" (512 of 993): the object is invalid, possibly due to local cluster configuration",
  "reason": "MultipleErrors",
  "status": "True",
  "type": "Failing"
}
There is a valid error (the Could not update customresourcedefinition... one) but the whole thing is cluttered by the "Cluster operator authentication is updating versions" message, which is imo not a legit reason for Failing=True condition and should not be there. Before I captured this one I saw the message with three operators instead of just one.
Version-Release number of selected component (if applicable):
4.14.0-ec.2
How reproducible:
No idea
Description of problem:
When using an installer with an amd64 payload, configuring the VMs to use aarch64 is possible through the install-config.yaml:
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: ci.devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
However, the installation will fail with ambiguous error messages:
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.build11.ci.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.59.207.137:6443: connect: connection refused
The actual error hides in the bootstrap VM's system log:
Red Hat Enterprise Linux CoreOS 417.94.202407010929-0 4.17
SSH host key: SHA256:Ng1GpBIlNHcCik8VJZ3pm9k+bMoq+WdjEcMebmWzI4Y (ECDSA)
SSH host key: SHA256:Mo5RgzEmZc+b3rL0IPAJKUmO9mTmiwjBuoslgNcAa2U (ED25519)
SSH host key: SHA256:ckQ3mPUmJGMMIgK/TplMv12zobr7NKrTpmj+6DKh63k (RSA)
ens5: 10.29.3.15 fe80::1947:eff6:7e1b:baac
Ignition: ran on 2024/08/14 12:34:24 UTC (this boot)
Ignition: user-provided config was applied
Ignition: warning at $.kernelArguments: Unused key kernelArguments
Release image arch amd64 does not match host arch arm64
ip-10-29-3-15 login: [ 89.141099] Warning: Unmaintained driver is detected: nft_compat
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Use amd64 installer to install a cluster with aarch64 nodes
Steps to Reproduce:
1. Download the amd64 installer.
2. Generate the install-config.yaml.
3. Edit install-config.yaml to use aarch64 nodes.
4. Invoke the installer.
Actual results:
installation timed out after ~30mins
Expected results:
The installation should fail immediately with a proper error message indicating that the installation is not possible.
Additional info:
https://redhat-internal.slack.com/archives/C68TNFWA2/p1723640243828379
Description of problem:
"Edit Route" from action list doesn't support Form edit.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-21-014704 4.17.0-rc.5
How reproducible:
Always
Steps to Reproduce:
1. Go to a route detail page, click "Edit Route" from the actions dropdown list. 2. 3.
Actual results:
1. It opens YAML tab directly.
Expected results:
1. Should support both Form and YAML edit.
Additional info:
Description of problem:
A slice creation like idPointers := make([]*string, len(ids)) should be corrected to idPointers := make([]*string, 0, len(ids)). When make is called for a slice with only a length (no capacity), the slice is created with that length and filled with zero values; for instance, _ := make([]int, 5) creates {0, 0, 0, 0, 0}. If the slice is then appended to, rather than filled by index, there are extra values:
1. If we append to the slice, the leading zero values remain (this could change the behavior of the function the slice is passed to), and the slice also grows beyond the intended allocation.
2. If we don't fill the slice completely by index (i.e. create a length of 5 and only set 4 elements), the leftover zero values cause the same issue.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Placeholder for bumping CAPO in the installer.
Tests with dynamic namespaces in the name break aggregation (and everything else):
: [sig-architecture] platform pods in ns/openshift-must-gather-8tbzj that restart more than 2 is considered a flake for now
It's only finding 1 of that test and failing aggregation.
Description of problem:
container_network* metrics disappeared from pods
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-13-031847
How reproducible:
always
Steps to Reproduce:
1. Create a pod.
2. Check container_network* metrics from the pod:
$ oc get --raw /api/v1/nodes/jimabug02-95wr2-worker-westus-b2cpv/proxy/metrics/cadvisor | grep container_network_transmit | grep $pod_name
Actual results:
2. It failed to report container_network* metrics
Expected results:
2. It should report container_network* metrics
Additional info:
This may be a regression issue, we hit it in 4.14 https://issues.redhat.com/browse/OCPBUGS-13741
Description of problem:
i18n misses for some provisioners on the Create StorageClass page.
Navigate to Storage -> StorageClasses -> Create StorageClass page.
For Provisioner -> kubernetes.io/glusterfs, missed: Gluster REST/Heketi URL
For Provisioner -> kubernetes.io/quobyte, missed: User
For Provisioner -> kubernetes.io/vsphere-volume, missed: Disk format
For Provisioner -> kubernetes.io/portworx-volume, missed: Filesystem, Select Filesystem
For Provisioner -> kubernetes.io/scaleio, missed: Reference to a configured Secret object
Also missed: Select Provisioner for the placeholder text
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-19-002129
How reproducible:
Always
Steps to Reproduce:
1. Add ?pseudolocalization=true&lng=en at the end of the URL.
2. Navigate to the Storage -> StorageClasses -> Create StorageClass page, click the provisioner dropdown list, and choose a provisioner.
3. Check whether the text is in i18n mode.
Actual results:
the text is not in i18n mode
Expected results:
the text should be in i18n mode
Additional info:
As of now, it is possible to set different architectures for the compute machine pools when both the 'worker' and 'edge' machine pools are defined in the install-config.
Example:
compute:
- name: worker
  architecture: arm64
  ...
- name: edge
  architecture: amd64
  platform:
    aws:
      zones: ${edge_zones_str}
See https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L631
Description of problem:
Log in to the admin console and go to the "Observe -> Metrics" page; there is one additional and useless button to the left of the "Actions" button. See picture: https://drive.google.com/file/d/11CxilYmIzRyrcaISHje4QYhMsx9It3TU/view?usp=drive_link
According to 4.17, the button is for the refresh interval, but it failed to load.
NOTE: same issue for the developer console
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-07-200953
How reproducible:
always
Steps to Reproduce:
1. login admin/developer console, go to "Observe -> Metrics" page
Actual results:
Refresh interval button on "Observe -> Metrics" page failed to load
Expected results:
no error
Additional info:
Description of problem:
The samples operator sync for OCP 4.18 includes an update to the ruby imagestream. This removes EOLed versions of Ruby and upgrades the images to be ubi9 based
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Run build suite tests 2. 3.
Actual results:
Tests fail trying to pull image. Example: Error pulling image "image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8": initializing source docker://image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8: reading manifest 3.0-ubi8 in image-registry.openshift-image-registry.svc:5000/openshift/ruby: manifest unknown
Expected results:
Builds can pull image, and the tests succeed.
Additional info:
As part of the continued deprecation of the Samples Operator, these tests should create their own Ruby imagestream that is kept current.
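A sketch of what such a test-owned imagestream could look like (the tag, source image, and namespace are assumptions for illustration):
oc import-image ruby:3.3-ubi9 \
  --from=registry.access.redhat.com/ubi9/ruby-33 \
  --confirm -n <test-namespace>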
Description of problem:
The example fails in the CI of the Samples Operator because it references a base image (perl:5.30-el7) that is no longer available in the OpenShift library. This needs to be fixed to unblock the release of the Samples Operator for OCP 4.17. There are essentially two ways to fix this:
1. Fix the Perl test template to reference a Perl image available in the OpenShift library.
2. Remove the test (which might be OK because the template seems to actually only be used in the tests).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The test breaks here: https://github.com/openshift/origin/blob/master/test/extended/image_ecosystem/s2i_perl.go#L78 and the line in the test template that specifies the outdated Perl image is here: https://github.com/openshift/origin/blob/master/test/extended/testdata/image_ecosystem/perl-hotdeploy/perl.json#L50
Description of problem:
When we enable OCB in the worker pool and a new image is built, once the builder pod has finished building the image it takes about 10-20 minutes to start applying this new image on the first node.
Version-Release number of selected component (if applicable):
The issue was found while pre-merge verifying https://github.com/openshift/machine-config-operator/pull/4395
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview.
2. Create this MOSC:
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  buildOutputs:
    currentImagePullSecret:
      name: $(oc get -n openshift-machine-config-operator sa default -ojsonpath='{.secrets[0].name}')
  machineConfigPool:
    name: worker
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
    containerFile:
    - containerfileArch: noarch
      content: |-
        # Pull the centos base image and enable the EPEL repository.
        FROM quay.io/centos/centos:stream9 AS centos
        RUN dnf install -y epel-release
        # Pull an image containing the yq utility.
        FROM docker.io/mikefarah/yq:latest AS yq
        # Build the final OS image for this MachineConfigPool.
        FROM configs AS final
        # Copy the EPEL configs into the final image.
        COPY --from=yq /usr/bin/yq /usr/bin/yq
        COPY --from=centos /etc/yum.repos.d /etc/yum.repos.d
        COPY --from=centos /etc/pki/rpm-gpg/RPM-GPG-KEY-* /etc/pki/rpm-gpg/
        # Install cowsay and ripgrep from the EPEL repository into the final image,
        # along with a custom cow file.
        RUN sed -i 's/\$stream/9-stream/g' /etc/yum.repos.d/centos*.repo && \
            rpm-ostree install cowsay ripgrep
EOF
Actual results:
The machine-os-builder pod will be created, then the build pod will be created too, the image will be built, and then it will take about 10-20 minutes to start applying the new build on the first node.
Expected results:
After MCO finishes building the image, it should not take 10-20 minutes to start applying the image on the first node.
Additional info:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This ticket was created by ART pipeline run sync-ci-images
Description of problem:
Circular dependencies in the OCP Console prevent migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
These are the cycles I can observe in public:
webpack compilation dbe21e029f8714842299 41 total cycles, 26 min-length cycles (A -> B -> A) Cycle count per directory: public (41) Index files occurring within cycles: public/components/secrets/create-secret/index.tsx (9) public/components/utils/index.tsx (4) public/module/k8s/index.ts (2) public/components/graphs/index.tsx (1) frontend/public/tokener.html public/tokener.html public/tokener.html frontend/public/index.html public/index.html public/index.html frontend/public/redux.ts public/redux.ts public/reducers/features.ts public/actions/features.ts public/redux.ts frontend/public/co-fetch.ts public/co-fetch.ts public/module/auth.js public/co-fetch.ts frontend/public/actions/features.ts public/actions/features.ts public/redux.ts public/reducers/features.ts public/actions/features.ts frontend/public/components/masthead.jsx public/components/masthead.jsx public/components/masthead-toolbar.jsx public/components/about-modal.tsx public/components/masthead.jsx frontend/public/components/utils/index.tsx public/components/utils/index.tsx public/components/utils/kebab.tsx public/components/utils/index.tsx frontend/public/module/k8s/index.ts public/module/k8s/index.ts public/module/k8s/k8s.ts public/module/k8s/index.ts frontend/public/reducers/features.ts public/reducers/features.ts public/actions/features.ts public/redux.ts public/reducers/features.ts frontend/public/module/auth.js public/module/auth.js public/co-fetch.ts public/module/auth.js frontend/public/components/cluster-settings/cluster-settings.tsx public/components/cluster-settings/cluster-settings.tsx public/components/cluster-settings/cluster-operator.tsx public/components/cluster-settings/cluster-settings.tsx frontend/public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx frontend/public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/utils.ts public/components/secret.jsx public/components/secrets/create-secret/index.tsx frontend/public/components/masthead-toolbar.jsx public/components/masthead-toolbar.jsx public/components/about-modal.tsx public/components/masthead.jsx public/components/masthead-toolbar.jsx frontend/public/actions/features.gql public/actions/features.gql public/actions/features.gql frontend/public/components/utils/kebab.tsx public/components/utils/kebab.tsx public/components/utils/index.tsx public/components/utils/kebab.tsx frontend/public/module/k8s/k8s.ts public/module/k8s/k8s.ts public/module/k8s/index.ts public/module/k8s/k8s.ts frontend/public/module/k8s/swagger.ts public/module/k8s/swagger.ts public/module/k8s/index.ts public/module/k8s/swagger.ts frontend/public/graphql/client.gql public/graphql/client.gql public/graphql/client.gql frontend/public/components/cluster-settings/cluster-operator.tsx public/components/cluster-settings/cluster-operator.tsx public/components/cluster-settings/cluster-settings.tsx public/components/cluster-settings/cluster-operator.tsx frontend/public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx 
frontend/public/components/monitoring/receiver-forms/webhook-receiver-form.tsx public/components/monitoring/receiver-forms/webhook-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/webhook-receiver-form.tsx frontend/public/components/monitoring/receiver-forms/email-receiver-form.tsx public/components/monitoring/receiver-forms/email-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/email-receiver-form.tsx frontend/public/components/monitoring/receiver-forms/slack-receiver-form.tsx public/components/monitoring/receiver-forms/slack-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/slack-receiver-form.tsx frontend/public/components/secrets/create-secret/utils.ts public/components/secrets/create-secret/utils.ts public/components/secret.jsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/utils.ts frontend/public/components/secrets/create-secret/CreateConfigSubform.tsx public/components/secrets/create-secret/CreateConfigSubform.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/CreateConfigSubform.tsx frontend/public/components/secrets/create-secret/UploadConfigSubform.tsx public/components/secrets/create-secret/UploadConfigSubform.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/UploadConfigSubform.tsx frontend/public/components/secrets/create-secret/WebHookSecretForm.tsx public/components/secrets/create-secret/WebHookSecretForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/WebHookSecretForm.tsx frontend/public/components/secrets/create-secret/SSHAuthSubform.tsx public/components/secrets/create-secret/SSHAuthSubform.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/SSHAuthSubform.tsx frontend/public/components/secrets/create-secret/GenericSecretForm.tsx public/components/secrets/create-secret/GenericSecretForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/GenericSecretForm.tsx frontend/public/components/secrets/create-secret/KeyValueEntryForm.tsx public/components/secrets/create-secret/KeyValueEntryForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/KeyValueEntryForm.tsx frontend/public/components/secrets/create-secret/CreateSecret.tsx public/components/secrets/create-secret/CreateSecret.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/CreateSecret.tsx frontend/public/components/secrets/create-secret/SecretSubForm.tsx public/components/secrets/create-secret/SecretSubForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/SecretSubForm.tsx frontend/public/components/about-modal.tsx public/components/about-modal.tsx public/components/masthead.jsx public/components/masthead-toolbar.jsx public/components/about-modal.tsx frontend/public/components/graphs/index.tsx public/components/graphs/index.tsx public/components/graphs/status.jsx public/components/graphs/index.tsx frontend/public/components/modals/error-modal.tsx public/components/modals/error-modal.tsx public/components/utils/index.tsx public/components/utils/webhooks.tsx 
public/components/modals/error-modal.tsx frontend/public/components/image-stream.tsx public/components/image-stream.tsx public/components/image-stream-timeline.tsx public/components/image-stream.tsx frontend/public/components/graphs/status.jsx public/components/graphs/status.jsx public/components/graphs/index.tsx public/components/graphs/status.jsx frontend/public/components/build-pipeline.tsx public/components/build-pipeline.tsx public/components/utils/index.tsx public/components/utils/build-strategy.tsx public/components/build.tsx public/components/build-pipeline.tsx frontend/public/components/build-logs.jsx public/components/build-logs.jsx public/components/utils/index.tsx public/components/utils/build-strategy.tsx public/components/build.tsx public/components/build-logs.jsx frontend/public/components/image-stream-timeline.tsx public/components/image-stream-timeline.tsx public/components/image-stream.tsx public/components/image-stream-timeline.tsx
Description of problem:
The cluster-wide proxy is automatically injected into the remote-write config of the Prometheus k8s CR in the openshift-monitoring project, which is expected, but the noProxy URLs are not. As a result, if the remote-write endpoint is in the noProxy region, metrics are not transferred.
Version-Release number of selected component (if applicable):
RHOCP 4.16.4
How reproducible:
100%
Steps to Reproduce:
1. Configure the proxy custom resource in the RHOCP 4.16.4 cluster.
2. Create the cluster-monitoring-config configmap in the openshift-monitoring project.
3. Inject the remote-write config (without specifically configuring a proxy for remote-write).
4. After saving the modification in the cluster-monitoring-config configmap, check the remoteWrite config in the Prometheus k8s CR. Now it contains the proxyUrl but NOT the noProxy URL (referenced from the cluster proxy). Example snippet:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  [...]
  name: k8s
  namespace: openshift-monitoring
spec:
  [...]
  remoteWrite:
  - proxyUrl: http://proxy.abc.com:8080    <<<<<====== Injected automatically, but there is no noProxy URL.
    url: http://test-remotewrite.test.svc.cluster.local:9090
Actual results:
The proxy URL from proxy CR is getting injected in Prometheus k8s CR automatically when configuring remoteWrite but it doesn't have noProxy inherited from cluster proxy resource.
Expected results:
The noProxy URL should get injected in Prometheus k8s CR as well.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/819
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The 'Are you sure' pop-up windows in the 'Create NetworkPolicy' -> Policy type section, both for Ingress and Egress, do not close automatically after the user triggers the 'Remove all' action.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947 4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Networking -> NetworkPolicies page, click the 'Create NetworkPolicy' button, and change to Form view.
2. In the Policy type -> Ingress/Egress section, click the 'Add Ingress rule' button.
3. Click 'Remove all', and trigger the 'remove all' action in the pop-up window.
Actual results:
The ingress/egress data has been removed, but the pop-up windows are not closed automatically.
Expected results:
Compared with the same behavior on OCP 4.16: after the 'Remove all' action is triggered and executed successfully, the window should be closed automatically.
Additional info:
Description of problem:
As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate. However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
$ oc get featuregates.config.openshift.io cluster -oyaml
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>
Actual results:
Both RouteExternalCertificate and ExternalRouteCertificate were added in the API
Expected results:
We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html
Additional info:
Git commits https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3 https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930 Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219
Description of problem:
Adding a node with `oc adm node-image` fails:
oc adm node-image monitor --ip-addresses 192.168.250.77
time=2024-10-10T11:31:19Z level=info msg=Monitoring IPs: [192.168.250.77]
time=2024-10-10T11:31:19Z level=info msg=Cannot resolve IP address 192.168.250.77 to a hostname. Skipping checks for pending CSRs.
time=2024-10-10T11:31:19Z level=info msg=Node 192.168.250.77: Assisted Service API is available
time=2024-10-10T11:31:19Z level=info msg=Node 192.168.250.77: Cluster is adding hosts
time=2024-10-10T11:31:19Z level=warning msg=Node 192.168.250.77: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking
Version-Release number of selected component (if applicable):
4.17.0
How reproducible: Always
Extra information:
The cluster is deployed using Platform:None and userManagedNetworking on an OpenStack cluster which is used as a test bed for the real hardware Agent Based Installer.
Bootstrap of the cluster itself is successful, but adding nodes as day-2 operations is not working.
During the cluster bootstrap, we see the following log message:
{\"id\":\"valid-platform-network-settings\",\"status\":\"success\",\"message\":\"Platform OpenStack Compute is allowed\"}So after looking at https://github.com/openshift/assisted-service/blob/master/internal/host/validator.go#L569
we suppose that the error is related to `userManagedNetworking`
being set to true when bootstraping and false when adding a node.
A second related issue is why the platform is seen as OpenStack, as neither the cluster-config-v1 configmap containing the install-config nor the infrastructure/cluster object mentions OpenStack.
Not sure if this is relevant, but an external CNI plugin is used here; we have networkType: Calico in the install config.
Description of problem:
In CONSOLE-4187, the metrics page was removed from the console, but some related packages (i.e., the codemirror ones) remained, even though they are now unnecessary
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
IHAC facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP.

ENV DETAILS: Nutanix versions: AOS 6.5.4, NCC 4.6.6.3, PC pc.2023.4.0.2, LCM 3.0.0.1

During the installation process, after the bootstrap node and control-plane nodes are created, the IP addresses on the nodes shown in the Nutanix Dashboard conflict, even when infinite DHCP leases are set. The installation works successfully only when using the Nutanix IPAM. The 4.14 and 4.15 releases also install successfully. The IPs of master0 and master2 are conflicting; please check the attachment.

Sos-reports of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing

The issue was reported via the slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699
Version-Release number of selected component (if applicable):
How reproducible:
Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.
Steps to Reproduce:
1. 2. 3.
Actual results:
The installation will fail.
Expected results:
The installation succeeds to create a Nutanix OCP cluster with the DHCP network.
Additional info:
When provisioning a cluster using IPI with FIPS enabled,
if using virtual media, IPA fails to boot with FIPS; there is an error in machine-os-images:
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: Adding kernel argument ip=dhcp
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: Adding kernel argument fips=1
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: /bin/copy-iso: line 34: [: ip=dhcp: binary operator expected
Description of problem:
[AWS] Installer should have a pre-check for user tags
Version-Release number of selected component (if applicable):
4.18
How reproducible:
always
Steps to Reproduce:
Set user tags as below in install-config (a placement sketch follows below):

userTags:
  usage-user: cloud-team-rebase-bot[bot]

The user tags are applied to many resources, including IAM roles, but the characters "[" and "]" are not allowed in role tags. https://drive.google.com/file/d/148y-cYrfzNQzDwWlUrgMYAGsZAY6gbW4/view?usp=sharing
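For illustration only, a sketch of where these tags sit in install-config.yaml; the region value is an assumption, and the bracket characters in the tag value are what IAM rejects for role tags:

```
# Hypothetical install-config.yaml fragment (illustration only).
# The tag value contains "[" and "]", which IAM does not accept for
# role tags, so IAM role creation fails during the install.
platform:
  aws:
    region: us-east-2          # assumption: any region reproduces this
    userTags:
      usage-user: cloud-team-rebase-bot[bot]
```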
Actual results:
Installation failed as failed to create IAM roles, ref job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api-provider-aws/529/pull-ci-openshift-cluster-api-provider-aws-master-regression-clusterinfra-aws-ipi-proxy-techpreview/1852197133122277376
Expected results:
The installer should have a pre-check for this scenario and exit with an error message if the user tags contain unsupported characters
Additional info:
discussion on slack: https://redhat-internal.slack.com/archives/CF8SMALS1/p1730443557188649
Description of problem:
Move the Events option above Event Source and rename it to Event Types. Also keep the Eventing options together on the Add page.
Validation failures in assisted-service are reported to the user in the output of openshift-install agent wait-for bootstrap-complete. However, when reporting issues to support or escalating to engineering, we quite often have only the agent-gather archive to go on.
Most validation failures in assisted-service are host validations. These can be reconstructed with some difficulty from the assisted-service log, and are readily available in that log starting with 4.17 since we enabled debugging in AGENT-944.
However, there are also cluster validation failures and these are not well logged.
Description of problem: Clicking Size control in PVC form throws a warning error. See the below and attached:
`react-dom.development.js:67 Warning: A component is changing an uncontrolled input to be controlled.`
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to the PVC form and open the browser dev console
2. Click on the Size control to set a value. The warning `Warning: A component is changing an uncontrolled input to be controlled. This is likely caused by the value changing from undefined to a defined value, which should not happen. Decide between using a controlled or uncontrolled input element for the lifetime of the component.` is logged in the console tab.
Actual results:
Expected results:
Additional info:
Description of problem:
Clicking on any route to view its details wrongly takes the route name as the selected project name
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. Go to the Routes list page 2. Click on any route name 3.
Actual results:
2. The route name is taken as the selected project name, so the page keeps loading because that project doesn't exist
Expected results:
2. The route detail page should be displayed
Additional info:
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
Create a cluster with publish: Mixed by using CAPZ.

1. publish: Mixed + apiserver: Internal

install-config:
=================
publish: Mixed
operatorPublishingStrategy:
  apiserver: Internal
  ingress: External
=================

In this case, the api DNS record should not be created in the public DNS zone, but it was created.

$ az network dns record-set cname show --name api.jima07api --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com
{
  "TTL": 300,
  "etag": "6b13d901-07d1-4cd8-92de-8f3accd92a19",
  "fqdn": "api.jima07api.qe.azure.devcluster.openshift.com.",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com/CNAME/api.jima07api",
  "metadata": {},
  "name": "api.jima07api",
  "provisioningState": "Succeeded",
  "resourceGroup": "os4-common",
  "targetResource": {},
  "type": "Microsoft.Network/dnszones/CNAME"
}

2. publish: Mixed + ingress: Internal

install-config:
=============
publish: Mixed
operatorPublishingStrategy:
  apiserver: External
  ingress: Internal
=============

In this case, the load balancer rule on port 6443 should be created in the external load balancer, but it could not be found.

$ az network lb rule list --lb-name jima07ingress-krf5b -g jima07ingress-krf5b-rg
[]
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Specify publish: Mixed plus mixed External/Internal settings for api/ingress 2. Create the cluster 3. Check that the public DNS records and the load balancer rules in the internal/external load balancers are created as expected
Actual results:
See the description; some resources are created unexpectedly or are missing.
Expected results:
Public DNS records and load balancer rules in the internal/external load balancers should be created as expected, based on the settings in install-config
Additional info:
Description of problem:
Multipart upload issues with Cloudflare R2 using S3 api. Some S3 compatible object storage systems like R2 require that all multipart chunks are the same size. This was mostly true before, except the final chunk was larger than the requested chunk size which causes uploads to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Problem shows itself on OpenShift CI clusters intermittently.
Steps to Reproduce:
This behavior has been causing 504 Gateway Timeout issues in the image registry instances in OpenShift CI clusters. It is connected to uploading big images (e.g., 35 GB), but we do not currently have the exact steps to reproduce it. 1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/distribution/distribution/issues/3873 https://github.com/distribution/distribution/issues/3873#issuecomment-2258926705 https://developers.cloudflare.com/r2/api/workers/workers-api-reference/#r2multipartupload-definition (look for "uniform in size")
There is a typo here: https://github.com/openshift/installer/blob/release-4.18/upi/openstack/security-groups.yaml#L370
It should be os_subnet6_range.
That task is only run if os_master_schedulable is defined and greater than 0 in the inventory.yaml
Please review the following PR: https://github.com/openshift/images/pull/195
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/525
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-aws-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
When the user provides an existing VPC, the IBM CAPI will not add ports 443, 5000, and 6443 to the VPC's security group. It is safe to always check for these ports since we only add them if they are missing.
Update kubernetes-apiserver and openshift-apiserver to use k8s 1.31.x which is currently in use for OCP 4.18.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
[Staging] BE 2.35.0, UI 2.34.2 - UI allows LVMS and ODF to be selected and then throws an error
How reproducible:
100%
Steps to reproduce:
1.
Actual results:
Expected results:
Description of problem:
When a normal user tries to create a namespace-scoped network policy, the project selected in the project selection dropdown is not taken into account
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-17-183402
How reproducible:
Always
Steps to Reproduce:
1. As a normal user with a project, view the networkpolicy page /k8s/ns/yapei1-1/networkpolicies/~new/form 2. Click 'affected pods' in the Pod selector section, OR keep everything at the default values and click 'Create'
Actual results:
2. The user sees the following error when clicking 'affected pods': Can't preview pods r: pods is forbidden: User "yapei1" cannot list resource "pods" in API group "" at the cluster scope. The user sees the following error when clicking the 'Create' button: An error occurred: networkpolicies.networking.k8s.io is forbidden: User "yapei1" cannot create resource "networkpolicies" in API group "networking.k8s.io" at the cluster scope
Expected results:
2. Switching to 'YAML view', we can see that the selected project name was not auto-populated in the YAML
Additional info:
Description of problem:
Alerts that have been silenced are still shown on the Console overview page.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. On a cluster installed on version 4.15 2. Silence an alert that is firing by going to Console --> Observe --> Alerting --> Alerts 3. Check that the alert is listed under silenced alerts: Console --> Observe --> Alerting --> Silences 4. Go back to the Console (Overview page); the silenced alert is still shown there
Actual results:
The silenced alert can still be seen on the OCP overview page
Expected results:
The silenced alert should not be shown on the overview page
Additional info:
Description of problem:
Navigation: Storage -> PersistentVolumeClaims -> Details -> Mouse hover on the 'PersistentVolumeClaim details' diagram. Issue: "Available" is translated inside the diagram but not in the mouse hover text
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-01-063526
How reproducible:
Always
Steps to Reproduce:
1. Log into the web console and set the language to non en_US 2. Navigate to Storage -> PersistentVolumeClaims 3. Click on a PersistentVolumeClaim from the list 4. In the Details tab, mouse hover on the 'PersistentVolumeClaim details' diagram 5. The text "xx.yy GiB Available" is in English. 6. The same "Available" is translated inside the diagram but not in the mouse hover text
Actual results:
"Available" translated in-side diagram but not in mouse hover text
Expected results:
"Available" in mouse hover text should be in set language
Additional info:
screenshot reference attached
Description of problem:
FDP released a new OVS 3.4 version that will be used on the host.
We want to maintain the same version in the container.
This is mostly needed for OVN observability feature.
Our e2e jobs fail with:
pods/aws-efs-csi-driver-controller-66f7d8bcf5-zf8vr initContainers[init-aws-credentials-file] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-node-7qj9p containers[csi-driver] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-operator-fcc56998b-2d5x6 containers[aws-efs-csi-driver-operator] must have terminationMessagePolicy="FallbackToLogsOnError"
The jobs should succeed.
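As a rough sketch of what the check expects, assuming the fix is simply to set the field on each flagged container in the operator's asset manifests (the pod name and image below are hypothetical):

```
# Minimal sketch: the origin CI check flags any container that does not
# set terminationMessagePolicy to FallbackToLogsOnError.
apiVersion: v1
kind: Pod
metadata:
  name: example-aws-efs-csi-driver-node   # hypothetical name
spec:
  containers:
    - name: csi-driver
      image: registry.example.com/aws-efs-csi-driver:latest   # hypothetical image
      terminationMessagePolicy: FallbackToLogsOnError
```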
Description of problem:
Various tests in Console's master branch CI are failing due to missing content of <li.pf-v5-c-menu__list-item> element. Check https://search.dptools.openshift.org/?search=within+the+element%3A+%3Cli.pf-v5-c-menu__list-item%3E+but+never+did&maxAge=168h&context=1&type=all&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The actual issue is that a project created via the CLI is not available in the namespace dropdown
When one of our partners was trying to deploy a 4.16 spoke cluster with the ZTP/GitOps approach, they got the following error message in their assisted-service pod:
error msg="failed to get corresponding infraEnv" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:409" error="record not found" go-id=497 preprovisioning_image=storage-1.fi-911.tre.nsn-rdnet.net preprovisioning_image_namespace=fi-911 request_id=cc62d8f6-d31f-4f74-af50-3237df186dc2
After some discussion in the Assisted-Installer forum (https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1723196754444999), Nick Carboni and Alona Paz suggested that "identifier: mac-address" is not supported. The partner currently has ACM 2.11.0 and MCE 2.6.0. However, their older cluster had ACM 2.10 and MCE 2.4.5 and this parameter was working there. Nick and Alona suggested removing "identifier: mac-address" from the siteconfig, and then the installation started to progress. Based on Nick's suggestion, I opened this bug ticket to understand why it stopped working. The partner asked for official documentation on why this parameter no longer works, or confirmation that it is no longer supported.
Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On "VolumeSnapshot" list page, when project dropdown is "All Projects", click "Create VolumeSnapshot", the project "Undefined" is shown on project field.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-27-213503 4.18.0-0.nightly-2024-09-28-162600
How reproducible:
Always
Steps to Reproduce:
1.Go to "VolumeSnapshot" list page, set "All Projects" in project dropdown list. 2.Click "Create VolumeSnapshot", check project field on the creation page. 3.
Actual results:
2. The project is "Undefined"
Expected results:
2. The project should be "default".
Additional info:
Description of problem:
Both TestAWSEIPAllocationsForNLB and TestAWSLBSubnets are flaking on verifyExternalIngressController waiting for DNS to resolve.
lb_eip_test.go:119: loadbalancer domain apps.eiptest.ci-op-d2nddmn0-43abb.origin-ci-int-aws.dev.rhcloud.com was unable to resolve:
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
50%
Steps to Reproduce:
1. Run TestAWSEIPAllocationsForNLB or TestAWSLBSubnets in CI
Actual results:
Flakes
Expected results:
Shouldn't flake
Additional info:
CI Search: FAIL: TestAll/parallel/TestAWSEIPAllocationsForNLB
CI Search: FAIL: TestAll/parallel/TestUnmanagedAWSEIPAllocations
Hello Team,
After a hard reboot of all nodes due to a power outage, a failure to pull the NTO image prevents "ocp-tuned-one-shot.service" from starting, which results in a dependency failure for the kubelet and crio services.
------------
journalctl_--no-pager
Aug 26 17:07:46 ocp05 systemd[1]: Reached target The firstboot OS update has completed.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3577]: NM resolv-prepender: Starting download of baremetal runtime cfg image
Aug 26 17:07:46 ocp05 systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Aug 26 17:07:46 ocp05 systemd[1]: Starting TuneD service from NTO image...
Aug 26 17:07:46 ocp05 nm-dispatcher[3687]: NM resolv-prepender triggered by lo up.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3644]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ lo == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + exit 0
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + exit 0
Aug 26 17:07:46 ocp05 bash[3655]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 podman[3661]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26...
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Main process exited, code=exited, status=125/n/a
Aug 26 17:07:46 ocp05 nm-dispatcher[3793]: NM resolv-prepender triggered by brtrunk up.
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Failed with result 'exit-code'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ brtrunk == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + exit 0
Aug 26 17:07:46 ocp05 systemd[1]: Failed to start TuneD service from NTO image.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Dependencies necessary to run kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Kubernetes Kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet.service: Job kubelet.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Container Runtime Interface for OCI (CRI-O).
Aug 26 17:07:46 ocp05 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet-dependencies.target: Job kubelet-dependencies.target/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + exit 0
-----------
-----------
$ oc get proxy config cluster -oyaml
status:
httpProxy: http://proxy_ip:8080
httpsProxy: http://proxy_ip:8080
$ cat /etc/mco/proxy.env
HTTP_PROXY=http://proxy_ip:8080
HTTPS_PROXY=http://proxy_ip:8080
-----------
-----------
× ocp-tuned-one-shot.service - TuneD service from NTO image
Loaded: loaded (/etc/systemd/system/ocp-tuned-one-shot.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Mon 2024-08-26 17:07:46 UTC; 2h 30min ago
Main PID: 3661 (code=exited, status=125)
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
-----------
Description of problem:
When we added the new bundle metadata encoding `olm.csv.metadata` in https://github.com/operator-framework/operator-registry/pull/1094 (downstreamed for 4.15+), we created situations where
- Konflux-onboarded operators, encouraged to use upstream:latest to generate FBC from templates, and
- IIB-generated catalog images which used earlier opm versions to serve content
could generate the new format but not be able to serve it. One only has to `opm render` an SQLite catalog image, or expand a catalog template.
Version-Release number of selected component (if applicable):
How reproducible:
every time
Steps to Reproduce:
1. opm render an SQLite catalog image 2. 3.
Actual results:
uses `olm.csv.metadata` in the output
Expected results:
only using `olm.bundle.object` in the output
Additional info:
Description of problem:
When a HostedCluster is upgraded to a new minor version, its OLM catalog imagestreams are not updated to use the tag corresponding to the new minor version.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster (4.15.z) 2. Upgrade the HostedCluster to a new minor version (4.16.z)
Actual results:
OLM catalog imagestreams remain at the previous version (4.15)
Expected results:
OLM catalog imagestreams are updated to new minor version (4.16)
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/95
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Last app-sre security scan in production shows issues with the openshift/origin-oauth-proxy image.
https://grafana.stage.devshift.net/d/eds0cjpeszz0ge/acs-cvss?orgId=1
/cc Alona Kaplan
this is case 2 from OCPBUGS-14673
Description of problem:
MHC for the control plane does not work correctly in some cases. Case 2: stop the kubelet service on a master node; the new master reaches Running, the old one is stuck in Deleting, and many cluster operators are degraded. This is a regression, because when I tested this on 4.12 around September 2022, case 2 and case 3 worked correctly. https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-54326
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-05-112833 4.13.0-0.nightly-2023-06-06-194351 4.12.0-0.nightly-2023-06-07-005319
How reproducible:
Always
Steps to Reproduce:
1.Create MHC for control plane apiVersion: machine.openshift.io/v1beta1 kind: MachineHealthCheck metadata: name: control-plane-health namespace: openshift-machine-api spec: maxUnhealthy: 1 selector: matchLabels: machine.openshift.io/cluster-api-machine-type: master unhealthyConditions: - status: "False" timeout: 300s type: Ready - status: "Unknown" timeout: 300s type: Ready liuhuali@Lius-MacBook-Pro huali-test % oc create -f mhc-master3.yaml machinehealthcheck.machine.openshift.io/control-plane-health created liuhuali@Lius-MacBook-Pro huali-test % oc get mhc NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY control-plane-health 1 3 3 machine-api-termination-handler 100% 0 0 Case 2.Stop the kubelet service on the master node, new master get Running, the old one stuck in Deleting, many co degraded. liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-az7c-svq9q-master-1 Starting pod/huliu-az7c-svq9q-master-1-debug ... To use host binaries, run `chroot /host` Pod IP: 10.0.0.6 If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-5.1# systemctl stop kubelet Removing debug pod ... liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-az7c-svq9q-master-1 Ready control-plane,master 95m v1.26.5+7a891f0 huliu-az7c-svq9q-master-2 Ready control-plane,master 95m v1.26.5+7a891f0 huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 19m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 34m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-k747l Ready worker 47m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 83m v1.26.5+7a891f0 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az7c-svq9q-master-1 Running Standard_D8s_v3 westus 97m huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 97m huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 23m huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 39m huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 53m huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 91m liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-az7c-svq9q-master-1 NotReady control-plane,master 107m v1.26.5+7a891f0 huliu-az7c-svq9q-master-2 Ready control-plane,master 107m v1.26.5+7a891f0 huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 32m v1.26.5+7a891f0 huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 2m10s v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 46m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-k747l Ready worker 59m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 95m v1.26.5+7a891f0 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 110m huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 110m huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 36m huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 5m55s huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 52m huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 65m huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 103m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 3h huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 3h huliu-az7c-svq9q-master-c96k8-0 Running 
Standard_D8s_v3 westus 105m huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 75m huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 122m huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 135m huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 173m liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-az7c-svq9q-master-1 NotReady control-plane,master 178m v1.26.5+7a891f0 huliu-az7c-svq9q-master-2 Ready control-plane,master 178m v1.26.5+7a891f0 huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 102m v1.26.5+7a891f0 huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 72m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 116m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-k747l Ready worker 129m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 165m v1.26.5+7a891f0 liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()... baremetal 4.13.0-0.nightly-2023-06-06-194351 True False False 174m cloud-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 176m cloud-credential 4.13.0-0.nightly-2023-06-06-194351 True False False 3h cluster-autoscaler 4.13.0-0.nightly-2023-06-06-194351 True False False 173m config-operator 4.13.0-0.nightly-2023-06-06-194351 True False False 175m console 4.13.0-0.nightly-2023-06-06-194351 True False False 136m control-plane-machine-set 4.13.0-0.nightly-2023-06-06-194351 True False False 71m csi-snapshot-controller 4.13.0-0.nightly-2023-06-06-194351 True False False 174m dns 4.13.0-0.nightly-2023-06-06-194351 True True False 173m DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7." etcd 4.13.0-0.nightly-2023-06-06-194351 True True True 173m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) image-registry 4.13.0-0.nightly-2023-06-06-194351 True True False 165m Progressing: The registry is ready... ingress 4.13.0-0.nightly-2023-06-06-194351 True False False 165m insights 4.13.0-0.nightly-2023-06-06-194351 True False False 168m kube-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) kube-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) kube-scheduler 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) 
kube-storage-version-migrator 4.13.0-0.nightly-2023-06-06-194351 True False False 106m machine-api 4.13.0-0.nightly-2023-06-06-194351 True False False 167m machine-approver 4.13.0-0.nightly-2023-06-06-194351 True False False 174m machine-config 4.13.0-0.nightly-2023-06-06-194351 False False True 60m Cluster not available for [{operator 4.13.0-0.nightly-2023-06-06-194351}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)] marketplace 4.13.0-0.nightly-2023-06-06-194351 True False False 174m monitoring 4.13.0-0.nightly-2023-06-06-194351 True False False 106m network 4.13.0-0.nightly-2023-06-06-194351 True True False 177m DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)... node-tuning 4.13.0-0.nightly-2023-06-06-194351 True False False 173m openshift-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver () openshift-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 170m openshift-samples 4.13.0-0.nightly-2023-06-06-194351 True False False 167m operator-lifecycle-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 174m operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-06-06-194351 True False False 174m operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-06-06-194351 True False False 168m service-ca 4.13.0-0.nightly-2023-06-06-194351 True False False 175m storage 4.13.0-0.nightly-2023-06-06-194351 True True False 174m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods... liuhuali@Lius-MacBook-Pro huali-test % ----------------------- There might be an easier way by just rolling a revision in etcd, stopping kubelet and then observing the same issue.
Actual results:
CEO's member removal controller is getting stuck on the IsBootstrapComplete check that was introduced to fix another bug: https://github.com/openshift/cluster-etcd-operator/commit/c96150992a8aba3654835787be92188e947f557c#diff-d91047e39d2c1ab6b35e69359a24e83c19ad9b3e9ad4e44f9b1ac90e50f7b650R97 It turns out IsBootstrapComplete checks whether a revision is currently rolling out (which makes sense), and the one NotReady node with kubelet gone still has a revision going (rev 7, target 9). More info: https://issues.redhat.com/browse/OCPBUGS-14673?focusedId=22726712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22726712 This causes the etcd member to not be removed, which in turn blocks the vertical scale-down procedure from removing the pre-drain hook, as the member is still present. Effectively you end up with a cluster of 4 control-plane machines, where one is stuck in the Deleting state.
Expected results:
The etcd member should be removed and the machine/node should be deleted
Additional info:
Removing the revision check does fix this issue reliably, but might not be desirable: https://github.com/openshift/cluster-etcd-operator/pull/1087
Description of problem:
Once the min node count is reached, the remaining nodes' taints should not include DeletionCandidateOfClusterAutoscaler
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-arm64-2024-09-13-023103
How reproducible:
Always
Steps to Reproduce:
1. Create an IPI cluster 2. Create a MachineAutoscaler and a ClusterAutoscaler (see the sketch below) 3. Create a workload so that scaling happens
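For step 2, a minimal sketch of the two autoscaler objects; the names, namespace placement for the MachineSet reference, and replica counts are hypothetical, not taken from the report:

```
# Hypothetical ClusterAutoscaler/MachineAutoscaler pair for step 2.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-scaler                  # hypothetical name
  namespace: openshift-machine-api
spec:
  minReplicas: 1                       # assumed min node count
  maxReplicas: 3                       # assumed max node count
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: example-worker-machineset    # hypothetical MachineSet name
```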
Actual results:
The DeletionCandidateOfClusterAutoscaler taint is present even after the min node count is reached
Expected results:
The above taint should not be present on nodes once the min node count is reached
Additional info:
logs from the test - https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Runner/1037951/console
must-gather - https://drive.google.com/file/d/1zB2r-BRHjC12g17_Abc-xvtEqpJOopI5/view?usp=sharing
We reproduced it manually and waited around 15 minutes; the taint was still present.
Description of problem:
When the TelemeterClientFailures alert fires, there's no runbook link explaining the meaning of the alert and what to do about it.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Check the TelemeterClientFailures alerting rule's annotations 2. 3.
Actual results:
No runbook_url annotation.
Expected results:
runbook_url annotation is present.
Additional info:
This is a consequence of a telemeter server outage that triggered questions from customers about the alert: https://issues.redhat.com/browse/OHSS-25947 https://issues.redhat.com/browse/OCPBUGS-17966 Also in relation to https://issues.redhat.com/browse/OCPBUGS-17797
When adding a BMH with
spec:
  online: true
  customDeploy:
    method: install_coreos
after inspection the BMO will provision the node in ironic
but the node is now being created without any userdata/ignition data,
IPA's ironic_coreos_install then goes down a seldom-used path to create ignition from scratch; the created ignition is invalid and the node fails to boot after it is provisioned.
Boot stalls with an ignition error: "invalid config version (couldn't parse)"
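For reference, a minimal sketch of a BareMetalHost carrying the fields mentioned above; the host name, BMC address, credentials secret, and MAC address are hypothetical:

```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example-worker-0                 # hypothetical host
  namespace: openshift-machine-api
spec:
  online: true
  customDeploy:
    method: install_coreos
  bmc:
    address: redfish-virtualmedia://192.0.2.10/redfish/v1/Systems/1   # hypothetical BMC
    credentialsName: example-worker-0-bmc-secret                      # hypothetical secret
  bootMACAddress: 52:54:00:00:00:01                                   # hypothetical MAC
```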
Sync downstream with upstream
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/853
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Prometheus write_relabel_configs in remoteWrite is unable to drop a metric in Grafana
Version-Release number of selected component (if applicable):
How reproducible:
The customer has tried both configurations to drop an MQ metric, with source_labels (configuration 1) and without source_labels (configuration 2), but neither works. It appears that the drop configuration is not applied properly.

Configuration 1:
```
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  write_relabel_configs:
  - source_labels: ['__name__']
    regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```

Configuration 2:
```
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  write_relabel_configs:
  - regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```

The customer wants to know the correct remote_write configuration to drop the metric before it is sent to Grafana.

Document links:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#configuring-remote-write-storage_configuring-the-monitoring-stack
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#creating-user-defined-workload-monitoring-configmap_configuring-the-monitoring-stack
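For comparison, a minimal sketch of the same drop rule using camelCase keys (writeRelabelConfigs / sourceLabels), on the assumption that the monitoring ConfigMap expects Prometheus Operator-style field names rather than raw Prometheus syntax; this is a sketch to verify against the linked docs, not a confirmed resolution of the bug:

```
# Sketch: drop rule expressed with camelCase keys (assumption: the
# monitoring config follows the Prometheus Operator remoteWrite spec
# rather than write_relabel_configs / source_labels).
remoteWrite:
  - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    writeRelabelConfigs:
      - sourceLabels: ['__name__']
        regex: 'ibmmq_qmgr_uptime'
        action: 'drop'
    basicAuth:
      username:
        name: kubepromsecret
        key: username
      password:
        name: kubepromsecret
        key: password
```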
Steps to Reproduce:
1. 2. 3.
Actual results:
prometheus remote_write configurations NOT droppping metric in Grafana
Expected results:
prometheus remote_write configurations should drop metric in Grafana
Additional info:
Description of problem:
Using payload built with https://github.com/openshift/installer/pull/8666/ so that master instances can be provisioned from gen2 image, which is required when configuring security type in install-config. Enable TrustedLaunch security type in install-config: ================== controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: encryptionAtHost: true settings: securityType: TrustedLaunch trustedLaunch: uefiSettings: secureBoot: Enabled virtualizedTrustedPlatformModule: Enabled Launch capi-based installation, installer failed after waiting 15min for machines to provision... INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5 INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5-gen2 INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master INFO Waiting up to 15m0s (until 6:26AM UTC) for machines [jima08conf01-9vgq5-bootstrap jima08conf01-9vgq5-master-0 jima08conf01-9vgq5-master-1 jima08conf01-9vgq5-master-2] to provision... ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API INFO Stopped controller: azure infrastructure provider INFO Stopped controller: azureaso infrastructure provider INFO Local Cluster API system has completed operations In openshift-install.log, time="2024-07-08T06:25:49Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jima08conf01-9vgq5-rg/jima08conf01-9vgq5-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/virtualMachines/jima08conf01-9vgq5-bootstrap" time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-08T06:25:49Z" level=debug msg="\tRESPONSE 400: 400 Bad Request" time="2024-07-08T06:25:49Z" level=debug msg="\tERROR CODE: BadRequest" time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-08T06:25:49Z" level=debug msg="\t{" time="2024-07-08T06:25:49Z" level=debug msg="\t \"error\": {" time="2024-07-08T06:25:49Z" level=debug msg="\t \"code\": \"BadRequest\"," time="2024-07-08T06:25:49Z" level=debug msg="\t \"message\": \"Use of TrustedLaunch setting is not supported for the provided image. Please select Trusted Launch Supported Gen2 OS Image. For more information, see https://aka.ms/TrustedLaunch-FAQ.\"" time="2024-07-08T06:25:49Z" level=debug msg="\t }" time="2024-07-08T06:25:49Z" level=debug msg="\t}" time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-08T06:25:49Z" level=debug msg=" > controller=\"azuremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AzureMachine\" AzureMachine=\"openshift-cluster-api-guests/jima08conf01-9vgq5-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"jima08conf01-9vgq5-bootstrap\" reconcileID=\"bee8a459-c3c8-4295-ba4a-f3d560d6a68b\"" Looks like that capi-based installer missed to enable security features during creating gen2 image, which can be found in terraform code. https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L166-L169 Gen2 image definition created by terraform: $ az sig image-definition show --gallery-image-definition jima08conf02-4mrnz-gen2 -r gallery_jima08conf02_4mrnz -g jima08conf02-4mrnz-rg --query 'features' [ { "name": "SecurityType", "value": "TrustedLaunch" } ] It's empty when querying from gen2 image created by using CAPI. $ az sig image-definition show --gallery-image-definition jima08conf01-9vgq5-gen2 -r gallery_jima08conf01_9vgq5 -g jima08conf01-9vgq5-rg --query 'features' $
Version-Release number of selected component (if applicable):
4.17 payload built from cluster-bot with PR https://github.com/openshift/installer/pull/8666/
How reproducible:
Always
Steps to Reproduce:
1. Enable security type in install-config 2. Create cluster by using CAPI 3.
Actual results:
Install failed.
Expected results:
Install succeeded.
Additional info:
It impacts installation with security type ConfidentialVM or TrustedLaunch enabled.
Description of the problem:
Cluster installation with static configuration for IPv4 and IPv6.
Discovery completed, but without the configured IP addresses; installation aborted on bootstrap reboot.
https://redhat-internal.slack.com/archives/C02RD175109/p1727157947875779
Two issues:
#1 The static configuration is not applied because `autoconf: false` is missing.
It was working before, but it is now mandatory for IPv6.
#2 need to update test-infra code.
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
When we move one node from one custom MCP to another custom MCP, the MCPs are reporting a wrong number of nodes. For example, we reach this situation (worker-perf MCP is not reporting the right number of nodes) $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4 After 20 minutes or half an hour the MCPs start reporting the right number of nodes
Version-Release number of selected component (if applicable):
IPI on AWS version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.0-0.nightly-2024-09-13-040101 True False 124m Cluster version is 4.17.0-0.nightly-2024-09-13-040101
How reproducible:
Always
Steps to Reproduce:
1. Create a MCP

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf: ""
EOF

2. Add 2 nodes to the MCP

$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf=
$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf=

3. Create another MCP

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf,worker-perf-canary] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf-canary: ""
EOF

4. Move one node from the MCP created in step 1 to the MCP created in step 3

$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary=
$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
Actual results:
The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP. $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4
Expected results:
MCPs should always report the right number of nodes
Additional info:
It is very similar to this other issue https://bugzilla.redhat.com/show_bug.cgi?id=2090436 That was discussed in this slack conversation https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
Description of problem:
1. Creating a Normal User:
```
$ oc create user test
user.user.openshift.io/test created
$ oc get user
NAME   UID                                    FULL NAME   IDENTITIES
test   cef90f53-715e-4c10-9e26-c431d31de8c3
```
This command worked as expected, and the user appeared correctly in both the CLI and the web console.

2. Using Special Characters:
```
$ oc create user test$*(
> test)
user.user.openshift.io/test( test) created
$ oc get user
NAME      UID                                    FULL NAME   IDENTITIES
test      cef90f53-715e-4c10-9e26-c431d31de8c3
test(...  50f2ad2b-1385-4b3c-b32c-b84531808864
```
In this case, the user was created successfully and displayed correctly in the web console as test( test). However, the CLI output was not as expected.

3. Handling Quoted Names:
```
$ oc create user test'
> test'
$ oc get user
NAME     UID                                    FULL NAME   IDENTITIES
test...  1fdaadf0-7522-4d38-9894-ee046a58d835
```
Similarly, creating a user with quotes produced a discrepancy: the CLI displayed test..., but the web console showed it as test test.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
Given in the description.
Actual results:
The user list is not displayed properly.
Expected results:
1. User should not be created with a line break. 2. If they are being created, then they should be displayed properly.
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After upgrading from 4.12 to 4.14, the customer reports that the pods cannot reach their service when a NetworkAttachmentDefinition is set.
How reproducible:
Create a NetworkAttachmentDefinition
Steps to Reproduce:
1. Create a pod with a service. 2. Curl the service from inside the pod. It works. 3. Create a NetworkAttachmentDefinition (see the example sketch below). 4. The same curl does not work.
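For illustration, a minimal NetworkAttachmentDefinition of the sort used in step 3; the CNI type, master interface, and addressing here are assumptions, not the customer's exact configuration:

```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-macvlan       # hypothetical name
  namespace: example-ns       # hypothetical namespace
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": {
        "type": "static",
        "addresses": [ { "address": "192.0.2.20/24" } ]
      }
    }
```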
Actual results:
Pod does not reach service
Expected results:
Pod reaches service
Additional info:
Specifically updating the bug overview for posterity: the specific issue is that we have pods set up with an exposed port (8080, though the port doesn't matter) and a service with one endpoint pointing to that specific pod. We can call OTHER PODS in the same namespace via their single-endpoint services, but we cannot call OURSELVES from inside the pod. The issue is with hairpin loopback return. It is not affected by NetworkPolicy and appears (as discovered later in this Jira) to be caused by asymmetric routing in the return path to the container after it leaves the local net. This behavior is only observed when a network-attachment-definition is added to the pod and appears to be an issue with the way route rules are defined. A workaround is available: inject a route into the container specifically, or modify the net-attach-def to ensure a loopback route is available in the container space.
KCS for this problem with workarounds + patch fix versions (when available): https://access.redhat.com/solutions/7084866
Description of problem:
Unable to deploy performance profile on multi nodepool hypershift cluster
Version-Release number of selected component (if applicable):
Server Version: 4.17.0-0.nightly-2024-07-28-191830 (management cluster) Server Version: 4.17.0-0.nightly-2024-08-08-013133 (hosted cluster)
How reproducible:
Always
Steps to Reproduce:
1. In a multi nodepool hypershift cluster, attach performance profile unique to each nodepool. 2. Check the configmap and nodepool status.
Actual results:
root@helix52:~# oc get cm -n clusters-foobar2 | grep foo
kubeletconfig-performance-foobar2   1   21h
kubeletconfig-pp2-foobar3           1   21h
machineconfig-performance-foobar2   1   21h
machineconfig-pp2-foobar3           1   21h
nto-mc-foobar2                      1   21h
nto-mc-foobar3                      1   21h
performance-foobar2                 1   21h
pp2-foobar3                         1   21h
status-performance-foobar2          1   21h
status-pp2-foobar3                  1   21h
tuned-performance-foobar2           1   21h
tuned-pp2-foobar3                   1   21h
root@helix52:~# oc get np
NAME      CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION                         UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
foobar2   foobar2   2               2               False         False        4.17.0-0.ci-2024-08-08-225819   False             True
foobar3   foobar2   1               1               False         False        4.17.0-0.ci-2024-08-08-225819   False             True
Hypershift Pod logs - {"level":"debug","ts":"2024-08-14T08:54:27Z","logger":"events","msg":"there cannot be more than one PerformanceProfile ConfigMap status per NodePool. found: 2 NodePool: foobar3","type":"Warning","object":{"kind":"NodePool","namespace":"clusters","name":"foobar3","uid":"c2ba814a-31fe-409d-88c2-b4e6b9a41b26","apiVersion":"hypershift.openshift.io/v1beta1","resourceVersion":"6411003"},"reason":"ReconcileError"}
Expected results:
Performance profile should apply correctly on both node pools
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/294
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There are two additional zones, syd05 and us-east(dal13) that have PER capabilities but are not present in the installer. Add them.
Version-Release number of selected component (if applicable):
4.18.0
Description of problem:
When running oc-mirror in mirror-to-disk mode in an air-gapped environment with `graph: true`, and with the UPDATE_URL_OVERRIDE environment variable defined, oc-mirror still reaches out to api.openshift.com to get graph.tar.gz. This causes the mirroring to fail, as this URL is not reachable from an air-gapped environment.
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407260908.p0.gdfed9f1.assembly.stream.el9-dfed9f1", GitCommit:"dfed9f10cd9aabfe3fe8dae0e6a8afe237c901ba", GitTreeState:"clean", BuildDate:"2024-07-26T09:52:14Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Set up OSUS in a reachable network. 2. Cut all internet connectivity except for the mirror registry and the OSUS service. 3. Run oc-mirror in mirror-to-disk mode with graph:true in the imagesetconfig.
Actual results:
Expected results:
Should not fail
Additional info:
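A minimal sketch of the behaviour the report expects, assuming the override is read from the UPDATE_URL_OVERRIDE environment variable named above; the function and constant names are illustrative and not oc-mirror's actual code:

```go
package main

import (
	"fmt"
	"os"
)

// defaultGraphEndpoint is illustrative; the report only says oc-mirror
// reaches out to api.openshift.com for graph.tar.gz when no override is set.
const defaultGraphEndpoint = "https://api.openshift.com"

// graphEndpoint prefers the UPDATE_URL_OVERRIDE environment variable so
// air-gapped mirror-to-disk runs never contact the public update service.
func graphEndpoint() string {
	if override := os.Getenv("UPDATE_URL_OVERRIDE"); override != "" {
		return override
	}
	return defaultGraphEndpoint
}

func main() {
	fmt.Println("graph data endpoint:", graphEndpoint())
}
```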
Description of problem:
IBM Cloud CCM was reconfigured to use loopback as the bind address in 4.16. However, the liveness probe was not configured to use loopback too, so the CCM constantly fails the liveness probe and restarts continuously.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud. 2. Watch the IBM Cloud CCM pod; its restart count increases every 5 minutes (liveness probe timeout).
Actual results:
# oc --kubeconfig cluster-deploys/eu-de-4.17-rc2-3/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS             RESTARTS          AGE
ibm-cloud-controller-manager-58f7747d75-j82z8   0/1     CrashLoopBackOff   262 (39s ago)     23h
ibm-cloud-controller-manager-58f7747d75-l7mpk   0/1     CrashLoopBackOff   261 (2m30s ago)   23h

Normal   Killing      34m (x2 over 40m)    kubelet   Container cloud-controller-manager failed liveness probe, will be restarted
Normal   Pulled       34m (x2 over 40m)    kubelet   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ac9fb24a0e051aba6b16a1f9b4b3f9d2dd98f33554844953dd4d1e504fb301e" already present on machine
Normal   Created      34m (x3 over 45m)    kubelet   Created container cloud-controller-manager
Normal   Started      34m (x3 over 45m)    kubelet   Started container cloud-controller-manager
Warning  Unhealthy    29m (x8 over 40m)    kubelet   Liveness probe failed: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused
Warning  ProbeError   3m4s (x22 over 40m)  kubelet   Liveness probe error: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused body:
Expected results:
CCM runs continuously, as it does on 4.15:
# oc --kubeconfig cluster-deploys/eu-de-4.15.10-1/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS    RESTARTS   AGE
ibm-cloud-controller-manager-66d4779cb8-gv8d4   1/1     Running   0          63m
ibm-cloud-controller-manager-66d4779cb8-pxdrs   1/1     Running   0          63m
Additional info:
IBM Cloud have a PR open to fix the liveness probe. https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/360
Description of problem:
The BuildConfig form breaks when the Git URL is entered manually after selecting Git as the source type.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the Create BuildConfig form page. 2. Select Git as the source type. 3. Type the Git URL manually; do not paste it or select it from the suggestions.
Actual results:
Console breaks
Expected results:
The console should not break, and the user should be able to create the BuildConfig.
Additional info:
This package is not used within MAPI, but its presence indicates that the operator needs permissions over VNets, specifically to delete VNets. This is a sensitive permission that, if exercised, could lead to an unrecoverable cluster, or to the deletion of other critical infrastructure within the same Azure subscription or resource group that is not related to the cluster itself. This package should be removed, as well as the relevant permissions from the CredentialsRequest.
Tracker issue for bootimage bump in 4.18. This issue should block issues which need a bootimage bump to fix.
Description of problem:
GCP destroy fails to acknowledge the deletion of forwarding rules that have already been removed. Did you intend to change the logic here? The new version appears to ignore the case where the error is http.StatusNotFound (i.e., the resource is already deleted).
time="2024-10-03T23:05:47Z" level=debug msg="Listing regional forwarding rules"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:47Z" level=debug msg="Listing global forwarding rules"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting global forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:48Z" level=debug msg="Listing target pools"
time="2024-10-03T23:05:48Z" level=debug msg="Listing instance groups"
time="2024-10-03T23:05:49Z" level=debug msg="Listing target tcp proxies"
time="2024-10-03T23:05:49Z" level=debug msg="Listing target tcp proxies"
time="2024-10-03T23:05:49Z" level=debug msg="Listing backend services"
time="2024-10-03T23:05:49Z" level=debug msg="Listing backend services"
time="2024-10-03T23:05:49Z" level=debug msg="Deleting backend service a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:49Z" level=info msg="Deleted backend service a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:49Z" level=debug msg="Backend services: 1 global backend service pending"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Looping on destroy
Expected results:
Destroy successful
Additional info:
HIVE team found this bug.
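A minimal sketch of treating a 404 as "already deleted" so the destroy loop can converge, assuming the GCP client surfaces a *googleapi.Error; the helper names are illustrative and this is not the installer's actual code:

```go
package main

import (
	"errors"
	"log"
	"net/http"

	"google.golang.org/api/googleapi"
)

// isNotFound reports whether the GCP API said the resource no longer exists.
func isNotFound(err error) bool {
	var apiErr *googleapi.Error
	return errors.As(err, &apiErr) && apiErr.Code == http.StatusNotFound
}

// deleteForwardingRule wraps a delete call and treats a 404 as success, so a
// rule that was already removed is not retried forever.
func deleteForwardingRule(name string, doDelete func(string) error) error {
	err := doDelete(name)
	if err == nil || isNotFound(err) {
		log.Printf("forwarding rule %s deleted (or already gone)", name)
		return nil
	}
	return err
}

func main() {
	// Simulate a rule that was already removed out of band.
	alreadyGone := func(string) error {
		return &googleapi.Error{Code: http.StatusNotFound}
	}
	if err := deleteForwardingRule("example-rule", alreadyGone); err != nil {
		log.Fatal(err)
	}
}
```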
Description of problem:
Creating a C2S/SC2S cluster via Cluster API fails with the following error:
time="2024-05-06T00:57:17-04:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: infrastructure was not ready within 15m0s: timed out waiting for the condition"
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-05-102537
How reproducible:
Steps to Reproduce:
1. Install a C2S or an SC2S cluster via Cluster API
Actual results:
See description
Expected results:
Cluster could be created successfully on C2S/SC2S
Additional info:
Description of problem:
On the Administrator -> Observe -> Dashboards page, clicking the dropdown lists for "Time Range" and "Refresh Interval" gives no response. On the Observe -> Metrics page (for both Administrator and Developer perspectives), clicking the dropdown beside "Actions" (originally showing "Refresh off") also gives no response. The error "react-dom.production.min.js:101 Uncaught TypeError: r is not a function" appears in the F12 developer console.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-07-200953
How reproducible:
Always
Steps to Reproduce:
1. Refer to description 2. 3.
Actual results:
1. Dropdown list doesn't work well. There is error “react-dom.production.min.js:101 Uncaught TypeError: r is not a function” in F12 developer console.
Expected results:
1. Dropdown list should work fine.
Additional info:
Description of problem:
If multiple NICs are configured in install-config, the installer will provision nodes properly but will fail in bootstrap due to API validation. > 4.17 will support multiple NICs; < 4.17 will not and will fail.
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Starting around the beginning of June, `-bm` (real baremetal) jobs started exhibiting a high failure rate. OCPBUGS-33255 was mentioned as a potential cause, but this was filed much earlier.
The start date for this is pretty clear in Sippy, chart here:
Example job run:
More job runs
Slack thread:
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1722871253737309
Affecting these tests:
install should succeed: overall
install should succeed: cluster creation
install should succeed: bootstrap
Description of problem:
The test case occasionally flakes:
--- FAIL: TestRunGraph (1.04s)
    --- FAIL: TestRunGraph/mid-task_cancellation_with_work_in_queue_does_not_deadlock (0.01s)
        task_graph_test.go:943: unexpected error: [context canceled context canceled]
Version-Release number of selected component (if applicable):
Reproducible with current CVO git master revision 00d0940531743e6a0e8bbba151f68c9031bf0df6
How reproducible:
Reproduces well with --race and multiple iterations.
Steps to Reproduce:
1. go test --count 30 --race ./pkg/payload/...
Actual results:
Some failures
Expected results:
no failures
Additional info:
I have seen this occasionally flake over the last few months and finally isolated it, but I didn't feel like digging into the timing-sensitive test code, so I'm at least filing it.
Description of problem:
On the pages under "Observe" -> "Alerting", "Not found" is shown when no resources are found.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-11-082305
How reproducible:
Steps to Reproduce:
1. Check the tabs under "Observe" -> "Alerting" ("Alerts", "Silences", "Alerting rules") when there are no related resources. 2. 3.
Actual results:
1. 'Not found' is shown under each tab.
Expected results:
1. It would be better to show "No <resource> found", like other resource pages do, e.g. "No Deployments found".
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/162
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In some cases, the tmp files for the resolved prepender are not removed on on-prem platforms.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
When deploying Shift on Stack, check /tmp; we should not see any tmp.XXX files anymore.
Actual results:
tmp files are there
Expected results:
tmp files are removed when not needed anymore
Please review the following PR: https://github.com/openshift/csi-operator/pull/242
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/118
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/125
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in the following test:
[sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.18
Start Time: 2024-08-14T00:00:00Z
End Time: 2024-08-21T23:59:59Z
Success Rate: 94.89%
Successes: 128
Failures: 7
Flakes: 2
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 647
Failures: 0
Flakes: 15
The test is permafailing on the latest payloads on multiple platforms, not just Azure. It does seem to coincide with the arrival of the 4.18 RHCOS images.
{ fail [github.com/openshift/origin/test/extended/cpu_partitioning/crio.go:166]: error getting crio container data from node ci-op-z5sh003f-431b2-r2nm4-master-0
Unexpected error:
    <*errors.errorString | 0xc001e80190>:
    err execing command jq: error (at <stdin>:1): Cannot index array with string "info"
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    {
        s: "err execing command jq: error (at <stdin>:1): Cannot index array with string \"info\"\njq: error (at <stdin>:1): Cannot iterate over null (null)",
    }
occurred
Ginkgo exit error 1: exit with code 1}
The script involved is likely in: https://github.com/openshift/origin/blob/a365380cb3a39cfc26b9f28f04b66418c993a879/test/extended/cpu_partitioning/crio.go#L4
Nightly payloads are fully blocked as multiple blocking aggregated jobs are permafailing this test.
Example failed test:
4/1291 Tests Failed: user system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller in ns/openshift-infra must not produce too many applies {had 7618 applies, check the audit log and operator log to figure out why; details in audit log}
Description of problem:
Some references pointed to files that did not exist, e.g., `NetworkPolicyListPage` in `console-app` and `functionsComponent` in `knative-plugin`.
Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The TestNodePoolReplaceUpgrade e2e test on OpenStack is experiencing common failures like this one: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/4515/pull-ci-openshift-hypershift-main-e2e-openstack/1849445285156098048
After investigating this failure, it looks like the imageRollout on OpenStack completes instantly, which gives the nodepool very little time between the node becoming ready and the nodepool status version being set. The short window causes a failure on this check: https://github.com/openshift/hypershift/blob/6f6a78b7ff2932087b47609c5a16436bad5aeb1c/test/e2e/nodepool_upgrade_test.go#L166
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Flaky test
Steps to Reproduce:
1. Run the openstack e2e 2. 3.
Actual results:
TestNodePoolReplaceUpgrade fails
Expected results:
TestNodePoolReplaceUpgrade passes
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/221
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On "Search" page, search resource Node and filter with label, the filter doesn't work. Similarly, click label in "Node selector" field on one mcp detail page, it won't filter out nodes with this label.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-08-024331
How reproducible:
always
Steps to Reproduce:
1. On "Search" page, choose "Node(core/v1)" resource, filter with any label, eg "test=node","node-role.kubernetes.io/worker" 2. On one mcp details page, click label in "Node selector" field on one mcp detail page. 3.
Actual results:
1. The label filter doesn't work. 2. Nodes are listed without being filtered by label.
Expected results:
1. Nodes should be filtered by label. 2. Only nodes with the label should be shown.
Additional info:
Screenshot: https://drive.google.com/drive/folders/1XZh4MTOzgrzZKIT6HcZ44HFAAip3ENwT?usp=drive_link
slack thread: https://redhat-internal.slack.com/archives/C058TF9K37Z/p1722890745089339?thread_ts=1722872764.429919&cid=C058TF9K37Z
Investigate what happens when machines are deleted when cluster is paused
Description of problem:
The test tries to schedule pods on all workers but fails to schedule on infra nodes:
Warning  FailedScheduling  86s  default-scheduler  0/9 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 6 node(s) didn't match pod anti-affinity rules. preemption: 0/9 nodes are available: 3 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.
$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ostest-b6fns-infra-0-m4v7t    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-pllsf    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-vnbp8    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-master-0         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-2         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-lmlxf-1   Ready    control-plane,master   17h   v1.30.4
ostest-b6fns-worker-0-h527q   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-kpvdx   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-xfcjf   Ready    worker                 19h   v1.30.4
Infra nodes should be excluded from the worker nodes used by the test.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-09-09-173813
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
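A minimal sketch of how the test could exclude infra nodes when selecting workers, assuming a client-go clientset; the selector string and helper name are illustrative, not the test's actual code:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// schedulableWorkers lists nodes labelled as workers while excluding nodes
// that also carry the infra role, so the test only targets nodes its pods
// can actually land on.
func schedulableWorkers(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	opts := metav1.ListOptions{
		// "key" means the label must exist, "!key" means it must not.
		LabelSelector: "node-role.kubernetes.io/worker,!node-role.kubernetes.io/infra",
	}
	nodes, err := client.CoreV1().Nodes().List(ctx, opts)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(nodes.Items))
	for _, n := range nodes.Items {
		names = append(names, n.Name)
	}
	return names, nil
}

func main() {
	fmt.Println("see schedulableWorkers; wire in a real clientset to use it")
}
```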
Description of problem:
The operator cannot successfully remove resources when networkAccess is set to Internal and the management state is then set to Removed. It looks like the authorization error changes from bloberror.AuthorizationPermissionMismatch to bloberror.AuthorizationFailure after the storage account becomes private (networkAccess: Internal). This is caused either by odd behavior in the Azure SDK or in the Azure API itself. The easiest way to solve it is to also handle bloberror.AuthorizationFailure here: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L1145
The error condition is the following:
status:
  conditions:
  - lastTransitionTime: "2024-09-27T09:04:20Z"
    message: "Unable to delete storage container: DELETE https://imageregistrywxj927q6bpj.blob.core.windows.net/wxj-927d-jv8fc-image-registry-rwccleepmieiyukdxbhasjyvklsshhee\n--------------------------------------------------------------------------------\nRESPONSE 403: 403 This request is not authorized to perform this operation.\nERROR CODE: AuthorizationFailure\n--------------------------------------------------------------------------------\n\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>AuthorizationFailure</Code><Message>This request is not authorized to perform this operation.\nRequestId:ababfe86-301e-0005-73bd-10d7af000000\nTime:2024-09-27T09:10:46.1231255Z</Message></Error>\n--------------------------------------------------------------------------------\n"
    reason: AzureError
    status: Unknown
    type: StorageExists
  - lastTransitionTime: "2024-09-27T09:02:26Z"
    message: The registry is removed
    reason: Removed
    status: "True"
    type: Available
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16 (needs confirmation), 4.15 (needs confirmation)
How reproducible:
Always
Steps to Reproduce:
1. Get an Azure cluster 2. In the operator config, set networkAccess to Internal 3. Wait until the operator reconciles the change (watch networkAccess in status with `oc get configs.imageregistry/cluster -oyaml |yq '.status.storage'`) 4. In the operator config, set management state to removed: `oc patch configs.imageregistry/cluster -p '{"spec":{"managementState":"Removed"}}' --type=merge` 5. Watch the cluster operator conditions for the error
Actual results:
Expected results:
Additional info:
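A minimal sketch of the suggested fix, assuming the Azure SDK's bloberror helpers; the function name is illustrative and this is not the operator's actual code:

```go
package azure

import (
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"
)

// isAuthorizationError treats both error codes the same way, because the
// service starts returning AuthorizationFailure instead of
// AuthorizationPermissionMismatch once the storage account is private
// (networkAccess: Internal).
func isAuthorizationError(err error) bool {
	return bloberror.HasCode(err,
		bloberror.AuthorizationPermissionMismatch,
		bloberror.AuthorizationFailure,
	)
}
```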
Description of problem:
4.17: [VSphereCSIDriverOperator] [Upgrade] VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference
A UPI-installed vSphere cluster upgrade failed because the Cluster Storage Operator degraded. Upgrade path: 4.8 -> 4.17.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-10-12-174022
How reproducible:
Always
Steps to Reproduce:
1. Install the OCP cluster on vSphere by UPI with version 4.8. 2. Upgrade the cluster to 4.17 nightly.
Actual results:
In Step 2: The upgrade failed from path 4.16 to 4.17.
Expected results:
In Step 2: The upgrade should be successful.
Additional info:
$ omc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-10-12-102620 True True 1h8m Unable to apply 4.17.0-0.nightly-2024-10-12-174022: wait has exceeded 40 minutes for these operators: storage $ omc get co storage NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE storage 4.17.0-0.nightly-2024-10-12-174022 True True True 15h $ omc get co storage -oyaml ... status: conditions: - lastTransitionTime: "2024-10-13T17:22:06Z" message: |- VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: panic caught: VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_SyncError status: "True" type: Degraded ... $ omc logs vmware-vsphere-csi-driver-operator-5c7db457-nffp4|tail -n 50 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?}) 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2() 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65 2024-10-13T19:00:02.531545739Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9 2024-10-13T19:00:02.534308382Z I1013 19:00:02.532858 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"e44ce388-4878-4400-afae-744530b62281", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Vmware-Vsphere-Csi-Driver-OperatorPanic' Panic observed: runtime error: invalid memory address or nil pointer dereference 2024-10-13T19:00:03.532125885Z E1013 19:00:03.532044 1 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors: 2024-10-13T19:00:03.532125885Z line 1: cannot unmarshal !!seq into config.CommonConfigYAML 2024-10-13T19:00:03.532498631Z I1013 19:00:03.532460 1 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config. 
2024-10-13T19:00:03.532708025Z I1013 19:00:03.532571 1 config.go:283] Config initialized 2024-10-13T19:00:03.533270439Z E1013 19:00:03.533160 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) 2024-10-13T19:00:03.533270439Z goroutine 701 [running]: 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2cf3100, 0x54fd210}) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:75 +0x85 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0014c54e8, 0x1, 0xc000e7e1c0?}) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:49 +0x6b 2024-10-13T19:00:03.533270439Z panic({0x2cf3100?, 0x54fd210?}) 2024-10-13T19:00:03.533270439Z runtime/panic.go:770 +0x132 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).createVCenterConnection(0xc0008b2788, {0xc0022cf600?, 0xc0014c57c0?}, 0xc0006a3448) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:491 +0x94 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).loginToVCenter(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, 0x3377a7c?) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:446 +0x5e 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).sync(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, {0x38ee700, 0xc0011d08d0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:240 +0x6fc 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}, {0x38ee700?, 0xc0011d08d0?}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:201 +0x43 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).processNextWorkItem(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:260 +0x1ae 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker.func1({0x3900f30, 0xc0000b9ae0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:192 +0x89 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1() 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x1f 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002bb1e80?) 
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:226 +0x33 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0014c5f10, {0x38cf7e0, 0xc00142b470}, 0x1, 0xc0013ae960) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:227 +0xaf 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00115bf10, 0x3b9aca00, 0x0, 0x1, 0xc0013ae960) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:204 +0x7f 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x3900f30, 0xc0000b9ae0}, 0xc00115bf70, 0x3b9aca00, 0x0, 0x1) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x93 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:170 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2() 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65 2024-10-13T19:00:03.533270439Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9
Starting about 5/24 or 5/25, we see a massive increase in the number of watch establishments from all clients to the kube-apiserver during non-upgrade jobs. While this could theoretically mean that every single client merged a bug on the same day, the more likely explanation is that the Kubernetes update exposed or introduced some kind of bug.
This is a clear regression and it is only present on 4.17, not 4.16. It is present across all platforms, though I've selected AWS for links and screenshots.
slack thread if there are questions
courtesy screen shot
Our CI job is currently down with an error in CAPO (our CAPI provider): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.18-periodics-e2e-openstack-conformance/1855132754782457856/artifacts/e2e-openstack-conformance/dump/artifacts/namespaces/clusters-0183964f0514bc3aee5c/core/pods/logs/capi-provider-5988b8b87c-q5zwq-manager.log
We are missing a CRD and we probably need to add it
Description of problem:
After changing the LB type from CLB to NLB, "status.endpointPublishingStrategy.loadBalancer.providerParameters.aws.classicLoadBalancer" is still present, but if a new NLB ingresscontroller is created, "classicLoadBalancer" does not appear.
// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:          <<<<
  connectionIdleTimeout: 0s   <<<<
networkLoadBalancer: {}
type: NLB
// create new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133
How reproducible:
100%
Steps to Reproduce:
1. Change the default ingresscontroller to NLB:
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"providerParameters":{"type":"AWS","aws":{"type":"NLB"}},"scope":"External"}}}}'
2. Create a new ingresscontroller with NLB:
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: nlb.<base-domain>
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService
3. Check the status of both ingresscontrollers.
Actual results:
// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:
  connectionIdleTimeout: 0s
networkLoadBalancer: {}
type: NLB
// new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
Expected results:
If type=NLB, then "classicLoadBalancer" should not appear in the status, and the status should be consistent whether an existing ingresscontroller is changed to NLB or a new one is created with NLB.
Additional info:
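A minimal sketch of the expected status handling, assuming the openshift/api operator v1 types; this is illustrative, not the ingress operator's actual reconciliation code:

```go
package ingress

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// normalizeAWSStatus drops the stale classicLoadBalancer block from status
// once the effective load balancer type is NLB, so a converted default
// ingresscontroller reports the same shape as a freshly created NLB one.
func normalizeAWSStatus(aws *operatorv1.AWSLoadBalancerParameters) {
	if aws == nil {
		return
	}
	if aws.Type == operatorv1.AWSNetworkLoadBalancer {
		aws.ClassicLoadBalancerParameters = nil
	}
}
```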
Description of problem:
Compared with the same behavior on OCP 4.17, the shortname search function on OCP 4.18 is not working.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-16-094159
How reproducible:
Always
Steps to Reproduce:
1. Create a CRD resource with https://github.com/medik8s/fence-agents-remediation/blob/main/config/crd/bases/fence-agents-remediation.medik8s.io_fenceagentsremediationtemplates.yaml 2. Navigate to the Home -> Search page. 3. Use the shortname 'FAR' to search for the created resource 'FenceAgentsRemediationTemplates'. 4. Search with the shortname 'AM', for example.
Actual results:
3. "No results found" is returned. 4. The first result in the dropdown list is 'Config (sample.operator.openshift)', which is incorrect.
Expected results:
3. The resource 'FenceAgentsRemediationTemplates' should be listed in the dropdown. 4. The first result in the dropdown should be 'Alertmanager'.
Additional info:
Please review the following PR: https://github.com/openshift/frr/pull/64
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
{ fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:134]: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta3.flowcontrol.apiserver.k8s.io 6 times
All jobs failed on https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-upgrade-4.18-minor-release-openshift-release-analysis-aggregator/1846018782808510464
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1283
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/319
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-azure-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
When deploying 4.16, the customer identified an inbound-rule security risk for the "node" security group allowing access from 0.0.0.0/0 to the node port range 30000-32767. This issue did not exist in versions prior to 4.16, and we suspect it may be a regression. It seems to be related to the use of CAPI, which could have changed the behavior. We are trying to understand why this was allowed.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Install 4.16 cluster *** On 4.12 installations, this is not the case ***
Actual results:
The installer configures an inbound rule for the node security group allowing access from 0.0.0.0/0 for port range 30000-32767.
Expected results:
The installer should *NOT* create an inbound security rule allowing access to node port range 30000-32767 from any CIDR range (0.0.0.0/0)
Additional info:
#forum-ocp-cloud slack discussion: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1728484197441409
Relevant Code :
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/v2.4.0/pkg/cloud/services/securitygroup/securitygroups.go#L551
Description of problem:
Despite passing in '--attach-default-network false', the nodepool still has attachDefaultNetwork: true
hcp create cluster kubevirt --name ocp-lab-int-6 --base-domain paas.com --cores 6 --memory 64Gi --additional-network "name:default/ppcore-547" --attach-default-network false --cluster-cidr 100.64.0.0/20 --service-cidr 100.64.16.0/20 --network-type OVNKubernetes --node-pool-replicas 3 --ssh-key ~/deploy --pull-secret pull-secret.txt --release-image quay.io/openshift-release-dev/ocp-release:4.16.18-x86_64

platform:
  kubevirt:
    additionalNetworks:
    - name: default/ppcore-547
    attachDefaultNetwork: true
Version-Release number of selected component (if applicable):
Client Version: openshift/hypershift: b9e977da802d07591cd9fb8ad91ba24116f4a3a8. Latest supported OCP: 4.17.0 Server Version: b9e977da802d07591cd9fb8ad91ba24116f4a3a8 Server Supports OCP Versions: 4.17, 4.16, 4.15, 4.14
How reproducible:
Steps to Reproduce:
1. hcp install as per the above 2. 3.
Actual results:
The default network is attached
Expected results:
No default network
Additional info:
Description of problem:
When running on a FIPS-enabled cluster, the e2e test TestFirstBootHasSSHKeys times out.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open a PR to the MCO repository. 2. Run the e2e-aws-ovn-fips-op job by commenting /test e2e-aws-ovn-fips-op (this job does not run automatically). 3. Eventually, the test will fail.
Actual results:
=== RUN TestFirstBootHasSSHKeys1065mcd_test.go:1019: did not get new node --- FAIL: TestFirstBootHasSSHKeys (1201.83s)
Expected results:
=== RUN TestFirstBootHasSSHKeys mcd_test.go:929: Got ssh key file data: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX --- PASS: TestFirstBootHasSSHKeys (334.86s)
Additional info:
It looks like we're hitting a 20-minute timeout during the test. By comparison, the passing case executes in approximately 5.5 minutes. I have two preliminary hypotheses for this: 1. This operation takes longer in FIPS-enabled clusters for some reason. 2. It is possible that this is occurring due to a difference in which cloud these tests run on. Our normal e2e-gcp-op tests run in GCP, whereas this test suite runs in AWS. The underlying operations performed by the Machine API may just take longer in AWS than they do in GCP. If that is the case, this bug can be resolved as-is.
Must-Gather link: https://drive.google.com/file/d/12GhTIP9bgcoNje0Jvyhr-c-akV3XnGn2/view?usp=sharing
Error from SNYK code:
✗ [High] Cross-site Scripting (XSS)
Path: ignition-server/cmd/start.go, line 250
Info: Unsanitized input from an HTTP header flows into Write, where it is used to render an HTML page returned to the user. This may result in a Reflected Cross-Site Scripting attack (XSS).
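A minimal sketch of the usual remediation, escaping the reflected header value before writing it into the response; the handler and header name below are hypothetical, not the actual ignition-server code:

```go
package main

import (
	"fmt"
	"html"
	"net/http"
)

// ignitionHandler echoes a request header back to the client. Escaping the
// value prevents header-controlled markup from being interpreted as HTML.
func ignitionHandler(w http.ResponseWriter, r *http.Request) {
	// Hypothetical header; the SNYK finding only says "an HTTP header".
	requested := r.Header.Get("TargetConfigVersionHash")
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprintf(w, "<p>unknown config version: %s</p>", html.EscapeString(requested))
}

func main() {
	http.HandleFunc("/ignition", ignitionHandler)
	_ = http.ListenAndServe(":8080", nil)
}
```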
Enabling FIPS results in an error during machine-os-images /bin/copy-iso:
/bin/copy-iso: line 29: [: missing `]'
We need the ability to define a different image for the iptables CLI image because it is located in the data plane.
Description of problem:
The namespace value on the Ingress details page is incorrect.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-10-234322
How reproducible:
Always
Steps to Reproduce:
1. Create a sample Ingress in the default namespace. 2. Navigate to Networking -> Ingresses -> the Ingress details page (/k8s/ns/default/ingresses/<ingress sample name>). 3. Check the Namespace value.
Actual results:
It shows the Ingress name, which is incorrect.
Expected results:
It should show the namespace stored in metadata.namespace.
Additional info:
Description of problem:
In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared. This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role. Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared:
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden)
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden)
It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared. This bug prevents users from successfully creating instances from templates in the WebConsole.
Version-Release number of selected component (if applicable):
4.15 4.14
How reproducible:
YES
Steps to Reproduce:
1. Log in with a non-administrator account. 2. Select a template from the developer catalog and click on Instantiate Template. 3. Enter values into the initially empty form. 4. Wait for several seconds, and the entered values will disappear.
Actual results:
Entered values disappear.
Expected results:
Entered values remain in the form.
Additional info:
I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.
Router pods use the "hostnetwork" SCC even when they do not use the host network.
All versions of OpenShift from 4.11 through 4.17.
100%.
1. Install a new cluster with OpenShift 4.11 or later on a cloud platform.
The router-default pods do not use the host network, yet they use the "hostnetwork" SCC:
% oc -n openshift-ingress get pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o go-template --template='{{range .items}}{{.metadata.name}} {{with .metadata.annotations}}{{index . "openshift.io/scc"}}{{end}} {{.spec.hostNetwork}}{{"\n"}}{{end}}'
router-default-5ffd4ff7cd-mhhv6 hostnetwork <no value>
router-default-5ffd4ff7cd-wmqnj hostnetwork <no value>
%
The router-default pods should use the "restricted" SCC.
We missed this change from the OCP 4.11 release notes:
The restricted SCC is no longer available to users of new clusters, unless the access is explicitly granted. In clusters originally installed in OpenShift Container Platform 4.10 or earlier, all authenticated users can use the restricted SCC when upgrading to OpenShift Container Platform 4.11 and later.
Artifacts from CI jobs confirm that router pods used "restricted" for new 4.10 clusters and for 4.10→4.11 upgraded clusters, and "hostnetwork" for new 4.11 clusters:
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1790552355406614528/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "restricted" "restricted" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1790422949342220288/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "restricted" "restricted" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1793013806733987840/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "restricted" "restricted" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1793013781534609408/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade/1793670820518694912/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1793670819998601216/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1793062832263139328/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" %
Description of problem:
The disk and instance types for GCP machines should be validated further. The current implementation validates each individually, but the disk types and instance types should also be checked against each other for valid combinations. The attached spreadsheet displays the combinations of valid disk and instance types.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
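A minimal sketch of cross-validating the two fields with a lookup table; the table entries below are placeholders, not the support matrix from the attached spreadsheet:

```go
package main

import (
	"fmt"
	"strings"
)

// supportedDiskTypes maps an instance family prefix to the disk types it
// accepts. The entries are illustrative placeholders only.
var supportedDiskTypes = map[string][]string{
	"n2": {"pd-balanced", "pd-ssd", "pd-standard"},
	"c3": {"pd-balanced", "pd-ssd"},
}

// validateCombination checks the disk type against the instance family
// instead of validating each field in isolation.
func validateCombination(instanceType, diskType string) error {
	family := strings.SplitN(instanceType, "-", 2)[0]
	disks, ok := supportedDiskTypes[family]
	if !ok {
		return fmt.Errorf("unknown instance family %q", family)
	}
	for _, d := range disks {
		if d == diskType {
			return nil
		}
	}
	return fmt.Errorf("disk type %q is not supported by instance type %q", diskType, instanceType)
}

func main() {
	fmt.Println(validateCombination("n2-standard-4", "pd-ssd"))
}
```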
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/296
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
1. We are making two API calls to get the logs for PipelineRuns. Instead, we can use the `results.tekton.dev/record` annotation and replace `records` in the annotation value with `logs` to get the logs of the PipelineRuns (see the sketch below). 2. Tekton Results returns only the v1 version of PipelineRun and TaskRun from Pipelines 1.16, so the data type has to be v1 for 1.16 and v1beta1 for lower versions.
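A minimal sketch of deriving the logs endpoint from the annotation by swapping the path segment, as the first point suggests; written in Go purely for illustration, since the actual change lands in the console frontend, and the sample annotation value is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// logsPathFromRecord turns a Tekton Results record path taken from the
// results.tekton.dev/record annotation into the matching logs path by
// replacing the "records" segment with "logs".
func logsPathFromRecord(recordPath string) string {
	return strings.Replace(recordPath, "/records/", "/logs/", 1)
}

func main() {
	// Hypothetical annotation value, shaped like <parent>/results/<result>/records/<record>.
	record := "ns/results/0e6a5bd1/records/4c5e9f2a"
	fmt.Println(logsPathFromRecord(record)) // ns/results/0e6a5bd1/logs/4c5e9f2a
}
```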
Description of problem:
documentationBaseURL still points to 4.17
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-23-112324
How reproducible:
Always
Steps to Reproduce:
1. Check documentationBaseURL on a 4.18 cluster:
$ oc get cm console-config -n openshift-console -o yaml | grep documentation
documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.17/
Actual results:
documentationBaseURL still links to 4.17
Expected results:
documentationBaseURL should link to 4.18
Additional info:
Description of the problem:
Unbinding s390x (Z) hosts no longer reboots them into discovery. Instead the reclaim agent runs on the node and continuously reboots them.
How reproducible:
Steps to reproduce:
1. Boot Z hosts with discovery image and install them to a cluster (original issue did so with hypershift)
2. Unbind the hosts from the cluster (original issue scaled down nodepool) and watch as the hosts constantly reboot (not into discovery)
Actual results:
Hosts are not reclaimed, unbound, and ready to be used again. Instead they are stuck and constantly reboot.
Expected results:
Hosts are unbound and ready to be used.
Additional information
Contents of RHCOS boot config files
# cat ostree-1-rhcos.conf
title Red Hat Enterprise Linux CoreOS 415.92.202311241643-0 (Plow) (ostree:1)
version 1
options ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/0 root=UUID=36ac8acd-bf01-40e4-8043-3682716e3b91 rw rootflags=prjquota boot=UUID=879d4744-c4b2-4cd3-a4a3-ca601d7dadd7
linux /ostree/rhcos-5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/vmlinuz-5.14.0-284.41.1.el9_2.s390x
initrd /ostree/rhcos-5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/initramfs-5.14.0-284.41.1.el9_2.s390x.img
aboot /ostree/deploy/rhcos/deploy/01b96f07863b8bf16cb4e9a187fefe5bcc1b443a825a503355a1f658a2e856d7.0/usr/lib/ostree-boot/aboot.img
abootcfg /ostree/deploy/rhcos/deploy/01b96f07863b8bf16cb4e9a187fefe5bcc1b443a825a503355a1f658a2e856d7.0/usr/lib/ostree-boot/aboot.cfg
$ cat ostree-2-rhcos.conf
title Red Hat Enterprise Linux CoreOS 415.92.202312250243-0 (Plow) (ostree:0)
version 2
options ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/0 root=UUID=36ac8acd-bf01-40e4-8043-3682716e3b91 rw rootflags=prjquota boot=UUID=879d4744-c4b2-4cd3-a4a3-ca601d7dadd7 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1
linux /ostree/rhcos-1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/vmlinuz-5.14.0-284.45.1.el9_2.s390x
initrd /ostree/rhcos-1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/initramfs-5.14.0-284.45.1.el9_2.s390x.img
aboot /ostree/deploy/rhcos/deploy/90229475c67473a16f77b3679a5b7a3d90d268d70adf24668f14cf00c06d83e5.1/usr/lib/ostree-boot/aboot.img
abootcfg /ostree/deploy/rhcos/deploy/90229475c67473a16f77b3679a5b7a3d90d268d70adf24668f14cf00c06d83e5.1/usr/lib/ostree-boot/aboot.cfg
Interesting journal log
Feb 15 16:51:07 localhost kernel: Kernel command line: ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842>
Feb 15 16:51:07 localhost kernel: Unknown kernel command line parameters "ostree=/ostree/boot.1/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c>
See attached images for reclaim files
Please review the following PR: https://github.com/openshift/azure-kubernetes-kms/pull/8
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This is a placeholder for Hypershift PR(s) related to bumping CAPO to v0.11.0.
Description of problem:
When using an amd64 release image and setting the multi-arch flag to false, HCP CLI cannot create a HostedCluster. The following error happens: /tmp/hcp create cluster aws --role-arn arn:aws:iam::460538899914:role/cc1c0f586e92c42a7d50 --sts-creds /tmp/secret/sts-creds.json --name cc1c0f586e92c42a7d50 --infra-id cc1c0f586e92c42a7d50 --node-pool-replicas 3 --base-domain origin-ci-int-aws.dev.rhcloud.com --region us-east-1 --pull-secret /etc/ci-pull-credentials/.dockerconfigjson --namespace local-cluster --release-image registry.build01.ci.openshift.org/ci-op-0bi6jr1l/release@sha256:11351a958a409b8e34321edfc459f389058d978e87063bebac764823e0ae3183 2024-08-29T06:23:25Z ERROR Failed to create cluster {"error": "release image is not a multi-arch image"} github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1 /remote-source/app/product-cli/cmd/cluster/aws/create.go:35 github.com/spf13/cobra.(*Command).execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /remote-source/app/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /remote-source/app/vendor/github.com/spf13/cobra/command.go:1032 main.main /remote-source/app/product-cli/main.go:59 runtime.main /usr/lib/golang/src/runtime/proc.go:271 Error: release image is not a multi-arch image release image is not a multi-arch image
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Try to create a HC with an amd64 release image and multi-arch flag set to false
Actual results:
HC does not create and this error is displayed: Error: release image is not a multi-arch image release image is not a multi-arch image
Expected results:
HC should create without errors
Additional info:
This bug seems to have occurred as a result of HOSTEDCP-1778 and this line: https://github.com/openshift/hypershift/blob/e2f75a7247ab803634a1cc7f7beaf99f8a97194c/cmd/cluster/aws/create.go#L520
Description of problem:
The control loop that manages /var/run/keepalived/iptables-rule-exists looks at the error returned by os.Stat and decides that the file exists as long as os.IsNotExist returns false. In other words, if the error is some non-nil error other than NotExist, the sentinel file would not be created.
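A minimal sketch of the distinction at play, assuming the check is built directly on os.Stat (this is not the actual keepalived monitor code; the sentinel path is taken from the report): only a nil error proves the file exists, and any other error should be handled explicitly rather than being treated as "exists".

package main

import (
	"log"
	"os"
)

const sentinel = "/var/run/keepalived/iptables-rule-exists"

// sentinelExists distinguishes the three outcomes of os.Stat instead of
// collapsing them into "IsNotExist or not".
func sentinelExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil // only a nil error proves the file is there
	}
	if os.IsNotExist(err) {
		return false, nil // definitely absent
	}
	// Permission or I/O errors land here; treating this as "exists" (which is
	// what the described check effectively does) silently skips creation.
	return false, err
}

func main() {
	exists, err := sentinelExists(sentinel)
	if err != nil {
		log.Printf("could not check %s: %v", sentinel, err)
		return
	}
	if !exists {
		if f, err := os.Create(sentinel); err != nil {
			log.Printf("could not create %s: %v", sentinel, err)
		} else {
			f.Close()
		}
	}
}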
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The "oc adm node-image create" command sometimes throw a "image can't be pulled" error the first time the command is executed against a cluster. Example: +(./agent/07_agent_add_node.sh:138): case "${AGENT_E2E_TEST_BOOT_MODE}" in +(./agent/07_agent_add_node.sh:42): oc adm node-image create --dir ocp/ostest/add-node/ --registry-config /opt/dev-scripts/pull_secret.json --loglevel=2 I1108 05:09:07.504614 85927 create.go:406] Starting command in pod node-joiner-4r4hq I1108 05:09:07.517491 85927 create.go:826] Waiting for pod **snip** I1108 05:09:39.512594 85927 create.go:826] Waiting for pod I1108 05:09:39.512634 85927 create.go:322] Printing pod logs Error from server (BadRequest): container "node-joiner" in pod "node-joiner-4r4hq" is waiting to start: image can't be pulled
Version-Release number of selected component (if applicable):
4.18
How reproducible:
sometimes
Steps to Reproduce:
1. Install a new cluster 2. Run "oc adm node-image create" to create an image 3.
Actual results:
Error from server (BadRequest): container "node-joiner" in pod "node-joiner-4r4hq" is waiting to start: image can't be pulled
Expected results:
No errors
Additional info:
The error occurs the first time the command is executed. If the command is retried, it succeeds.
Description of problem:
Nodes cannot recover when the worker role is missing from the custom MCP: all of the configuration is missing on the node, and the kubelet and crio services cannot start.
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
Steps to Reproduce:
1. Create a custom MCP without worker role
$ cat mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-t
  generation: 3
  name: 80-user-kernal
spec: {}
$ cat mcp.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-t
spec:
  configuration:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker-t
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-t: ""
$ oc create -f mc.yaml
$ oc create -f mcp.yaml
2. Add label worker-t to worker03
$ oc get no
NAME STATUS ROLES AGE VERSION
master01.ocp4.danliu.com Ready master 454d v1.27.13+e709aa5
master02.ocp4.danliu.com Ready master 453d v1.27.13+e709aa5
master03.ocp4.danliu.com Ready master 453d v1.27.13+e709aa5
worker01.ocp4.danliu.com Ready worker 453d v1.27.13+e709aa5
worker02.ocp4.danliu.com Ready worker 51d v1.27.13+e709aa5
worker03.ocp4.danliu.com Ready worker,worker-t 69d v1.27.13+e709aa5
$ oc label nodes worker03.ocp4.danliu.com node-role.kubernetes.io/worker-t=
node/worker03.ocp4.danliu.com labeled
Actual results:
worker03 runs into NotReady status; kubelet and crio cannot start.
Expected results:
The MC sync should be prevented when the worker role is missing from the pool.
Additional info:
In previous versions (4.13 & 4.12), the task got stuck with the error below:
Marking Unreconcilable due to: can't reconcile config rendered-worker-8f464eb07d2e2d2fbdb84ab2204fea65 with rendered-worker-t-5b6179e2fb4fedb853c900504edad9ce: ignition passwd user section contains unsupported changes: user core may not be deleted
Description of problem:
The customer is unable to scale a DeploymentConfig in an RHOCP 4.14.21 cluster. If they scale a DeploymentConfig they get the error: "New size: 4; reason: cpu resource utilization (percentage of request) above target; error: Internal error occurred: converting (apps.DeploymentConfig) to (v1beta1.Scale): unknown conversion"
Version-Release number of selected component (if applicable):
4.14.21
How reproducible:
N/A
Steps to Reproduce:
1. deploy apps using DC 2. configure an admission webhook matching the dc/scale subresource 3. create HPA 4. observe pods unable to scale. Also manual scaling fails
Actual results:
Pods are not getting scaled
Expected results:
Pods should be scaled using HPA
Additional info:
Description of problem:
Additional IBM Cloud Services require the ability to override their service endpoints within the Installer. The list of available services provided in openshift/api must be expanded to account for this.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an install-config for IBM Cloud 2. Define serviceEndpoints, including one for "resourceCatalog" 3. Attempt to run IPI
Actual results:
Expected results:
Successful IPI installation, using additional IBM Cloud Service endpoint overrides.
Additional info:
IBM Cloud is working on multiple patches to incorporate these additional services. The full list is still a work in progress, but currently includes:
- Resource (Global) Catalog endpoint
- COS Config endpoint
Changes are currently required in the following components; separate Jiras may be opened (if required) to track their progress:
- openshift/api
- openshift-installer
- openshift/cluster-image-registry-operator
Description of problem:
When we add a user CA bundle to a cluster that has MCPs with yum-based RHEL nodes, the MCPs with RHEL nodes are degraded.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.17.0-0.nightly-2024-08-18-131731 True False 101m Cluster version is 4.17.0-0.nightly-2024-08-18-131731
How reproducible:
Always. In CI we found this issue running test case "[sig-mco] MCO security Author:sregidor-NonHyperShiftHOST-High-67660-MCS generates ignition configs with certs [Disruptive] [Serial]" on prow job periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-workers-rhel8-fips-f28-destructive
Steps to Reproduce:
1. Create a certificate $ openssl genrsa -out privateKey.pem 4096 $ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com" 2. Add the certificate to the cluster # Create the configmap with the certificate $ oc create cm cm-test-cert -n openshift-config --from-file=ca-bundle.crt configmap/cm-test-cert created #Configure the proxy with the new test certificate $ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": "cm-test-cert"}}}' proxy.config.openshift.io/cluster patched 3. Check the MCP status and the MCD logs
Actual results:
The MCP is degraded $ oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-3251b00997d5f49171e70f7cf9b64776 True False False 3 3 3 0 130m worker rendered-worker-05e7664fa4758a39f13a2b57708807f7 False True True 3 0 0 1 130m We can see this message in the MCP - lastTransitionTime: "2024-08-19T11:00:34Z" message: 'Node ci-op-jr7hwqkk-48b44-6mcjk-rhel-1 is reporting: "could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.\n: exit status 5"' reason: 1 nodes are reporting degraded status on sync status: "True" type: NodeDegraded In the MCD logs we can see: I0819 11:38:55.089991 7239 update.go:2665] Removing SIGTERM protection E0819 11:38:55.090067 7239 writer.go:226] Marking Degraded due to: could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.
Expected results:
No degradation should happen. The certificate should be added without problems.
Additional info:
Description of problem:
When a cluster-admin user or a normal user tries to create the first networkpolicy resource for a project, clicking on `affected pods` before submitting the creation form results in an error.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-27-051932
How reproducible:
Always
Steps to Reproduce:
1. Open Networking -> NetworkPolicies, normal user or cluster-admin user tries to create the first networkpolicy resource into one project 2. on Form view, click on `affected pods` button before hit on 'Create' button 3.
Actual results:
2. For a cluster-admin user, we see the error "Cannot set properties of undefined (setting 'tabIndex')". For a normal user, we see "undefined has no properties".
Expected results:
no errors
Additional info:
Description of problem:
HCP cluster is being updated but the nodepool is stuck updating: ~~~ NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE nodepool-dev-cluster dev 2 2 False False 4.15.22 True True ~~~
Version-Release number of selected component (if applicable):
Hosting OCP cluster 4.15 HCP 4.15.23
How reproducible:
N/A
Steps to Reproduce:
1. 2. 3.
Actual results:
Nodepool stuck in upgrade
Expected results:
Upgrade success
Additional info:
I have found this error repeating continually in the ignition-server pods: ~~~ {"level":"error","ts":"2024-08-20T09:02:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-nodepool-dev-cluster-3146da34","namespace":"dev-dev"},"namespace":"dev-dev","name":"token-nodepool-dev-cluster-3146da34","reconcileID":"ec1f0a7f-1657-4245-99ef-c984977ff0f8","error":"error getting ignition payload: failed to download binaries: failed to extract image file: failed to extract image file: file not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"discovered machine-config-operator image","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"created working directory","dir":"/payloads/get-payload4089452863"} {"level":"info","ts":"2024-08-20T09:02:28Z","logger":"get-payload","msg":"extracted image-references","time":"8s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"extracted templates","time":"10s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"image-cache","msg":"retrieved cached file","imageRef":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede","file":"usr/lib/os-release"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"read os-release","mcoRHELMajorVersion":"8","cpoRHELMajorVersion":"9"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"copying file","src":"usr/bin/machine-config-operator.rhel9","dest":"/payloads/get-payload4089452863/bin/machine-config-operator"} ~~~
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/118
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-machine-api-provider-azure-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Since about 4 days ago, the techpreview jobs have been failing on MCO namespace: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.18/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial%22%7D%5D%7D Example run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843057579794632704 The daemons appear to be applying MCN's too early in the process, which causes it to degrade for a few loops: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1842877807659585536/artifacts/e2e-aws-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-79f7s_machine-config-daemon.log This is semi-blocking techpreview jobs and should be fixed high priority. This shouldn't be blocking release as MCN is not GA and likely won't be in 4.18.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This PR introduces graceful shutdown functionality to the Multus daemon by adding a /readyz endpoint alongside the existing /healthz. The /readyz endpoint starts returning 500 once a SIGTERM is received, indicating the daemon is in shutdown mode. During this time, CNI requests can still be processed for a short window. The daemonset configs have been updated to increase terminationGracePeriodSeconds from 10 to 30 seconds, ensuring we have a bit more time for these clean shutdowns. This addresses a race condition during pod transitions where the readiness check might return true, but a subsequent CNI request could fail if the daemon shuts down too quickly. By introducing the /readyz endpoint and delaying the shutdown, we can handle ongoing CNI requests more gracefully, reducing the risk of disruptions during critical transitions.
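A rough sketch of the pattern being described, with made-up port and timing values (this is not the Multus source): /healthz stays green while /readyz flips to 500 after SIGTERM, so the process can drain in-flight requests before exiting.

package main

import (
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var terminating atomic.Bool

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // liveness is unaffected by shutdown
	})
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			// Signal "not ready" so no new CNI traffic is routed here.
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	go func() { log.Fatal(http.ListenAndServe(":9091", nil)) }() // port is an arbitrary example

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig
	terminating.Store(true)
	// Keep serving in-flight requests during a drain window; the pod's
	// terminationGracePeriodSeconds must be longer than this sleep.
	time.Sleep(25 * time.Second)
}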
Version-Release number of selected component (if applicable):
How reproducible:
Difficult to reproduce, might require CI signal
Description of problem:
Console and OLM engineering and BU have decided to remove the Extension Catalog navigation item until the feature has matured more.
Description of problem:
cluster-openshift-apiserver-operator is still in 1.29 and should be updated to 1.30 to reduce conflicts and other issues
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a part of deploying SNO clusters in the field based on the IBI install process we need a way to apply NODE labels to the resulting cluster. As an example, once the cluster has had an IBI config applied to it, it should have a node label of "edge.io/isedgedevice: true" ... the label is only an example, and the user should have the ability to add one or more labels to the resulting node.
See: https://redhat-internal.slack.com/archives/C05JHD9QYTC/p1730298666011899 for additional context.
Description of problem:
While accessing the node terminal of the cluster from the web console, the warning message below is observed. ~~~ Admission Webhook WarningPod master-0.americancluster222.lab.psi.pnq2.redhat.com-debug violates policy 299 - "metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]" ~~~ Note: This is not impacting the cluster; however, it is creating confusion among customers due to the warning message.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Every time.
Steps to Reproduce:
1. Install cluster of version 4.16.11 2. Upgrade the cluster from web-console to the next-minor version 4.16.13 3. Try to access the node terminal from UI
Actual results:
Showing warning while accessing the node terminal.
Expected results:
Does not show any warning.
Additional info:
Please review the following PR: https://github.com/openshift/hypershift/pull/4672
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The pod of a catalogsource without registryPoll is not recreated during node failure.
jiazha-mac:~ jiazha$ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-rcs64 1/1 Running 0 123m community-operators-8mxh6 1/1 Running 0 123m marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m qe-app-registry-5jxlx 1/1 Running 0 106m redhat-marketplace-4bgv9 1/1 Running 0 123m redhat-operators-ww5tb 1/1 Running 0 123m test-2xvt8 1/1 Terminating 0 12m jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-2xvt8 1/1 Running 0 7m6s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none> jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc NAME STATUS ROLES AGE VERSION qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 116m v1.30.2+421e90e
Version-Release number of selected component (if applicable):
Cluster version is 4.17.0-0.nightly-2024-07-07-131215
How reproducible:
always
Steps to Reproduce:
1. create a catalogsource without the registryPoll configure. jiazha-mac:~ jiazha$ cat cs-32183.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: test namespace: openshift-marketplace spec: displayName: Test Operators image: registry.redhat.io/redhat/redhat-operator-index:v4.16 publisher: OpenShift QE sourceType: grpc jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml catalogsource.operators.coreos.com/test created jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-2xvt8 1/1 Running 0 3m18s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none> 2. Stop the node jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc Temporary namespace openshift-debug-q4d5k is created for debugging node... Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ... To use host binaries, run `chroot /host` Pod IP: 10.0.128.5 If you don't see a command prompt, try pressing enter. sh-5.1# chroot /host sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet Removing debug pod ... Temporary namespace openshift-debug-q4d5k was removed. jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc NAME STATUS ROLES AGE VERSION qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 115m v1.30.2+421e90e 3. check it this catalogsource's pod recreated.
Actual results:
No new pod was generated.
jiazha-mac:~ jiazha$ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-rcs64 1/1 Running 0 123m community-operators-8mxh6 1/1 Running 0 123m marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m qe-app-registry-5jxlx 1/1 Running 0 106m redhat-marketplace-4bgv9 1/1 Running 0 123m redhat-operators-ww5tb 1/1 Running 0 123m test-2xvt8 1/1 Terminating 0 12m
Once the node recovered, a new pod was generated.
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME STATUS ROLES AGE VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc Ready worker 127m v1.30.2+421e90e
jiazha-mac:~ jiazha$ oc get pods
NAME READY STATUS RESTARTS AGE
certified-operators-rcs64 1/1 Running 0 127m
community-operators-8mxh6 1/1 Running 0 127m
marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (121m ago) 140m
qe-app-registry-5jxlx 1/1 Running 0 109m
redhat-marketplace-4bgv9 1/1 Running 0 127m
redhat-operators-ww5tb 1/1 Running 0 127m
test-wqxvg 1/1 Running 0 27s
Expected results:
During the node failure, a new catalog source pod should be generated.
Additional info:
Hi Team,
After investigating the operator-lifecycle-manager source code further, we figured out the reason.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
We verified that the catalog pod can be recreated on another node if we add the registryPoll configuration to the catalogsource as follows (the lines marked with <==).
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
  updateStrategy:    <==
    registryPoll:    <==
      interval: 10m  <==
However, registryPoll is not a mandatory field for a catalogsource, so the commit [1] that tries to fix the issue in EnsureRegistryServer() is not the proper fix; see the illustrative sketch after the reference links below.
[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
[2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
[3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
[4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html
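The following is a hypothetical, heavily simplified illustration of the behaviour described above, not the actual OLM reconciler: when the dead-pod handling is reachable only for catalogs that configure updateStrategy.registryPoll, a plain grpc CatalogSource never gets its registry pod rescheduled after a node failure.

package main

import "fmt"

type registryPoll struct{ interval string }

type catalogSource struct {
	name         string
	registryPoll *registryPoll // nil when spec.updateStrategy.registryPoll is unset
	podHealthy   bool
}

func ensureRegistryServer(cs *catalogSource) {
	if cs.registryPoll == nil {
		// Without the polling path there is no code path that replaces a pod
		// stuck in Terminating on a NotReady node.
		fmt.Printf("%s: no registryPoll configured, pod left as-is\n", cs.name)
		return
	}
	if !cs.podHealthy {
		fmt.Printf("%s: registry pod unhealthy, recreating it\n", cs.name)
		cs.podHealthy = true
	}
}

func main() {
	ensureRegistryServer(&catalogSource{name: "test", podHealthy: false})
	ensureRegistryServer(&catalogSource{
		name:         "redhat-operator-index",
		registryPoll: &registryPoll{interval: "10m"},
		podHealthy:   false,
	})
}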
Observed in
There was a delay provisioning one of the master nodes; we should figure out why this is happening and whether it can be prevented.
From the Ironic logs, there was a 5-minute delay during cleaning; on the other 2 masters this took a few seconds:
01:20:53 1f90131a...moved to provision state "verifying" from state "enroll"
01:20:59 1f90131a...moved to provision state "manageable" from state "verifying"
01:21:04 1f90131a...moved to provision state "inspecting" from state "manageable"
01:21:35 1f90131a...moved to provision state "inspect wait" from state "inspecting"
01:26:26 1f90131a...moved to provision state "inspecting" from state "inspect wait"
01:26:26 1f90131a...moved to provision state "manageable" from state "inspecting"
01:26:30 1f90131a...moved to provision state "cleaning" from state "manageable"
01:27:17 1f90131a...moved to provision state "clean wait" from state "cleaning"
>>> what's this 5 minute gap about ?? <<<
01:32:07 1f90131a...moved to provision state "cleaning" from state "clean wait"
01:32:08 1f90131a...moved to provision state "clean wait" from state "cleaning"
01:32:12 1f90131a...moved to provision state "cleaning" from state "clean wait"
01:32:13 1f90131a...moved to provision state "available" from state "cleaning"
01:32:23 1f90131a...moved to provision state "deploying" from state "available"
01:32:28 1f90131a...moved to provision state "wait call-back" from state "deploying"
01:32:58 1f90131a...moved to provision state "deploying" from state "wait call-back"
01:33:14 1f90131a...moved to provision state "active" from state "deploying"
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible: Always
Repro Steps:
Add: "bridge=br0:enpf0,enpf2 ip=br0:dhcp" to dracut cmdline. Make sure either enpf0/enpf2 is the primary network of the cluster subnet.
The linux bridge can be configured to add a virtual switch between one or many ports. This can be done by a simple machine config that adds:
"bridge=br0:enpf0,enpf2 ip=br0:dhcp"
to the kernel command line options which will be processed by dracut.
The use case of adding such a virtual bridge for simple IEEE802.1 switching is to support PCIe devices that act as co-processors in a baremetal server. For example:
(Diagram: the host's eth0 is connected over PCIe to the co-processor's enpf0, and the co-processor provides the connection to the network.)
This co-processor could be a "DPU" network interface card. Thus the co-processor can be part of the same underlay network as the cluster and pods can be scheduled on the Host and the Co-processor. This allows for pods to be offloaded to the co-processor for scaling workloads.
Actual results:
ovs-configuration service fails.
Expected results:
ovs-configuration service passes with the bridge interface added to the ovs bridge.
Description of problem:
v4.17 baselineCapabilitySet is not recognized.
# ./oc adm release extract --install-config v4.17-basecap.yaml --included --credentials-requests --from quay.io/openshift-release-dev/ocp-release:4.17.0-rc.1-x86_64 --to /tmp/test
error: unrecognized baselineCapabilitySet "v4.17"
# cat v4.17-basecap.yaml
---
apiVersion: v1
platform:
  gcp:
    foo: bar
capabilities:
  baselineCapabilitySet: v4.17
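A hypothetical illustration of this class of failure, not the actual openshift/api or oc code: if capability sets are resolved through a hard-coded map keyed by version, a new "v4.17" entry is rejected until the map is extended in the tooling's vendored API.

package main

import "fmt"

// Illustrative only; the real capability-set names and contents live in openshift/api.
var baselineCapabilitySets = map[string][]string{
	"None":  {},
	"v4.15": {"Build", "Console", "Insights"}, // contents abbreviated
	"v4.16": {"Build", "Console", "Insights"}, // contents abbreviated
	// "v4.17" missing here -> unrecognized baselineCapabilitySet "v4.17"
}

func resolve(name string) ([]string, error) {
	caps, ok := baselineCapabilitySets[name]
	if !ok {
		return nil, fmt.Errorf("unrecognized baselineCapabilitySet %q", name)
	}
	return caps, nil
}

func main() {
	if _, err := resolve("v4.17"); err != nil {
		fmt.Println("error:", err)
	}
}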
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-04-132247
How reproducible:
always
Steps to Reproduce:
1. Run `oc adm release extract --install-config --included` against an install-config file including baselineCapabilitySet: v4.17. 2. 3.
Actual results:
`oc adm release extract` throw unrecognized error
Expected results:
`oc adm release extract` should extract correct manifests
Additional info:
If specifying baselineCapabilitySet: v4.16, it works well.
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator going Degraded is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Context
Some ROSA HCP users host their own container registries (e.g., self-hosted Quay servers) that are only accessible from inside of their VPCs. This is often achieved through the use of private DNS zones that resolve non-public domains like quay.mycompany.intranet to non-public IP addresses. The private registries at those addresses then present self-signed SSL certificates to the client that can be validated against the HCP's additional CA trust bundle.
Problem Description
A user of a ROSA HCP cluster with a configuration like the one described above is encountering errors when attempting to import a container image from their private registry into their HCP's internal registry via oc import-image. Originally, these errors showed up in openshift-apiserver logs as DNS resolution errors, i.e., OCPBUGS-36944. After the user upgraded their cluster to 4.14.37 (which fixes OCPBUGS-36944), openshift-apiserver was able to properly resolve the domain name but complains of HTTP 502 Bad Gateway errors. We suspect these 502 Bad Gateway errors are coming from the Konnectivity-agent while it proxies traffic between the control and data planes.
We've confirmed that the private registry is accessible from the HCP data plane (worker nodes) and that the certificate presented by the registry can be validated against the cluster's additional trust bundle. IOW, curl-ing the private registry from a worker node returns a HTTP 200 OK, but doing the same from a control plane node returns a HTTP 502. Notably, this cluster is not configured with a cluster-wide proxy, nor does the user's VPC feature a transparent proxy.
Version-Release number of selected component
OCP v4.14.37
How reproducible
Can be reliably reproduced, although the network config (see Context above) is quite specific
Steps to Reproduce
oc import-image imagegroup/imagename:v1.2.3 --from=quay.mycompany.intranet/imagegroup/imagename:v1.2.3 --confirm
Actual Results
error: tag v1.2.3 failed: Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway imagestream.image.openshift.io/imagename imported with errors Name: imagename Namespace: mynamespace Created: Less than a second ago Labels: <none> Annotations: openshift.io/image.dockerRepositoryCheck=2024-10-01T12:46:02Z Image Repository: default-route-openshift-image-registry.apps.rosa.clustername.abcd.p1.openshiftapps.com/mynamespace/imagename Image Lookup: local=false Unique Images: 0 Tags: 1 v1.2.3 tagged from quay.mycompany.intranet/imagegroup/imagename:v1.2.3 ! error: Import failed (InternalError): Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway Less than a second ago error: imported completed with errors
Expected Results
Desired container image is imported from private external image registry into cluster's internal image registry without error
Description of problem:
We ignore errors from the existence check in https://github.com/openshift/baremetal-runtimecfg/blob/723290ec4b31bc4e032ff62198ae3dd0d0e36313/pkg/monitor/iptables.go#L116 and that can make it more difficult to debug errors in the healthchecks. In particular, this made it more difficult to debug an issue with permissions on the monitor container because there were no log messages to let us know the check had failed.
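A small sketch of the kind of change implied, assuming the check wraps os.Stat (names are illustrative, this is not the baremetal-runtimecfg code): log the unexpected error instead of silently treating it as "not found", so problems such as a permissions issue on the monitor container show up in the logs.

package main

import (
	"log"
	"os"
)

// fileExists reports whether path exists and, crucially, logs unexpected
// errors instead of discarding them.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	if err == nil {
		return true
	}
	if !os.IsNotExist(err) {
		// Previously this error was dropped; logging it makes the failed
		// healthcheck visible when debugging.
		log.Printf("existence check for %s failed: %v", path, err)
	}
	return false
}

func main() {
	if !fileExists("/var/run/keepalived/iptables-rule-exists") {
		log.Println("sentinel file not present")
	}
}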
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We should decrease the verbosity level for the IBM CAPI module. This will affect the output of the file .openshift_install.log
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/196
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While updating an HC with a controllerAvailabilityPolicy of SingleReplica, the HCP doesn't fully roll out, with 3 pods stuck in Pending:
multus-admission-controller-5b5c95684b-v5qgd 0/2 Pending 0 4m36s network-node-identity-7b54d84df4-dxx27 0/3 Pending 0 4m12s ovnkube-control-plane-647ffb5f4d-hk6fg 0/3 Pending 0 4m21s
This is because these deployments all have requiredDuringSchedulingIgnoredDuringExecution zone anti-affinity and maxUnavailable: 25% (i.e. 1).
Thus the old pod blocks scheduling of the new pod.
Kubelet logs contain entries like:
Jun 13 10:05:14.141073 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:14.141043 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
I'm not sure if that's a problem or not, but it is distracting noise for folks trying to understand Kubelet behavior, and we should either fix the problem, or denoise the red-herring.
Seen in 4.13.44, 4.14.31, and 4.17.0-0.nightly-2024-06-25-162526 (details in Additional info).
Not seen in 4.12.60, so presumably a 4.12 to 4.13 change.
Every time.
1. Run a cluster.
2. Check node/kubelet logs for one control-plane node.
Lots of can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt messages.
No can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt messages.
Checking recent builds in assorted 4.y streams.
4.12.60 > aws-sdn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1803708035177123840/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-156-214.us-west-1.compute.internal ip-10-0-158-171.us-west-1.compute.internal ip-10-0-203-59.us-west-1.compute.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1803708035177123840/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-156-214.us-west-1.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 20 08:47:07.734060 ip-10-0-156-214 ignition[1087]: INFO : files: createFilesystemsFiles: createFiles: op(11): [finished] writing file "/sysroot/etc/kubernetes/kubelet-ca.crt" Jun 20 08:49:29.274949 ip-10-0-156-214 kubenswrapper[1384]: I0620 08:49:29.274923 1384 dynamic_cafile_content.go:119] "Loaded a new CA Bundle and Verifier" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt" Jun 20 08:49:29.275084 ip-10-0-156-214 kubenswrapper[1384]: I0620 08:49:29.275067 1384 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
is clean.
4.13.44 > aws-sdn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1801188570212339712/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-133-167.us-west-1.compute.internal ip-10-0-170-3.us-west-1.compute.internal ip-10-0-203-13.us-west-1.compute.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1801188570212339712/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-133-167.us-west-1.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 13 10:05:00.464260 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:00.464190 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 13 10:05:13.320867 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:13.320824 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 13 10:05:14.141073 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:14.141043 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
is exposed.
4.14.31 > aws-ovn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1803746771264868352/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-17-181.us-west-2.compute.internal ip-10-0-66-68.us-west-2.compute.internal ip-10-0-97-83.us-west-2.compute.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1803746771264868352/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes/ip-10-0-17-181.us-west-2.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 20 11:42:31.931470 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:31.931404 2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 20 11:42:31.980499 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:31.980448 2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 20 11:42:32.757888 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:32.757846 2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
4.17.0-0.nightly-2024-06-25-162526 > aws-ovn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1805639599624556544/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-125-200.ec2.internal ip-10-0-47-81.ec2.internal ip-10-0-8-158.ec2.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1805639599624556544/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes/ip-10-0-8-158.ec2.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 25 19:56:13.452559 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:13.452512 2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt" Jun 25 19:56:13.512277 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:13.512213 2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt" Jun 25 19:56:14.403001 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:14.402953 2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt"
gophercloud is outdated; we need to update it to get the latest dependencies and avoid CVEs.
Description of problem:
Version-Release number of selected component (if applicable):
When navigating from Lightspeed's "Don't show again" link, it can be hard to know which element is relevant. We should look at utilizing Spotlight to highlight the relevant user preference. Also, there is an undesirable gap before the Lightspeed user preference caused by an empty div from data-test="console.telemetryAnalytics".
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
Based on feature https://issues.redhat.com/browse/CONSOLE-3243 (Rename "master" to "control plane node" in node pages), the 'master' option in the 'Filter by Node type' dropdown on the Cluster Utilization section of the Overview page should be updated to 'control plane'. But those changes were overwritten by PR https://github.com/openshift/console/pull/14121, which brings this issue.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-15-032107
How reproducible:
Always
Steps to Reproduce:
1. Make sure your node role has 'control-plane', eg: $ oc get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME qe-uidaily-1016-dsclx-master-0 Ready control-plane,master 3h v1.31.1 10.0.0.4 <none> Red Hat Enterprise Linux CoreOS 418.94.202410111739-0 5.14.0-427.40.1.el9_4.x86_64 cri-o://1.31.1-4.rhaos4.18.gitd8950b8.el9 qe-uidaily-1016-dsclx-master-1 Ready control-plane,master 3h v1.31.1 10.0.0.5 <none> Red Hat Enterprise Linux CoreOS 418.94.202410111739-0 5.14.0-427.40.1.el9_4.x86_64 cri-o://1.31.1-4.rhaos4.18.gitd8950b8.el9 2. Navigate to the Overview page, check the options in the 'Filter by Node type' dropdown list on the Cluster utilization section 3.
Actual results:
control plane option is missing
Expected results:
the 'master' option should be updated to 'control plane'
Additional info:
Description of problem:
The cert-manager operator from redhat-operators is not yet available in the 4.18 catalog. We'll need to use a different candidate in order to update our default catalog images to 4.18 without creating test failures.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
For resources under the Networking menu (e.g. service, route, ingress, networkpolicy), when accessing a non-existing resource, the page should show "404 not found" instead of loading forever.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-10-133647 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1.Access a non-existing resource under Networking menu, eg "testconsole" service with url "/k8s/ns/openshift-console/services/testconsole". 2. 3.
Actual results:
1. The page will always be loading. screenshot: https://drive.google.com/file/d/1HpH2BfVUACivI0KghXhsKt3FYgYFOhxx/view?usp=drive_link
Expected results:
1. Should show "404 not found"
Additional info:
Perform the SnykDuty
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/46
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When setting .spec.storage.azure.networkAccess.type: Internal (without providing vnet and subnet names), the image registry will attempt to discover the vnet by tag. Previous to the installer switching to cluster-api, the vnet tagging happened here: https://github.com/openshift/installer/blob/10951c555dec2f156fad77ef43b9fb0824520015/pkg/asset/cluster/azure/azure.go#L79-L92. After the switch to cluster-api, this code no longer seems to be in use, so the tags are no longer there. From inspection of a failed job, the new tags in use seem to be in the form of `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID` instead of the previous `kubernetes.io_cluster.$infraID`. Image registry operator code responsible for this: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L678-L682 More details in slack discussion with installer team: https://redhat-internal.slack.com/archives/C68TNFWA2/p1726732108990319
Version-Release number of selected component (if applicable):
4.17, 4.18
How reproducible:
Always
Steps to Reproduce:
1. Get an Azure 4.17 or 4.18 cluster 2. oc edit configs.imageregistry/cluster 3. set .spec.storage.azure.networkAccess.type to Internal
Actual results:
The operator cannot find the vnet (look for "not found" in operator logs)
Expected results:
The operator should be able to find the vnet by tag and configure the storage account as private
Additional info:
If we make the switch to look for vnet tagged with `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID`, one thing that needs to be tested is BYO vnet/subnet clusters. What I have currently observed in CI is that the cluster has the new tag key with `owned` value, but for BYO networks the value *should* be `shared`, but I have not tested it. --- Although this bug is a regression, I'm not going to mark it as such because this affects a fairly new feature (introduced on 4.15), and there's a very easy workaround (manually setting the vnet and subnet names when configuring network access to internal).
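A sketch of one possible lookup, using the tag keys quoted above (this is not the cluster-image-registry-operator implementation, and whether BYO networks really carry the "shared" value is, as noted, untested): accept the VNet when either the pre-CAPI tag or the CAPI tag is present with an "owned" or "shared" value.

package main

import "fmt"

// vnetBelongsToCluster accepts a VNet when either tag key format is present
// with an "owned" or "shared" value.
func vnetBelongsToCluster(tags map[string]string, infraID string) bool {
	keys := []string{
		"kubernetes.io_cluster." + infraID,                          // written by the pre-CAPI installer
		"sigs.k8s.io_cluster-api-provider-azure_cluster_" + infraID, // written after the CAPI switch
	}
	for _, k := range keys {
		if v, ok := tags[k]; ok && (v == "owned" || v == "shared") {
			return true
		}
	}
	return false
}

func main() {
	tags := map[string]string{
		"sigs.k8s.io_cluster-api-provider-azure_cluster_mycluster-abcde": "owned",
	}
	fmt.Println(vnetBelongsToCluster(tags, "mycluster-abcde")) // true
}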
Description of problem:
See https://search.dptools.openshift.org/?search=Kubernetes+resource+CRUD+operations+Secret+displays+detail+view+for+newly+created+resource+instance&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Description of problem:
When using the UPDATE_URL_OVERRIDE env variable, the output is confusing:
./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1
2024/06/19 12:22:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38 [INFO] : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38 [INFO] : ⚙️ setting up the environment for you...
2024/06/19 12:22:38 [INFO] : 🔀 workflow mode: mirrorToDisk
I0619 12:22:38.832303 66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38 [INFO] : 🕵️ going to discover the necessary images...
Version-Release number of selected component (if applicable):
./oc-mirror.latest version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202406131541.p0.g157eb08.assembly.stream.el9-157eb08", GitCommit:"157eb085db0ca66fb689220119ab47a6dd9e1233", GitTreeState:"clean", BuildDate:"2024-06-13T17:25:46Z", GoVersion:"go1.22.1 (Red Hat 1.22.1-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Set registry on the ocp cluster; 2) do mirror2disk + disk2mirror with following isc: apiVersion: mirror.openshift.io/v2alpha1 kind: ImageSetConfiguration mirror: additionalImages: - name: quay.io/openshifttest/bench-army-knife@sha256:078db36d45ce0ece589e58e8de97ac1188695ac155bc668345558a8dd77059f6 platform: channels: - name: stable-4.15 type: ocp minVersion: '4.15.10' maxVersion: '4.15.11' graph: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: elasticsearch-operator 3) set ~/.config/containers/registries.conf [[registry]] location = "quay.io" insecure = false blocked = false mirror-by-digest-only = false prefix = "" [[registry.mirror]] location = "my-route-testzy.apps.yinzhou-619.qe.devcluster.openshift.com" insecure = false 4) use the isc from step 2 and mirror2disk with different dir: `./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1`
Actual results:
./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1
2024/06/19 12:22:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38 [INFO] : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38 [INFO] : ⚙️ setting up the environment for you...
2024/06/19 12:22:38 [INFO] : 🔀 workflow mode: mirrorToDisk
I0619 12:22:38.832303 66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38 [INFO] : 🕵️ going to discover the necessary images...
2024/06/19 12:22:38 [INFO] : 🔍 collecting release images...
Expected results:
Give clear information to clarify the UPDATE_URL_OVERRIDE environment variable. Slack discussion is here: https://redhat-internal.slack.com/archives/C050P27C71S/p1718800641718869?thread_ts=1718175617.310629&cid=C050P27C71S
The CPO reconciliation aborts when the OIDC/LDAP IDP validation check fails, and this results in a failure to reconcile any components that are reconciled after that point in the code.
This failure should not be fatal to the CPO reconcile and should likely be reported as a condition on the HC instead (see the sketch after the references below).
xref
Customer incident
https://issues.redhat.com/browse/OCPBUGS-38071
RFE for bypassing the check
https://issues.redhat.com/browse/RFE-5638
PR to proxy the IDP check through the data plane network
https://github.com/openshift/hypershift/pull/4273
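A minimal sketch of the non-fatal pattern described above, using the standard apimachinery condition helpers. The condition type and reason names are hypothetical, not the actual HyperShift condition names:
~~~
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reportIDPValidation records the IDP check result as a status condition
// instead of aborting the reconcile, so later components still get reconciled.
func reportIDPValidation(conditions *[]metav1.Condition, generation int64, validationErr error) {
	cond := metav1.Condition{
		Type:               "ValidIDPConfiguration", // hypothetical condition type
		Status:             metav1.ConditionTrue,
		Reason:             "IDPConfigurationValid", // hypothetical reason
		ObservedGeneration: generation,
	}
	if validationErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "IDPValidationFailed"
		cond.Message = validationErr.Error()
	}
	meta.SetStatusCondition(conditions, cond)
	// The caller would continue reconciling the remaining components
	// regardless of validationErr.
}
~~~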
This is a feature request. Sorry, I couldn't find anywhere else to file it. Our team can also potentially implement this feature, so really we're looking for design input before possibly submitting a PR.
User story:
As a user of on-prem OpenShift, I need to manage DNS for my OpenShift cluster manually. I can already specify an IP address for the API server, but I cannot do this for Ingress. This means that I have to:
I would like to simplify this workflow to:
Implementation suggestion:
Our specific target is OpenStack. We could add `OpenStackLoadBalancerParameters` to `ProviderLoadBalancerParameters`, but the parameter we would be adding is `loadBalancerIP`. This isn't OpenStack-specific. For example, it would be equally applicable to users of either OpenStack's built-in Octavia load balancer, or MetalLB, both of which may reasonably be deployed on OpenStack.
I suggest adding an optional LoadBalancerIP to LoadBalancerStrategy here: https://github.com/openshift/cluster-ingress-operator/blob/8252ac492c04d161fbcf60ef82af2989c99f4a9d/vendor/github.com/openshift/api/operator/v1/types_ingress.go#L395-L440
This would be used to pre-populate spec.loadBalancerIP when creating the Service for the default router.
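A rough sketch of what the proposed field might look like on the existing LoadBalancerStrategy type; the field name, JSON tag, and markers are assumptions for discussion, not a committed API design:
~~~
package example

// LoadBalancerStrategy holds parameters for a load balancer (abridged sketch).
type LoadBalancerStrategy struct {
	// ... existing fields such as Scope and ProviderParameters ...

	// loadBalancerIP is an optional IP address to request for the LoadBalancer
	// Service created for the default router. When set, it would be copied into
	// the Service's spec.loadBalancerIP. Whether the address is honored depends
	// on the underlying implementation (e.g. Octavia or MetalLB).
	// +optional
	LoadBalancerIP string `json:"loadBalancerIP,omitempty"`
}
~~~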
Doc links on the list page seem wrong; some are linking to https://docs.openshift.com/dedicated/, while they should have links like
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.17/html/building_applications/deployments
The list of known plugin names for telemetry does not include kuadrant-console-plugin, which is a Red Hat maintained plugin.
Description of problem:
In upstream and downstream automation testing, we see occasional failures coming from monitoring-plugin For example: Check JUnit report for https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_console/14468/pull-ci-openshift-console-master-e2e-gcp-console/1856100921105190912 Check JUnit report for https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_console/14475/pull-ci-openshift-console-master-e2e-gcp-console/1856095554396753920 Check screenshot when visiting /monitoring/alerts https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-gcp-upi-f7-ui/1855343143403130880/artifacts/gcp-upi-f7-ui/cucushift-e2e/artifacts/ui1/embedded_files/2024-11-09T22:21:41+00:00-screenshot.png
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-11-144244
How reproducible:
more reproducible in automation testing
Steps to Reproduce:
Actual results:
runtime errors
Expected results:
no errors
Additional info:
This is to track the permanent solution for https://issues.redhat.com/browse/OCPBUGS-38289 for >= 4.18, as the field can be set via the Prometheus CR now.
Description of problem:
Log in to the admin console as a normal user; there is a "User workload notifications" option in the "Notifications" menu on the "User Preferences" page. It's not necessary, since normal users have no permission to get alerts.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-05-23-103225
How reproducible:
Always
Steps to Reproduce:
1.Login on admin console with normal user, go to "User Preferences" page. 2.Click "Notifications" menu, check/uncheck "Hide user workload notifications" for "User workload notifications" 3.
Actual results:
2. User could set the option.
Expected results:
3. It's better not to show the option for "User workload notifications", since normal users cannot get alerts and there is no Notification Drawer on the masthead.
Additional info:
Screenshots: https://drive.google.com/drive/folders/15_qGw1IkbK1_rIKNiageNlYUYKTrsdKp?usp=share_link
Description of problem:
The pinned images functionality is not working
Version-Release number of selected component (if applicable):
IPI on AWS version: $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.18.0-0.nightly-2024-10-28-052434 True False 6h46m Cluster version is 4.18.0-0.nightly-2024-10-28-052434
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview 2. Create a pinnedimagesets resource $ oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: PinnedImageSet metadata: labels: machineconfiguration.openshift.io/role: worker name: tc-73623-worker-pinned-images spec: pinnedImages: - name: "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019" - name: quay.io/openshifttest/alpine@sha256:be92b18a369e989a6e86ac840b7f23ce0052467de551b064796d67280dfa06d5 EOF
Actual results:
The images are not pinned and the pool is degraded We can see these logs in the MCDs I1028 14:26:32.514096 2341 pinned_image_set.go:304] Reconciling pinned image set: tc-73623-worker-pinned-images: generation: 1 E1028 14:26:32.514183 2341 pinned_image_set.go:240] failed to get image status for "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019": rpc error: code = Unavailable desc = name resolver error: produced zero addresses And we can see the machineconfignodes resources reporting pinnedimagesets degradation: - lastTransitionTime: "2024-10-28T14:27:58Z" message: 'failed to get image status for "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019": rpc error: code = Unavailable desc = name resolver error: produced zero addresses' reason: PrefetchFailed status: "True" type: PinnedImageSetsDegraded
Expected results:
The images should be pinned without errors.
Additional info:
Slack conversation: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1730125766377509 This is Sam's guess (thank you [~sbatschelet] for your quick help, I really appreciate it): My guess is that it is related to https://github.com/openshift/machine-config-operator/pull/4629 Specifically the changes to pkg/daemon/cri/cri.go where we swapped out DialContext for NewClient. Per docs. One subtle difference between NewClient and Dial and DialContext is that the former uses "dns" as the default name resolver, while the latter use "passthrough" for backward compatibility. This distinction should not matter to most users, but could matter to legacy users that specify a custom dialer and expect it to receive the target string directly.
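For illustration only, a hedged sketch of the resolver difference described above, assuming a grpc-go version that provides grpc.NewClient; the real MCO/CRI client code (and its unix-socket dialing) is not reproduced here:
~~~
package example

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newConn shows one way to keep the legacy behavior when moving from
// grpc.DialContext to grpc.NewClient: NewClient resolves targets with the
// "dns" resolver by default, while prefixing "passthrough:///" hands the
// address to the dialer unchanged, as DialContext used to do.
func newConn(target string) (*grpc.ClientConn, error) {
	return grpc.NewClient("passthrough:///"+target,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
}
~~~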
Description of problem:
Remove the extra '.' from the below INFO message when running the add-nodes workflow: INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Run oc adm node-image create command to create a node iso 2. See the INFO message at the end 3.
Actual results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
Expected results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso. The ISO is valid up to 2024-08-15T16:48:00Z
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/8957
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.
The problem hits ONLY oauth, affecting both new and reused connections as well as the cached variants, meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator degraded signal is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Description of problem:
Before the kubelet systemd service runs the kubelet binary it calls the restorecon command: https://github.com/openshift/machine-config-operator/blob/master/templates/worker/01-worker-kubelet/on-prem/units/kubelet.service.yaml#L13 But the restorecon command expects a path to be given; providing a path is mandatory, see the man page: https://linux.die.net/man/8/restorecon At the moment the command does nothing and the error is swallowed due to the dash (-) at the beginning of the command. This results in files that are labeled with wrong SELinux labels. For example: after https://github.com/containers/container-selinux/pull/329 got merged, /var/lib/kubelet/pod-resources/* is expected to be running with the kubelet_var_lib_t label, but it's not; it's running with the old label, container_var_lib_t.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Check the SELinux labels of files under the system with ls -Z command.
Actual results:
Files are labeled with wrong SELinux labels
Expected results:
Files' SELinux labels are supposed to match their configuration as captured in the container-selinux package.
Additional info:
Description of problem:
We have an OKD 4.12 cluster which has persistent and increasing ingresswithoutclassname alerts with no ingresses normally present in the cluster. I believe the ingress without a classname being counted is created as part of the ACME validation process managed by the cert-manager operator with its OpenShift Route addon, and is torn down once the ACME validation is complete.
Version-Release number of selected component (if applicable):
4.12.0-0.okd-2023-04-16-041331
How reproducible:
Seems very consistent. It went away during an update but came back shortly after and continues to increase.
Steps to Reproduce:
1. create ingress w/o classname 2. see counter increase 3. delete classless ingress 4. counter does not decrease.
Additional info:
https://github.com/openshift/cluster-ingress-operator/issues/912
Please review the following PR: https://github.com/openshift/csi-operator/pull/241
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/269
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The reality is that a lot of bare-metal clusters end up using platform=none. For example, SNOs only have this platform value, so SNO users can never use a provisioning network (and thus any hardware that does not support virtual media). UPI and UPI-like clusters are by definition something that operators configure for themselves, so locking them out of features makes even less sense.
With OpenStack now being based on OCP, I expect to see a sharp increase in complaints about this topic.
Add e2e tests to verify that the "Show deprecated operators in OperatorHub" work functions as expected.
Open question:
What kind of tests would be most appropriate for this situation, considering the dependencies required for end-to-end (e2e) tests?
Dependencies:
AC:
In https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-ci-release-4.18-e2e-openstack-ovn-etcd-scaling/1834144693181485056 I noticed the following panic:
Undiagnosed panic detected in pod expand_less 0s { pods/openshift-monitoring_prometheus-k8s-1_prometheus_previous.log.gz:ts=2024-09-12T09:30:09.273Z caller=klog.go:124 level=error component=k8s_client_runtime func=Errorf msg="Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3180480), concrete:(*abi.Type)(0x34a31c0), asserted:(*abi.Type)(0x3a0ac40), missingMethod:\"\"} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)\ngoroutine 13218 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x32f1080, 0xc05be06840})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x90\nk8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc010ef6000?})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b\npanic({0x32f1080?, 0xc05be06840?})\n\t/usr/lib/golang/src/runtime/panic.go:770 +0x132\ngithub.com/prometheus/prometheus/discovery/kubernetes.NewEndpoints.func11({0x34a31c0?, 0xc05bf3a580?})\n\t/go/src/github.com/prometheus/prometheus/discovery/kubernetes/endpoints.go:170 +0x4e\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/controller.go:253\nk8s.io/client-go/tools/cache.(*processorListener).run.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:977 +0x9f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00fc92f70, {0x456ed60, 0xc031a6ba10}, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc011678f70, 0x3b9aca00, 0x0, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161\nk8s.io/client-go/tools/cache.(*processorListener).run(0xc04c607440)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52\ncreated by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 12933\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73\n"}
This issue seems relatively common on OpenStack; these runs seem to very frequently hit this failure.
Linked test name: Undiagnosed panic detected in pod
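The panic is the classic informer-delete pitfall: an OnDelete handler can receive a cache.DeletedFinalStateUnknown tombstone rather than the typed object. A minimal client-go sketch of the usual handling, not the actual Prometheus patch:
~~~
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// onNodeDelete unwraps a possible DeletedFinalStateUnknown tombstone before
// asserting the concrete type, avoiding the interface-conversion panic.
func onNodeDelete(obj interface{}) {
	node, ok := obj.(*v1.Node)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			return // unexpected type; nothing to do
		}
		node, ok = tombstone.Obj.(*v1.Node)
		if !ok {
			return
		}
	}
	_ = node // handle the deleted Node here
}
~~~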
Description of problem:
Alerts with non-standard severity labels are sent to Telemeter.
Version-Release number of selected component (if applicable):
All supported versions
How reproducible:
Always
Steps to Reproduce:
1. Create an always firing alerting rule with severity=foo. 2. Make sure that telemetry is enabled for the cluster. 3.
Actual results:
The alert can be seen on the telemeter server side.
Expected results:
The alert is dropped by the telemeter allow-list.
Additional info:
Red Hat operators should use standard severities: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide Looking at the current data, it looks like ~2% of the alerts reported to Telemeter have an invalid severity.
Description of problem:
After upgrading OCP and LSO to version 4.14, elasticsearch pods in the openshift-logging deployment are unable to schedule to their respective nodes and remain Pending, even though the LSO managed PVs are bound to the PVCs. A test pod using a newly created test PV managed by the LSO is able to schedule correctly however.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Consistently
Steps to Reproduce:
1. 2. 3.
Actual results:
Pods consuming previously existing LSO managed PVs are unable to schedule and remain in a Pending state after upgrading OCP and LSO to 4.14.
Expected results:
That pods would be able to consume LSO managed PVs and schedule correctly to nodes.
Additional info:
Description of problem:
When HO is installed without a pull secret, the shared ingress controller fails to create the router pod because the pull secret is missing
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1.Install HO without pullsecret 2.Watch HO report error "error":"failed to get pull secret &Secret{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [][]},Data:map[string[]byte{},Type:,StringData:map[string]string{},Immutabl:nil,}: Secret \"pull-secret\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller. 3. Observe that no Router pod is created in the hypershift sharedingress namespace
Actual results:
Router pod doesn't get created in the hypershift sharedingress namespace
Expected results:
Router pod gets created in the hypershift sharedingress namespace
Additional info:
Description of problem:
The description and name for GCP Pool ID are not consistent. The issue is related to bug https://issues.redhat.com/browse/OCPBUGS-38557
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-19-002129
How reproducible:
Always
Steps to Reproduce:
1. Prepare a WI/FI enabled GCP cluster 2. Go to the Web Terminal operator installation page 3. Check the description and name for GCP Pool ID
Actual results:
The description and name for GCP Pool ID are not consistent
Expected results:
The description and name for GCP Pool ID should be consistent
Additional info:
Screenshot: https://drive.google.com/file/d/1PwiH3xk39pGzCgcHPzIHlv3ABzXYqz1O/view?usp=drive_link
When the openshift-install agent wait-for bootstrap-complete command logs the status of the host validations, it logs the same hostname for all validations, regardless of which host they apply to. This makes it impossible for the user to determine which host needs remediation when a validation fails.
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
This is a spinoff of https://issues.redhat.com/browse/OCPBUGS-38012. For additional context please see that bug. The TLDR is that Restart=on-failure for oneshot units were only supported in systemd v244 and onwards, meaning any bootimage for 4.12 and previous doesn't support this on firstboot, and upgraded clusters would no longer be able to scale nodes if it references any such service. Right now this is only https://github.com/openshift/machine-config-operator/blob/master/templates/common/openstack/units/afterburn-hostname.service.yaml#L16-L24 which isn't covered by https://issues.redhat.com/browse/OCPBUGS-38012
Version-Release number of selected component (if applicable):
4.16 right now
How reproducible:
Uncertain, but https://issues.redhat.com/browse/OCPBUGS-38012 is 100%
Steps to Reproduce:
1.install old openstack cluster 2.upgrade to 4.16 3.attempt to scale node
Actual results:
Expected results:
Additional info:
Description of problem:
panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1a774eb]goroutine 11358 [running]: testing.tRunner.func1.2({0x1d3d600, 0x3428a50}) /usr/lib/golang/src/testing/testing.go:1631 +0x24a testing.tRunner.func1() /usr/lib/golang/src/testing/testing.go:1634 +0x377 panic({0x1d3d600?, 0x3428a50?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/cluster-ingress-operator/test/e2e.updateDNSConfig(...) /go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:89 github.com/openshift/cluster-ingress-operator/test/e2e.TestIngressStatus(0xc000511380) /go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:53 +0x34b testing.tRunner(0xc000511380, 0x218c9f8) /usr/lib/golang/src/testing/testing.go:1689 +0xfb created by testing.(*T).Run in goroutine 11200 /usr/lib/golang/src/testing/testing.go:1742 +0x390 FAIL github.com/openshift/cluster-ingress-operator/test/e2e 1612.553s FAIL make: *** [Makefile:56: test-e2e] Error 1
Version-Release number of selected component (if applicable):
master
How reproducible:
run the cluster-ingress-operator e2e tests against the OpenStack platform.
Steps to Reproduce:
1. 2. 3.
Actual results:
the nil pointer error
Expected results:
no error
Additional info:
Description of problem:
- One node [rendezvous] failed to be added to the cluster and there are some pending CSRs. - omc get csr NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-44qjs 21m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-9n9hc 5m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-9xw24 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-brm6f 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-dz75g 36m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-l8c7v 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-mv7w5 52m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-v6pgd 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
In order to complete the installation, the customer needs to approve those CSRs manually.
Steps to Reproduce:
agent-based installation.
Actual results:
CSRs are in Pending state.
Expected results:
CSRs should be approved automatically
Additional info:
Logs : https://drive.google.com/drive/folders/1UCgC6oMx28k-_WXy8w1iN_t9h9rtmnfo?usp=sharing
A string comparison is being done with "-eq"; it should be using "=" instead.
[derekh@u07 assisted-installer-agent]$ sudo podman build -f Dockerfile.ocp STEP 1/3: FROM registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.21-openshift-4.16 AS builder STEP 2/3: RUN if [ "$(arch)" -eq "x86_64" ]; then dnf install -y biosdevname dmidecode; fi /bin/sh: line 1: [: x86_64: integer expression expected --> cb5707d9d703 STEP 3/3: RUN if [ "$(arch)" -eq "aarch64" ]; then dnf install -y dmidecode; fi /bin/sh: line 1: [: x86_64: integer expression expected COMMIT --> 0b12a705f47e 0b12a705f47e015f43d7815743f2ad71da764b1358decc151454ec8802a827fc
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/85
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
When discovering ARM hosts and trying to install CNV, I get the following
From inventory, CPU flags are:
cpu":{ "architecture":"aarch64", "count":16, "flags":[ "fp", "asimd", "evtstrm", "aes", "pmull", "sha1", "sha2", "crc32", "atomics", "fphp", "asimdhp", "cpuid", "asimdrdm", "lrcpc", "dcpop", "asimddp", "ssbs" ], "model_name":"Neoverse-N1"
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
When a normal user without any projects visits the Networking pages, they are always loading
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-08-130531
How reproducible:
Always
Steps to Reproduce:
1. user without any project visit Services, Routes, Ingresses, NetworkPolicies page 2. 3.
Actual results:
These list pages are always loading
Expected results:
Show the getting started guide and dim the resources list
Additional info:
Description of problem:
The placeholder "Select one or more NetworkAttachmentDefinitions" is highlighted while selecting nad
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We need to remove the dra_manager_state on kubelet restart to prevent mismatch errors on restart with TechPreview or DevPreview clusters.
failed to run Kubelet: failed to create claimInfo cache: error calling GetOrCreate() on checkpoint state: failed to get checkpoint dra_manager_state: checkpoint is corrupted"
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700
The cluster-dns-operator repository vendors controller-runtime v0.17.3, which uses Kubernetes 1.29 packages. The cluster-dns-operator repository also vendors k8s.io/* v0.29.2 packages. However, OpenShift 4.17 is based on Kubernetes 1.30.
4.17.
Always.
Check https://github.com/openshift/cluster-dns-operator/blob/release-4.17/go.mod.
The sigs.k8s.io/controller-runtime package is at v0.17.3, and the k8s.io/* packages are at v0.29.2.
The sigs.k8s.io/controller-runtime package is at v0.18.0 or newer, and the k8s.io/* packages are at v0.30.0 or newer.
The controller-runtime v0.18 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.18.0.
Description of problem:
No pagination on the NetworkPolicies table list
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-212926 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Networking -> NetworkPolicies page, create multiple resources, at least more than 20 2. Check the NetworkPolicies table list 3.
Actual results:
No pagination on the table
Expected results:
Add pagination; it could also be controlled by the 'pagination_nav-control' related button/function
Additional info:
Converted the story tracking i18n upload/download routine tasks to a bug so that it could be backported to 4.17, as this latest translations batch contains missing translations, including the ES language for the 4.17 release.
Original story: https://issues.redhat.com/browse/CONSOLE-4238
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Running oc scale on a nodepool fails with 404 not found
Version-Release number of selected component (if applicable):
Latest hypershift operator
How reproducible:
100%
Steps to Reproduce:
Actual results:
Scaling fails
[2024-10-20 22:13:17] + oc scale nodepool/assisted-test-cluster -n assisted-spoke-cluster --replicas=1
[2024-10-20 22:13:17] Error from server (NotFound): nodepools.hypershift.openshift.io "assisted-test-cluster" not found
Expected results:
Scaling succeeds
Additional info:
Discovered in our CI tests beginning October 17th https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-cluster-api-provider-agent-master-e2e-ai-operator-ztp-capi-periodic
Description of problem:
see from screen recording https://drive.google.com/file/d/1LwNdyISRmQqa8taup3nfLRqYBEXzH_YH/view?usp=sharing
dev console, "Observe -> Metrics" tab, input in in the query-browser input text-area, the cursor would focus in the project drop-down list, this issue exists in 4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129, no such issue with admin console
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
cursor would focus in the project drop-down
Expected results:
cursor should not move
Additional info:
Description of the problem:
[Staging] BE 2.35.0, UI 2.34.2 - User is not able to select ODF once CNV is selected, as LVMS is repeatedly enabled
How reproducible:
100%
Steps to reproduce:
1. Create new cluster
2. Select cnv
3. LVMS is enabled; disabling it ends up with it being enabled again
Actual results:
Expected results:
Description of problem:
Customer has a cluster in AWS that was born on an old OCP version (4.7) and was upgraded all the way through 4.15. During the lifetime of the cluster they changed the DHCP option in AWS to "domain name". During node provisioning for MachineSet scaling, the Machine can successfully be created at the cloud provider but the Node is never added to the cluster. The CSRs remain pending and do not get auto-approved. This issue is possibly related or similar to the bug fixed via https://issues.redhat.com/browse/OCPBUGS-29290
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
CSRs don't get auto-approved. New nodes have a different domain name when the CSR is approved manually.
Expected results:
CSRs should get approved automatically and the domain name scheme should not change.
Additional info:
Description of problem:
Navigation: Storage -> VolumeSnapshots -> kebab-menu -> Mouse hover on 'Restore as new PVC' Issue: "Volume Snapshot is not Ready" is in English.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Steps to Reproduce:
1. Log into webconsole and add "?pseudolocalization=true&lng=en" to URL 2. Navigate to Storage -> VolumeSnapshots -> kebab-menu -> Mouse hover on 'Restore as new PVC' 3. "Volume Snapshot is not Ready" is in English.
Actual results:
Content is not marked for translation
Expected results:
Content should be marked for translation
Additional info:
Reference screenshot added
flowschemas.v1beta3.flowcontrol.apiserver.k8s.io used in manifests/09_flowschema.yaml
Description of problem:
The fix to remove the ssh connection and just add an ssh port test causes a problem with ssh, as the address is not formatted correctly. We see:
level=debug msg=Failed to connect to the Rendezvous Host on port 22: dial tcp: address fd2e:6f44:5dd8:c956::50:22: too many colons in address
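For reference, the usual Go fix for this class of error is to build the address with net.JoinHostPort, which brackets IPv6 literals. A minimal sketch of a port-reachability check along those lines; the installer's actual implementation is not shown here:
~~~
package example

import (
	"net"
	"time"
)

// sshPortReachable reports whether TCP port 22 on host accepts connections.
// net.JoinHostPort produces "[<ipv6>]:22" for IPv6 hosts instead of the
// malformed "<ipv6>:22" that triggers "too many colons in address".
func sshPortReachable(host string) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, "22"), 5*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
~~~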
Description of problem:
The CSS of some components isn't loading properly (Banner, Jumplinks)
See screenshot: https://photos.app.goo.gl/2Z1cK5puufGBVBcu5
On the screen cast, ex-aao in namespace default is a banner, and should look like: https://photos.app.goo.gl/n4LUgrGNzQT7n1Pr8
The vertical jumplinks should look like: https://photos.app.goo.gl/8GAs71S43PnAS7wH7
You can test our plugin: https://github.com/artemiscloud/activemq-artemis-self-provisioning-plugin/pull/278
1. yarn
2. yarn start
3. navigate to http://localhost:9000/k8s/ns/default/add-broker
Description of problem:
Customers are unable to scale up OCP nodes when the initial setup is done with OCP 4.8/4.9 and then upgraded to 4.15.22/4.15.23. At first the customer observed that the node scale-up failed and /etc/resolv.conf was empty on the nodes. As a workaround, the customer copied the resolv.conf content from a correct resolv.conf, after which setup of the new node continued. They then inspected the rendered MachineConfig assembled from 00-worker and suspected that something was wrong with the on-prem-resolv-prepender.service definition. As a workaround, the customer manually changed this service definition, which allowed them to scale up new nodes.
Version-Release number of selected component (if applicable):
4.15 , 4.16
How reproducible:
100%
Steps to Reproduce:
1. Install OCP vSphere IPI cluster version 4.8 or 4.9 2. Check "on-prem-resolv-prepender.service" service definition 3. Upgrade it to 4.15.22 or 4.15.23 4. Check if the node scaling is working 5. Check "on-prem-resolv-prepender.service" service definition
Actual results:
Unable to scale up nodes with the default service definition. After manually making changes to the service definition, scaling works.
Expected results:
Node scaling should work without making any manual changes to the service definition.
Additional info:
on-prem-resolv-prepender.service content on clusters built with 4.8/4.9 and then upgraded to 4.15.22/4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0 -----------> this
[Service]
Type=oneshot
#Restart=on-failure -----------> this
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 where scaling is working fine:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
Observed this in the rendered MachineConfig which is assembled from 00-worker.
Description of problem:
If the `template:` field in the vSphere platform spec is defined, the installer should not be downloading the OVA
Version-Release number of selected component (if applicable):
4.16.x 4.17.x
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Failed to create NetworkAttachmentDefinition for namespace scoped CRD in layer3
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. apply CRD yaml file 2. check the NetworkAttachmentDefinition status
Actual results:
status with error
Expected results:
NetworkAttachmentDefinition has been created
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/275
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
With the newer azure-sdk-for-go replacing go-autorest, there was a change to use ClientCertificateCredential that did not include the `SendCertificateChain` option by default that used to be there. The ARO team requires this be set otherwise the 1p integration for SNI will not work. Old version: https://github.com/Azure/go-autorest/blob/f7ea664c9cff3a5257b6dbc4402acadfd8be79f1/autorest/adal/token.go#L262-L264 New version: https://github.com/openshift/installer-aro/pull/37/files#diff-da950a4ddabbede621d9d3b1058bb34f8931c89179306ee88a0e4d76a4cf0b13R294
Version-Release number of selected component (if applicable):
This was introduced in the OpenShift installer PR: https://github.com/openshift/installer/pull/6003
How reproducible:
Every time we authenticate using SNI in Azure.
Steps to Reproduce:
1. Configure a service principal in the Microsoft tenant using SNI 2. Attempt to run the installer using client-certificate credentials to install a cluster with credentials mode in manual
Actual results:
Installation fails as we're unable to authenticate using SNI.
Expected results:
We're able to authenticate using SNI.
Additional info:
This should not have any affect on existing non-SNI based authentication methods using client certificate credentials. It was previously set in autorest for golang, but is not defaulted to in the newer azure-sdk-for-go. Note that only first party Microsoft services will be able to leverage SNI in Microsoft tenants. The test case for this on the installer side would be to ensure it doesn't break manual credential mode installs using a certificate pinned to a service principal.
All we would need changed is to pass the `SendCertificateChain: true` option only on client certificate credentials. Ideally we could backport this as well to all OpenShift versions which received the migration from AAD to Microsoft Graph changes.
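A minimal sketch of the requested change with the newer azidentity package, assuming the certificate chain and key are already loaded; the installer's actual call site is not reproduced here:
~~~
package example

import (
	"crypto"
	"crypto/x509"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

// newSNICredential builds a client-certificate credential that also sends the
// certificate chain (x5c header), which first-party SNI authentication needs.
func newSNICredential(tenantID, clientID string, certs []*x509.Certificate, key crypto.PrivateKey) (*azidentity.ClientCertificateCredential, error) {
	return azidentity.NewClientCertificateCredential(tenantID, clientID, certs, key,
		&azidentity.ClientCertificateCredentialOptions{
			// go-autorest used to send the chain implicitly; the new SDK
			// defaults this to false, so it must be set explicitly.
			SendCertificateChain: true,
		})
}
~~~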
Description of problem:
When the image from a build is rolling out on the nodes, the update progress on the node is not displaying correctly.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Enable OCL functionality 2. Opt the pool in by MachineOSConfig 3. Wait for the image to build and roll out 4. Track mcp update status by oc get mcp
Actual results:
The MCP starts with 0 ready nodes. While 1-2 nodes have already been updated, the count still remains 0. The count jumps to 3 only when all the nodes are ready.
Expected results:
The update progress should be reflected in the mcp status correctly.
Additional info:
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/333
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CAPA is leaking one EIP in the bootstrap life cycle when creating clusters on 4.16+ with a BYO IPv4 Pool in the config. The install logs show the message about the duplicated EIP; there is a kind of race condition where the EIP is created and an association is attempted while the instance isn't ready (Running state): ~~~ time="2024-05-08T15:49:33-03:00" level=debug msg="I0508 15:49:33.785472 2878400 recorder.go:104] \"Failed to associate Elastic IP for \\\"ec2-i-03de70744825f25c5\\\": InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation.\\n\\tstatus code: 400, request id: 7582391c-b35e-44b9-8455-e68663d90fed\" logger=\"events\" type=\"Warning\" object=[...]\"name\":\"mrb-byoip-32-kbcz9\",\"[...] reason=\"FailedAssociateEIP\"" time="2024-05-08T15:49:33-03:00" level=debug msg="E0508 15:49:33.803742 2878400 controller.go:329] \"Reconciler error\" err=<" time="2024-05-08T15:49:33-03:00" level=debug msg="\tfailed to reconcile EIP: failed to associate Elastic IP \"eipalloc-08faccab2dbb28d4f\" to instance \"i-03de70744825f25c5\": InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation." ~~~ The EIP is deleted when the bootstrap node is removed after a successful installation, although the bug impacts any new machine with a public IP set using BYO IPv4 provisioned by CAPA. An upstream issue has been opened: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. create install-config.yaml setting platform.aws.publicIpv4Pool=poolID 2. create cluster 3. check the AWS Console, EIP page filtering by your cluster, you will see the duplicated EIP, while only one is associated to the correct bootstrap instance
Actual results:
Expected results:
- installer/capa creates only one EIP for bootstrap when provisioning the cluster - no error messages for expected behavior (ec2 association errors in pending state)
Additional info:
CAPA issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038
openshift/api was bumped in CNO without running codegen; codegen needs to be run.
Please review the following PR: https://github.com/openshift/configmap-reload/pull/64
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For some provisioners, the access modes are not correct. It would be good if someone from the storage team could confirm the access mode values in https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L107
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-12-101500
How reproducible:
Always
Steps to Reproduce:
1. setup a cluster in GCP, check storageclasses $ oc get sc NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE ssd-csi pd.csi.storage.gke.io Delete WaitForFirstConsumer true 5h37m standard-csi (default) pd.csi.storage.gke.io Delete WaitForFirstConsumer true 5h37m 2. goes to PVC creation page, choose any storageclass in the dropdown and check `Access mode` list
Actual results:
there is only `RWO` access mode
Expected results:
pd.csi.storage.gke.io supports both RWO and RWOP access modes; see the supported access modes reference at https://docs.openshift.com/container-platform/4.15/storage/understanding-persistent-storage.html#pv-access-modes_understanding-persistent-storage
Additional info:
The fields for `last_installation_preparation_status` for a cluster are currently reset when the user sends a request to `install` the cluster.
In the case that multiple requests are received, this can lead to this status being illegally cleared when it should not be.
It is safer to move this to the state machine where it can be ensured that states have changed in the correct way prior to the reset of this field.
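A rough sketch of the idea with hypothetical names: the reset happens inside the transition handler that the state machine invokes once the installing transition has been accepted, rather than in the HTTP handler that receives the install request, so a rejected duplicate request can no longer clear the previous status:
~~~
package example

// cluster is a stand-in for the assisted-service cluster model; all field and
// function names here are illustrative only.
type cluster struct {
	Status                            string
	LastInstallationPreparationStatus string
	LastInstallationPreparationReason string
}

// onPreparingForInstallation would run only after the state machine has
// validated and accepted the transition, which is the safe point to reset
// the preparation status fields.
func onPreparingForInstallation(c *cluster) {
	c.LastInstallationPreparationStatus = ""
	c.LastInstallationPreparationReason = ""
	c.Status = "preparing-for-installation"
}
~~~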
Description of problem:
L3 Egress traffic from pod in segmented network does not work.
Version-Release number of selected component (if applicable):
build openshift/ovn-kubernetes#2274,openshift/api#2005
oc version
Client Version: 4.15.9 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: 4.17.0-0.ci.test-2024-08-28-123437-ci-ln-v5g4wb2-latest Kubernetes Version: v1.30.3-dirty
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster UPI GCP with build from cluster bot
2. Create a namespace test with a NAD as below
oc -n test get network-attachment-definition l3-network-nad -oyaml
apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: creationTimestamp: "2024-08-28T17:44:14Z" generation: 1 name: l3-network-nad namespace: test resourceVersion: "108224" uid: 5db4ca26-39dd-45b7-8016-215664e21f5d spec: config: | { "cniVersion": "0.3.1", "name": "l3-network", "type": "ovn-k8s-cni-overlay", "topology":"layer3", "subnets": "10.150.0.0/16", "mtu": 1300, "netAttachDefName": "test/l3-network-nad", "role": "primary" }
3. Create a pod in the segmented namespace test
oc -n test exec -it hello-pod -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:83:00:11 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.0.17/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:11/64 scope link valid_lft forever preferred_lft forever 3: ovn-udn1@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc noqueue state UP group default link/ether 0a:58:0a:96:03:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.150.3.3/24 brd 10.150.3.255 scope global ovn-udn1 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe96:303/64 scope link valid_lft forever preferred_lft forever
oc -n test exec -it hello-pod -- ip r
default via 10.150.3.1 dev ovn-udn1
10.128.0.0/14 via 10.131.0.1 dev eth0
10.131.0.0/23 dev eth0 proto kernel scope link src 10.131.0.17
10.150.0.0/16 via 10.150.3.1 dev ovn-udn1
10.150.3.0/24 dev ovn-udn1 proto kernel scope link src 10.150.3.3
100.64.0.0/16 via 10.131.0.1 dev eth0
100.65.0.0/16 via 10.150.3.1 dev ovn-udn1
172.30.0.0/16 via 10.150.3.1 dev ovn-udn1
4. Try to curl the IP echo server running outside the cluster to see it fail.
oc -n test exec -it hello-pod -- curl 10.0.0.2:9095 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms command terminated with exit code 28
Actual results:
curl request fails
Expected results:
curl request should pass
Additional info:
The egress from pod in regular namespace works
oc -n test1 exec -it hello-pod -- curl 10.0.0.2:9095 --connect-timeout 5
10.0.128.4
Description of problem:
The catalogsource file generated for mirror2mirror is invalid when using the local cache
Version-Release number of selected component (if applicable):
./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202409091841.p0.g45b1fcd.assembly.stream.el9-45b1fcd", GitCommit:"45b1fcd9df95420d5837dfdd2775891ae3dd6adf", GitTreeState:"clean", BuildDate:"2024-09-09T20:48:47Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. run the mirror2mirror command : kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: operators: - catalog: quay.io/openshifttest/nginxolm-operator-index:mirrortest1 `oc-mirror -c config-head.yaml --workspace file://out-head docker://my-route-zhouy.apps.yinzhou0910.qe.azure.devcluster.openshift.com --v2 --dest-tls-verify=false`
Actual results:
The catalogsource file is invalid and is created twice: 2024/09/10 10:47:35 [INFO] : 📄 Generating CatalogSource file... 2024/09/10 10:47:35 [INFO] : out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml file created 2024/09/10 10:47:35 [INFO] : out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml file created 2024/09/10 10:47:35 [INFO] : mirror time : 1m41.028961606s 2024/09/10 10:47:35 [INFO] : 👋 Goodbye, thank you for using oc-mirror [fedora@preserve-fedora-yinzhou yinzhou]$ ll out11re/working-dir/cluster-resources/ total 8 -rw-r--r--. 1 fedora fedora 242 Sep 10 10:47 cs-redhat-operator-index-v4-15.yaml -rw-r--r--. 1 fedora fedora 289 Sep 10 10:47 idms-oc-mirror.yaml [fedora@preserve-fedora-yinzhou yinzhou]$ cat out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: cs-redhat-operator-index-v4-15 namespace: openshift-marketplace spec: image: localhost:55000/redhat/redhat-operator-index:v4.15 sourceType: grpc status: {}
Expected results:
The catalogsource file should be created with the registry route, not the local cache
Additional info:
Description of problem:
IDMS is set on HostedCluster and reflected in their respective CR in-cluster. Customers can create, update, and delete these today. In-cluster IDMS has no impact.
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
100%
Steps to Reproduce:
1. Create HCP 2. Create IDMS 3. Observe it does nothing
Actual results:
IDMS doesn't change anything if manipulated in data plane
Expected results:
IDMS either allows updates OR IDMS updates are blocked.
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/304
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When the MachineConfig tab is opened on the console, the below error is displayed: Oh no! Something went wrong Type Error Description: Cannot read properties of undefined (reading 'toString")
Version-Release number of selected component (if applicable):
OCP version 4.17.3
How reproducible:
Every time at the customer's end.
Steps to Reproduce:
1. Go to the console. 2. Under the Compute tab go to the MachineConfig tab.
Actual results:
Oh no! Something went wrong
Expected results:
Should be able to see all the available MachineConfigs.
Additional info:
Description of problem:
When Ingress configuration is specified for a HostedCluster in .spec.configuration.ingress, the configuration never makes it into the hosted cluster because the VAP ingress-config-validation.managed.openshift.io rejects it.
Version-Release number of selected component (if applicable):
4.18 Hosted ROSA
How reproducible:
Always
Steps to Reproduce:
1. Create a hosted cluster in ROSA with spec: configuration: ingress: domain: "" loadBalancer: platform: aws: type: NLB type: AWS 2. Wait for the cluster to come up 3.
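Step 1's flattened spec, re-indented for readability (a sketch; same fields and values as above):

spec:
  configuration:
    ingress:
      domain: ""
      loadBalancer:
        platform:
          aws:
            type: NLB
          type: AWS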
Actual results:
Cluster never finishes applying the payload (reaches Complete) because the console operator fails to reconcile its route.
Expected results:
Cluster finishes applying the payload and reaches Complete
Additional info:
The following error is reported in the hcco log: {"level":"error","ts":"2024-11-12T17:33:09Z","msg":"Reconciler error","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"f4216970-af97-4093-ae72-b7dbe452b767","error":"failed to reconcile global configuration: failed to reconcile ingress config: admission webhook \"ingress-config-validation.managed.openshift.io\" denied the request: Only privileged service accounts may access","errorCauses":[{"error":"failed to reconcile global configuration: failed to reconcile ingress config: admission webhook \"ingress-config-validation.managed.openshift.io\" denied the request: Only privileged service accounts may access"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222"}
Description of problem:
Feature https://issues.redhat.com/browse/MGMT-18411 went into assisted-installer v2.34.0 but apparently is not included in any OpenShift version that ABI installation can use.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Went through a loop over the different commits to verify whether this is delivered in any OCP version. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem: https://github.com/openshift/installer/pull/7727 changed the order of some playbooks, and we're expected to run the network.yaml playbook before the metadata.json file is created. This isn't a problem with newer versions of Ansible, which happily ignore missing var_files; however, older Ansible versions fail with:
[cloud-user@installer-host ~]$ ansible-playbook -i "/home/cloud-user/ostest/inventory.yaml" "/home/cloud-user/ostest/network.yaml" PLAY [localhost] ***************************************************************************************************************************************************************************************************************************** ERROR! vars file metadata.json was not found Could not find file on the Ansible Controller. If you are using a module and expect the file to exist on the remote, see the remote_src option
Description of problem:
When "Create NetworkAttachmentDefinition" button is clicked, the app switches to "Administrator" perspective
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Switch to "Virtualization" perspective 2. Navigate to Network -> NetworkAttachmentDefinitions 3. Click "Create NetworkAttachmentDefinition" button
Actual results:
App switches to "Administrator" perspective
Expected results:
App stays in "Virtualization" perspective
Additional info:
Description of the problem:
FYI - OCP 4.12 has reached end of maintenance support; it is now on extended support.
Looks like OCP 4.12 installations started failing lately due to hosts not discovering. For example - https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_assisted-service/6628/pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-4-12/1817416612257468416
How reproducible:
Seems like every CI run, haven't tested locally
Steps to reproduce:
Trigger OCP 4.12 installation in the CI
Actual results:
failure, hosts not discovering
Expected results:
Successful cluster installation
Description of problem:
We were told that adding connections to a Transit Gateway also costs an exorbitant amount of money. So the create option tgName now means that we will not clean up those connections during cluster destroy.
Description of problem:
We missed the window to merge the ART 4.17 image PR in time.
Version-Release number of selected component (if applicable):
How reproducible:
Fail to get ART PR merged in time
Steps to Reproduce:
1. Have E2E Tests fail for a while. 2. Go on vacation afterwards.
Actual results:
I got asked about 4.17 OCP images.
Expected results:
I don't get asked about 4.17 OCP images.
Additional info:
Description of problem:
We identified a regression where we can no longer get oauth tokens for HyperShift v4.16 clusters via the OpenShift web console. v4.16.10 works fine, but once clusters are patched to v4.16.16 (or are created at that version) they fail to get the oauth token. This is due to this faulty PR: https://github.com/openshift/hypershift/pull/4496. The oauth openshift deployment was changed and affected the IBM Cloud code path. We need this endpoint to change back to using `socks5`. Bug (diff of the oauth openshift deployment):
< value: socks5://127.0.0.1:8090
---
> value: http://127.0.0.1:8092
98c98
< value: socks5://127.0.0.1:8090
---
> value: http://127.0.0.1:8092
Fix: Change http://127.0.0.1:8092 to socks5://127.0.0.1:8090
Version-Release number of selected component (if applicable):
4.16.16
How reproducible:
Every time.
Steps to Reproduce:
1. Create ROKS v4.16.16 HyperShift-based cluster. 2. Navigate to the OpenShift web console. 2. Click IAM#<username> menu in the top right. 3. Click 'Copy login command'. 4. Click 'Display token'.
Actual results:
Error getting token: Post "https://example.com:31335/oauth/token": http: server gave HTTP response to HTTPS client
Expected results:
The oauth token should be successfully displayed.
Additional info:
Description of problem:
Day2 add-node with the oc binary is not working for ARM64 on baremetal CI runs
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Run a compact agent installation on the arm64 platform 2. After the cluster is ready, run the day2 install 3. The day2 install fails with an error: worker-a-00 is not reachable
Actual results:
Day2 install exit with error.
Expected results:
Day2 install should work
Additional info:
Job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/54181/rehearse-54181-periodic-ci-openshift-openshift-tests-private-release-4.17-arm64-nightly-baremetal-compact-abi-ipv4-static-day2-f7/1823641309190033408 Error message from console when running day2 install: rsync: [sender] link_stat "/assets/node.x86_64.iso" failed: No such file or directory (2) command terminated with exit code 23 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1823) [Receiver=3.2.3] rsync: [Receiver] write error: Broken pipe (32) error: exit status 23 {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-08-13T14:32:20Z"} error: failed to execute wrapped command: exit status 1
The /boot/efi and /sysroot directories and their subfiles are labeled unlabeled_t
Description of problem:
When adding nodes, the agent-register-cluster.service and start-cluster-installation.service statuses should not be checked; in their place, agent-import-cluster.service and agent-add-node.service should be checked.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
The console message shows that the start-cluster-installation and agent-register-cluster services have not started
Expected results:
The console message shows that the agent import cluster and add host services have started
Additional info:
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/116
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When the openshift-install agent wait-for bootstrap-complete command cannot connect to either the k8s API or the assisted-service API, it tries to ssh to the rendezvous host to see if it is up.
If there is a running ssh-agent on the local host, we connect to it to make use of its private keys. This is not guaranteed to work, as the private key corresponding to the public key in the agent ISO may not be present on the box.
If there is no running ssh-agent, we use the literal public key as the path to a file that we expect to contain the private key. This is guaranteed not to work.
All of this generates a lot of error messages at DEBUG level that are confusing to users.
If we did succeed in ssh-ing to the host when it has already joined the cluster, the node would end up tainted as a result, which we want to avoid. (This is unlikely in practice though, because by the time the rendezvous host joins, the k8s API should be up so we wouldn't normally run this code at that time.)
We should stop doing all of this, and maybe just ping the rendezvous host to see if it is up.
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535
Description of problem:
INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision... E0819 14:17:33.676051 2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" E0819 14:17:33.708233 2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" I0819 14:17:33.708279 2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
Description of problem:
Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented. On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power. Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful. [1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371
Version-Release number of selected component (if applicable):
Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions
How reproducible:
Always
Steps to Reproduce:
1. Deploy SNO node using ACM and fakefish as redfish interface 2. Check metal3-ironic pod logs
Actual results:
We can see a soft power_off command sent to the ironic agent running on the ramdisk: 2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197 2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234
Expected results:
There is an option to prevent this soft power_off command, so all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.
Additional info:
Looks relatively new in serial jobs on aws and vsphere. First occurrence I see is Wednesday at around 5am. It's not every run but it is quite common. (10-20% of the time)
Caught by test: Undiagnosed panic detected in pod
Undiagnosed panic detected in pod expand_less 0s { pods/openshift-ovn-kubernetes_ovnkube-control-plane-558bfbcf78-nfbnw_ovnkube-cluster-manager_previous.log.gz:E1106 08:04:15.797587 1 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}
See component readiness for more runs:
Please review the following PR: https://github.com/openshift/csi-operator/pull/81
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If an invalid mac address is used in the interfaces table in agent-config.yaml, like this {noformat} - name: eno2 macAddress: 98-BE-94-3F-48-42 {noformat} it results in the failing to register the Infraenv with assisted-service and constant retries {noformat} Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=info msg="Registering infraenv" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Reference to cluster id: 1f38e4c9-afde-4ac0-aa32-aabc75ec088a" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Registering infraenv" Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=info msg="Added 1 nmstateconfigs" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Added 1 nmstateconfigs" Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=fatal msg="Failed to register infraenv with assisted-service: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 422): {}" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=fatal msg="Failed to register infraenv with assisted-service: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 422): {}" {noformat} The error above was in 4.15. In 4.18 I can duplicate it and its only marginally better. There is slightly more info due to an assisted-service change, but same net result of retrying continually on the Registering infraenv" {noformat} Sep 11 20:57:26 master-0 agent-register-infraenv[3013]: time="2024-09-11T20:57:26Z" level=fatal msg="Failed to register infraenv with assisted-service: json: cannot unmarshal number into Go struct field Error.code of type string" Sep 11 20:57:26 master-0 podman[2987]: time="2024-09-11T20:57:26Z" level=fatal msg="Failed to register infraenv with assisted-service: json: cannot unmarshal number into Go struct field Error.code of type string" {noformat}
Version-Release number of selected component (if applicable):
Occurs both in latest 4.18 and in 4.15.26
How reproducible:
Steps to Reproduce:
1. Use an invalid mac address in the interface table like this {noformat} interfaces: - name: eth0 macAddress: 00:59:bd:23:23:8c - name: eno12399np0 macAddress: 98-BE-94-3F-51-33 networkConfig: interfaces: - name: eno12399np0 type: ethernet state: up ipv4: enabled: false dhcp: false ipv6: enabled: false dhcp: false - name: eth0 type: ethernet state: up mac-address: 00:59:bd:23:23:8c ipv4: enabled: true address: - ip: 192.168.111.80 prefix-length: 24 dhcp: false {noformat} 2. Generate the agent ISO 3. Install using the agent ISO, I just did an SNO installation.
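For contrast, a minimal sketch of the interfaces table showing the accepted and rejected MAC formats (names and addresses taken from the step above):

interfaces:
- name: eth0
  macAddress: 00:59:bd:23:23:8c    # colon-separated form, accepted
- name: eno12399np0
  macAddress: 98-BE-94-3F-51-33    # dash-separated form, rejected by assisted-service with a 422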
Actual results:
Install fails with the errors: {noformat} level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API {noformat}
Expected results:
The invalid mac address should be detected when creating the ISO image so it can be fixed.
Additional info:
The following test is failing:
[sig-api-machinery] ValidatingAdmissionPolicy [Privileged:ClusterAdmin] should type check a CRD [Suite:openshift/conformance/parallel] [Suite:k8s]
Additional context here:
This was a problem back in 4.16 when the test had Beta in the name. https://issues.redhat.com/browse/OCPBUGS-30767
But the test continues to be quite flaky and we just got unlucky and failed a payload on it.
The failure always seems to be:
{ fail [k8s.io/kubernetes/test/e2e/apimachinery/validatingadmissionpolicy.go:380]: wait for type checking: PatchOptions.meta.k8s.io "" is invalid: fieldManager: Required value: is required for apply patch Error: exit with code 1 Ginkgo exit error 1: exit with code 1}
It often works on a re-try. (flakes)
Something is not quite right either with this test or the product.
Description of problem:
cluster-capi-operator is running its controllers on AzureStackCloud, and it shouldn't, because CAPI is not supported on AzureStackCloud.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
When removing a spoke BMH resource from the hub cluster, the node is being shut down. Previously, the BMH was just removed and the node wasn't affected in any way. It seems to be due to new behavior in the BMH finalizer that removes the paused annotation from the BMH.
How reproducible:
100%
Steps to reproduce:
1. Install a spoke cluster
2. Remove one of the spoke cluster BMHs from the hub cluster
Actual results:
The corresponding node is shut down
Expected results:
The corresponding node is not shut down
Description of problem:
Same as the admin console bug OCPBUGS-31931, but on the developer console. On a 4.15.17 cluster, a kubeadmin user goes to the developer console UI, clicks "Observe", selects one project (for example openshift-monitoring), selects the Silences tab, clicks "Create silence"; the Creator field is not auto-filled with the user name. Add a label name/value and a Comment to create the silence.
will see error on page
An error occurred createdBy in body is required
see picture: https://drive.google.com/file/d/1PR64hvpYCC-WOHT1ID9A4jX91LdGG62Y/view?usp=sharing
this issue exists in 4.15/4.16/4.17/4.18, no issue with 4.14
Version-Release number of selected component (if applicable):
4.15.17
How reproducible:
always
Steps to Reproduce:
see the description
Actual results:
The Creator field is not auto-filled with the user name
Expected results:
no error
Additional info:
This action returns an empty/blank page
Description of problem:
Filter dropdown doesn't collapse on second click
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-21-132049
How reproducible:
Always
Steps to Reproduce:
1. Navigate to Workloads -> Pod page 2. Click the 'Filter' dropdown component 3. Click the 'Filter' dropdown again
Actual results:
Compared with OCP 4.17, where the dropdown list could be collapsed by a second click, on OCP 4.18 the dropdown list cannot be collapsed
Expected results:
The dropdown should collapse after a second click
Additional info:
Description of problem:
We should add validation in the Installer when public-only subnets is enabled to make sure that: 1. A warning is printed if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set. 2. Since this flag is only applicable to public clusters, we could consider exiting earlier if publish: Internal. 3. Since this flag is only applicable to byo-vpc configurations, we could consider exiting earlier if no subnets are provided in the install-config. See the sketch below.
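A minimal sketch of the install-config.yaml fields the proposed validations would inspect (region and subnet ID are placeholders; OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY itself is an environment variable, not an install-config field):

publish: External              # public-only is not applicable when publish: Internal
platform:
  aws:
    region: us-east-1          # placeholder region
    subnets:                   # byo-vpc: public-only requires existing subnets here
    - subnet-0123456789abcdef0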
Version-Release number of selected component (if applicable):
all versions that support public-only subnets
How reproducible:
always
Steps to Reproduce:
1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY 2. Do a cluster install without specifying a VPC. 3.
Actual results:
No warning about the invalid configuration.
Expected results:
Additional info:
This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.
Description of problem:
Create an image pull secret with whitespace at the beginning/end of the username and password, then decode the auth in the secret's '.dockerconfigjson'; it still contains whitespace in the password.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-07-29-134911
How reproducible:
Always
Steps to Reproduce:
1.Create image pull secret with whitespace in the beginning/end of username and password, eg: ' testuser ',' testpassword ' 2.Check on the secret details page, reveal values of ".dockerconfigjson", decode the value of 'auth'. 3.
Actual results:
1. The secret is created. 2. There is no whitespace in the displayed values for username and password, but the decoded 'auth' contains whitespace in the password. $ echo 'dGVzdHVzZXI6ICB0ZXN0cGFzc3dvcmQgIA==' | base64 -d testuser: testpassword
Expected results:
1. The password should not contain whitespace after decoding auth, e.g. testuser:testpassword
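For reference, a sketch of the structure the decoded .dockerconfigjson is expected to hold after trimming, shown as YAML for readability (the registry host is a placeholder; the actual secret stores this structure as JSON):

auths:
  registry.example.com:                    # placeholder registry host
    username: testuser                     # leading/trailing whitespace trimmed
    password: testpassword                 # leading/trailing whitespace trimmed
    auth: dGVzdHVzZXI6dGVzdHBhc3N3b3Jk     # base64("testuser:testpassword"), no embedded whitespace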
Additional info:
Description of problem:
The rails example "rails-postgresql-example" no longer runs successfully, because it references a version of ruby that is not available in the library. This is blocking the release of Samples Operator because we check the validity of the templates shipped with the operator. Rails sample is no longer supported by the Samples Operator but is still shipped in an old version. I.e. we just continue shipping the same old version of the sample across releases. This old version references ruby that is no longer present in the openshift library. There are a couple of ways of solving this problem: 1. Start supporting the Rails sample again in Samples Operator (the Rails examples seem to be maintained and made also available through helm-charts). 2. Remove the test that makes sure rails example is buildable to let the test suite pass. We don't support rails anymore in the Samples Operator so this should not be too surprising. 3. Remove rails from the Samples Operator altogether. This is probably the cleanest solution but most likely requires more work than just removing the sample from the assets of Samples Operator (removing the failing test is the most obvious thing that would break, too). We need to decide ASAP how to proceed to unblock the release of Samples Operator for OCP 4.17.
Version-Release number of selected component (if applicable):
How reproducible:
The Samples Operator testsuite runs these tests and results in a failure like this: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-samples-operator/567/pull-ci-openshift-cluster-samples-operator-master-e2e-aws-ovn-image-ecosystem/1829111792509390848
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The test in question fails here: https://github.com/openshift/origin/blob/master/test/extended/image_ecosystem/s2i_ruby.go#L59 The line in the test output that stands out: I0829 13:02:24.241018 3111 dump.go:53] At 2024-08-29 13:00:21 +0000 UTC - event for rails-postgresql-example: {buildconfig-controller } BuildConfigInstantiateFailed: error instantiating Build from BuildConfig e2e-test-s2i-ruby-q75fj/rails-postgresql-example (0): Error resolving ImageStreamTag ruby:3.0-ubi8 in namespace openshift: unable to find latest tagged image
Please review the following PR: https://github.com/openshift/csi-operator/pull/271
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/107
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When configuring the OpenShift image registry to use a custom Azure storage account in a different resource group, following the official documentation [1], the image-registry CO degrades and the upgrade from version 4.14.x to 4.15.x fails. The image registry operator reports misconfiguration errors related to Azure storage credentials, preventing the upgrade and causing instability in the control plane.
[1] Configuring registry storage in Azure user infrastructure
Version-Release number of selected component (if applicable):
4.14.33, 4.15.33
How reproducible:
Steps to Reproduce:
We got the error
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: client misconfigured, missing 'TenantID', 'ClientID', 'ClientSecret', 'FederatedTokenFile', 'Creds', 'SubscriptionID' option(s)
The operator will also generate a new secret image-registry-private-configuration with the same content as image-registry-private-configuration-user
$ oc get secret image-registry-private-configuration -o yaml apiVersion: v1 data: REGISTRY_STORAGE_AZURE_ACCOUNTKEY: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: imageregistry.operator.openshift.io/checksum: sha256:524fab8dd71302f1a9ade9b152b3f9576edb2b670752e1bae1cb49b4de992eee creationTimestamp: "2024-09-26T19:52:17Z" name: image-registry-private-configuration namespace: openshift-image-registry resourceVersion: "126426" uid: e2064353-2511-4666-bd43-29dd020573fe type: Opaque
2. then we delete the secret image-registry-private-configuration-user
Now the secret image-registry-private-configuration still exists with the same content, but the image-registry CO reports a new error
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account arojudesa: storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Storage/storageAccounts/arojudesa' under resource group 'aro-ufjvmbl1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
3. Apply the workaround of manually changing the installer-cloud-credentials secret's azure_resourcegroup key to the custom storage account's resource group
$ oc get secret installer-cloud-credentials -o yaml apiVersion: v1 data: azure_client_id: xxxxxxxxxxxxxxxxx azure_client_secret: xxxxxxxxxxxxxxxxx azure_region: xxxxxxxxxxxxxxxxx azure_resource_prefix: xxxxxxxxxxxxxxxxx azure_resourcegroup: xxxxxxxxxxxxxxxxx <<<<<-----THIS azure_subscription_id: xxxxxxxxxxxxxxxxx azure_tenant_id: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: cloudcredential.openshift.io/credentials-request: openshift-cloud-credential-operator/openshift-image-registry-azure creationTimestamp: "2024-09-26T16:49:57Z" labels: cloudcredential.openshift.io/credentials-request: "true" name: installer-cloud-credentials namespace: openshift-image-registry resourceVersion: "133921" uid: d1268e2c-1825-49f0-aa44-d0e1cbcda383 type: Opaque
The image registry reports healthy and this helps the upgrade continue
Actual results:
The image registry still seems to use the service principal method for Azure storage account authentication
Expected results:
We expect REGISTRY_STORAGE_AZURE_ACCOUNTKEY to be the only thing the image registry operator needs for storage account authentication when the customer provides it
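A minimal sketch of the user-provided pieces involved, following the documented user-infrastructure flow referenced above (account, container, and key values are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: image-registry-private-configuration-user
  namespace: openshift-image-registry
stringData:
  REGISTRY_STORAGE_AZURE_ACCOUNTKEY: <custom-storage-account-key>
---
# configs.imageregistry.operator.openshift.io/cluster, relevant storage fields only
spec:
  storage:
    azure:
      accountName: <custom-storage-account>   # lives in a different resource group
      container: <container-name>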
Additional info:
Slack : https://redhat-internal.slack.com/archives/CCV9YF9PD/p1727379313014789
Description of problem:
The installer for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%, though dependent on the order of subnets returned by the IBM Cloud APIs
Steps to Reproduce:
1. Create 50+ IBM Cloud VPC Subnets 2. Use Bring Your Own Network (BYON) configuration (with Subnet names for CP and/or Compute) in install-config.yaml 3. Attempt to create manifests (openshift-install create manifests)
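A minimal sketch of the BYON subnet fields from step 2, using the subnet names that appear in the error below (field layout per the install-config platform.ibmcloud section):

platform:
  ibmcloud:
    region: eu-de
    controlPlaneSubnets:
    - eu-de-subnet-paginate-1-cp-eu-de-1
    - eu-de-subnet-paginate-1-cp-eu-de-2
    - eu-de-subnet-paginate-1-cp-eu-de-3
    computeSubnets:
    - eu-de-subnet-paginate-1-compute-eu-de-1
    - eu-de-subnet-paginate-1-compute-eu-de-2
    - eu-de-subnet-paginate-1-compute-eu-de-3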
Actual results:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-1", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-2", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-3", platform.ibmcloud.controlPlaneSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-cp-eu-de-1", "eu-de-subnet-paginate-1-cp-eu-de-2", "eu-de-subnet-paginate-1-cp-eu-de-3"}: number of zones (0) covered by controlPlaneSubnets does not match number of provided or default zones (3) for control plane in eu-de, platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-1", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-2", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-3", platform.ibmcloud.computeSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-compute-eu-de-1", "eu-de-subnet-paginate-1-compute-eu-de-2", "eu-de-subnet-paginate-1-compute-eu-de-3"}: number of zones (0) covered by computeSubnets does not match number of provided or default zones (3) for compute[0] in eu-de]
Expected results:
Successful manifests and cluster creation
Additional info:
IBM Cloud is working on a fix
Please review the following PR: https://github.com/openshift/csi-operator/pull/243
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/71
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Recently, the sos package was added to the tools image used when invoking oc debug node/<some-node> (details in z).
However, the change just added the sos package without taking into account the other conditions required for sos report to work inside a container.
For reference, the toolbox container has to be launched as follows for sos report to work properly (the command output gives you the template of the right podman run command):
$ podman inspect registry.redhat.io/rhel9/support-tools | jq -r '.[0].Config.Labels.run'
podman run -it --name NAME --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=NAME -e IMAGE=IMAGE -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host IMAGE
The most crucial thing is the HOST=/host environment variable, which makes sos report find the real root of the machine in /host, but the other ones are also required.
So if we are to support sos report in the tools image, the debug node container defaults should be changed so that the container runs with the same settings as in the reference podman run command indicated above.
4.16 only
Always
Start a debug node container (oc debug node/<node>) and try to gather sos report (without chroot /host + toolbox, just from debug container).
(none)
Description of the problem:
When trying to add a node on day2 using assisted-installer, the node reports that the disk is not eligible as an installation disk:
Thread: https://redhat-external.slack.com/archives/C05N3PY1XPH/p1731575515647969
Possible issue: https://github.com/openshift/assisted-service/blob/master/internal/hardware/validator.go#L117-L120 => the openshift version is not filled on day2
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
When verifying the ACM Alerting UI feature, following the doc, we face the issue 'silence alert action link has a bad format compared to CMO's same action'.
Description of problem:
Camel K provides a list of Kamelets that are able to act as an event source or sink for a Knative eventing message broker. Usually the list of Kamelets installed with the Camel K operator are displayed in the Developer Catalog list of available event sources with the provider "Apache Software Foundation" or "Red Hat Integration". When a user adds a custom Kamelet custom resource to the user namespace the list of default Kamelets coming from the Camel K operator is gone. The Developer Catalog event source list then only displays the custom Kamelet but not the default ones.
Version-Release number of selected component (if applicable):
How reproducible:
Apply a custom Kamelet custom resource to the user namespace and open the list of available event sources in Dev Console Developer Catalog.
Steps to Reproduce:
1. install global Camel K operator in operator namespace (e.g. openshift-operators) 2. list all available event sources in "default" user namespace and see all Kamelets listed as event sources/sinks 3. add a custom Kamelet custom resource to the default namespace 4. see the list of available event sources only listing the custom Kamelet and the default Kamelets are gone from that list
Actual results:
Default Kamelets that act as event source/sink are only displayed in the Developer Catalog when there is no custom Kamelet added to a namespace.
Expected results:
Default Kamelets coming with the Camel K operator (installed in the operator namespace) should always be part of the Developer Catalog list of available event sources/sinks. When the user adds more custom Kamelets these should be listed, too.
Additional info:
Reproduced with Camel K operator 2.2 and OCP 4.14.8
screenshots: https://drive.google.com/drive/folders/1mTpr1IrASMT76mWjnOGuexFr9-mP0y3i?usp=drive_link
Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/231
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-gcp-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Compile errors when building an Ironic image look like this:
2024-08-14 09:07:21 + python3 -m compileall --invalidation-mode=timestamp /usr 2024-08-14 09:07:21 Listing '/usr'... 2024-08-14 09:07:21 Listing '/usr/bin'... ... Listing '/usr/share/zsh/site-functions'... Listing '/usr/src'... Listing '/usr/src/debug'... Listing '/usr/src/kernels'... Error: building at STEP "RUN prepare-image.sh && rm -f /bin/prepare-image.sh && /bin/prepare-ipxe.sh && rm -f /tmp/prepare-ipxe.sh": while running runtime: exit status 1
With the actual error lost in 3000+ lines of output, we should suppress the file listings.
Description of problem:
I see that when one release is declared in the ImageSetConfig.yaml everything works well with respect to creating release signature configmap, but when more than one release is added to ImageSetConfig.yaml i see that binaryData content in the signature configmap is duplicated and there is more than specified releases present in the signatures directory. See below ImageSetConfig.yaml: ================= [fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-232.yaml apiVersion: mirror.openshift.io/v2alpha1 kind: ImageSetConfiguration mirror: platform: channels: - name: stable-4.16 minVersion: 4.16.0 maxVersion: 4.16.0 - name: stable-4.15 minVersion: 4.15.0 maxVersion: 4.15.0 Content in Signatures directory: ========================= [fedora@preserve-fedora-yinzhou test]$ ls -l CLID-232/working-dir/signatures/ total 12 -rw-r--r--. 1 fedora fedora 896 Sep 25 11:27 4.15.0-x86_64-sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363 -rw-r--r--. 1 fedora fedora 897 Sep 25 11:27 4.15.31-x86_64-sha256-c03bbdd63fa8832266a2cf0d9fbcd2867692d9ba7e09d31bc77d15dd9903e36f -rw-r--r--. 1 fedora fedora 899 Sep 25 11:27 4.16.0-x86_64-sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3 Content in Signature Configmap: ========================== apiVersion: v1 binaryData: sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363-2: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphboVxQEWxbl6SZl5qXe29lQrJRdllmQmJ+YoWSlUK2XmJqanglkp+cnZqUW6uYl5mWmpxSW6KZnpQAoopVSckWhkamZlkJJoZmxoZmJmlmJmkGicaGJqbJpimpaaYmxqYZmWmmRgbGloaWFpaWGUlpRoapliYp5ikmxuZGlpaJRiZmxmrFSro6BUUlkAsk4psSQ/NzNZITk/ryQR6LAiBaBr8xJLSotSlYCqMlNS80oySyqRHVaUmpZalJqXDNZeWJpYqZeZr59fkJpXnJGZVgKUzklNLE7VTUkt089PLoDxrUz0DE31DHQrLMzizUyUakFuyC8oyczPgwZAclEq0C1FIEODUlMUPBJLFPyBhgaDDFUIBjoqMy9dwbG0JCMfGGyVCgZ6BnqGQGM6mWRYGBg5GNhYmUChysDFKQCLgT4zAYZeps1bfryz7j15qafOW3Dqwv8q1gUhm2eahBm6BEgZRp1fNN1LJEQi0PW1qVrTmQnusy7Pq/t2qcrj83LOh7b7uhMlL7AF3j6QM/HdoTTFaZsulu3qm/FU7SCTwhUH+WsaJw2/l2/bpKDEmvI29TPTCs0pJrFt1UGds0OXeuZf/Pvo9Y8WWw/7sA0lrA0daz6Ef9RdPsGdU+SDpjCrRuai8oIbavb9Fz22FvYv/eMk/dv26L6MPqaU1R56Sz8LVJQ1XQrk3Dzl+THGVZ97BOS0znjwn/RLvsNvc/8V8w39xV/XuhvskLMXfjPp5pErMtbKPMte5krmeEefy5uWvyi9dUPesedH/ey8l894t/RM1odKsaZwtx2X8tecb/eZGsd64P/c77cOnYiX62POMY+L2Xom4bVk5DnDncrKsictr/4yDjnO5Heg0uHN6k1rkv88Ez5yy+HU009+V1l3eFUfVVhfahQS/5trr3JrtIvKFln+s9L17+9brQp10wtkeTqt5OOZrftY7Nqk1mcLejxanF7uyHvSIj+vUPDZhk4GU+MAZ4a3zCfSdeb2l4REqdRwVhoXf7u9/6qnYf79L2IOHE4RzOVbghwsXgWa3T715rLQwT7e/SuYBYqWf87c+CFw0/QTPg3vmI/G/qhaKvLf3sy7U+N2TVDe9OUqj0/wvBI/yOV0y0Mpet1ZRt+zH9tllRMkH60PSd23EAA= sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363-3: 
owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphboVxQEWxbl6SZl5qXe29lQrJRdllmQmJ+YoWSlUK2XmJqanglkp+cnZqUW6uYl5mWmpxSW6KZnpQAoopVSckWhkamZlkJJoZmxoZmJmlmJmkGicaGJqbJpimpaaYmxqYZmWmmRgbGloaWFpaWGUlpRoapliYp5ikmxuZGlpaJRiZmxmrFSro6BUUlkAsk4psSQ/NzNZITk/ryQR6LAiBaBr8xJLSotSlYCqMlNS80oySyqRHVaUmpZalJqXDNZeWJpYqZeZr59fkJpXnJGZVgKUzklNLE7VTUkt089PLoDxrUz0DE31DHQrLMzizUyUakFuyC8oyczPgwZAclEq0C1FIEODUlMUPBJLFPyBhgaDDFUIBjoqMy9dwbG0JCMfGGyVCgZ6BnqGQGM6mWRYGBg5GNhYmUChysDFKQCLgT4zAYZeps1bfryz7j15qafOW3Dqwv8q1gUhm2eahBm6BEgZRp1fNN1LJEQi0PW1qVrTmQnusy7Pq/t2qcrj83LOh7b7uhMlL7AF3j6QM/HdoTTFaZsulu3qm/FU7SCTwhUH+WsaJw2/l2/bpKDEmvI29TPTCs0pJrFt1UGds0OXeuZf/Pvo9Y8WWw/7sA0lrA0daz6Ef9RdPsGdU+SDpjCrRuai8oIbavb9Fz22FvYv/eMk/dv26L6MPqaU1R56Sz8LVJQ1XQrk3Dzl+THGVZ97BOS0znjwn/RLvsNvc/8V8w39xV/XuhvskLMXfjPp5pErMtbKPMte5krmeEefy5uWvyi9dUPesedH/ey8l894t/RM1odKsaZwtx2X8tecb/eZGsd64P/c77cOnYiX62POMY+L2Xom4bVk5DnDncrKsictr/4yDjnO5Heg0uHN6k1rkv88Ez5yy+HU009+V1l3eFUfVVhfahQS/5trr3JrtIvKFln+s9L17+9brQp10wtkeTqt5OOZrftY7Nqk1mcLejxanF7uyHvSIj+vUPDZhk4GU+MAZ4a3zCfSdeb2l4REqdRwVhoXf7u9/6qnYf79L2IOHE4RzOVbghwsXgWa3T715rLQwT7e/SuYBYqWf87c+CFw0/QTPg3vmI/G/qhaKvLf3sy7U+N2TVDe9OUqj0/wvBI/yOV0y0Mpet1ZRt+zH9tllRMkH60PSd23EAA= sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3-1: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphbqJSUGVJf56SZl5adU9WtVKyUWZJZnJiTlKVgrVSpm5iempYFZKfnJ2apFubmJeZlpqcYluSmY6kAJKKRVnJBqZmlkZmxuaGxtbGJiYpqQZmKUaG6ampaUmmpiZmxkmGSWbp1oapJmaGCcnm6QZGJiamCemWRiaWqSkmCWmJlqYWSQbK9XqKCiVVBaArFNKLMnPzUxWSM7PK0nMzEstUgC6Ni+xpLQoVQmoKjMlNa8ks6QS2WFFqWmpRal5yWDthaWJlXqZ+fr5Bal5xRmZaSVA6ZzUxOJU3ZTUMv385AIY38pEz9BMz0C3wsIs3sxEqRbkhvyCksz8PGgAJBelAt1SBDI0KDVFwSOxRMEfaGgwyFCFYKCjMvPSFRxLSzLygcFWqWCgZ6BnCDSmk0mGhYGRg4GNlQkUqgxcnAKwGNiSIcDQLFrmt8ZarfU0234jphipJx9PrVWR6Ne1P/lzlbnN1blfXt+UWnXz4NW1Ne/eHNI+vNpyxpe0VZozL1YKlMg+VCo+uul5S4t3L+8byXsmb98vdVy61TLumM+0Ta1WuikS3NfVlvPNLJ6y4+6qX74pz9pqnXbr32lxenH6btxcpW+C21ICAxd9tOkST7Vemn7kedPrOXyPCkQ5blZK1BdaPYndXcMZK3AsI7a4SqMsrvH2pNgVRU+X3z1t/umAHWv4FbZowW8zDnZtt1ov5215R/dtsXOw4fwEi5WtClM55h0908FyYOor+7/HI0qPZ3DsP8DPIZy4YOl38fb5PPOCTP8fm8t++erKN9mbAh7+Yo90eO8urXuho6OitC3hcIjpf9HiSBMl13fOt6MEF7zsn7Zj5oI7x5Y2Hr6ys/RNxnPZgjlh/pkdr7OccxM2zLFvXTN7b7n0r3dq277/LvuYl+l+e16u18bpMbmZu2VtkmYY31h94+uCaN3I43tbJLmXTtly97Yyc23LrtxtK7PM5K4oSd0oMJ7zaN3Ssr0bEo8GFIT7m9eY/3leG/76McPKO5uDHji8zpWUnfNyv2L315RVXc+usYuwf/v81PvHlz3Vt/49PTFNILy04pjQv788culLEi1edk2amaH5zTfBvN407aP4i6NzPwi98O5nac/cHbZLzDEw4iXjpHWsuWZPzJhyNF3myQb3SlQ7AQA= sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3-5: 
owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphbqJSUGVJf56SZl5adU9WtVKyUWZJZnJiTlKVgrVSpm5iempYFZKfnJ2apFubmJeZlpqcYluSmY6kAJKKRVnJBqZmlkZmxuaGxtbGJiYpqQZmKUaG6ampaUmmpiZmxkmGSWbp1oapJmaGCcnm6QZGJiamCemWRiaWqSkmCWmJlqYWSQbK9XqKCiVVBaArFNKLMnPzUxWSM7PK0nMzEstUgC6Ni+xpLQoVQmoKjMlNa8ks6QS2WFFqWmpRal5yWDthaWJlXqZ+fr5Bal5xRmZaSVA6ZzUxOJU3ZTUMv385AIY38pEz9BMz0C3wsIs3sxEqRbkhvyCksz8PGgAJBelAt1SBDI0KDVFwSOxRMEfaGgwyFCFYKCjMvPSFRxLSzLygcFWqWCgZ6BnCDSmk0mGhYGRg4GNlQkUqgxcnAKwGNiSIcDQLFrmt8ZarfU0234jphipJx9PrVWR6Ne1P/lzlbnN1blfXt+UWnXz4NW1Ne/eHNI+vNpyxpe0VZozL1YKlMg+VCo+uul5S4t3L+8byXsmb98vdVy61TLumM+0Ta1WuikS3NfVlvPNLJ6y4+6qX74pz9pqnXbr32lxenH6btxcpW+C21ICAxd9tOkST7Vemn7kedPrOXyPCkQ5blZK1BdaPYndXcMZK3AsI7a4SqMsrvH2pNgVRU+X3z1t/umAHWv4FbZowW8zDnZtt1ov5215R/dtsXOw4fwEi5WtClM55h0908FyYOor+7/HI0qPZ3DsP8DPIZy4YOl38fb5PPOCTP8fm8t++erKN9mbAh7+Yo90eO8urXuho6OitC3hcIjpf9HiSBMl13fOt6MEF7zsn7Zj5oI7x5Y2Hr6ys/RNxnPZgjlh/pkdr7OccxM2zLFvXTN7b7n0r3dq277/LvuYl+l+e16u18bpMbmZu2VtkmYY31h94+uCaN3I43tbJLmXTtly97Yyc23LrtxtK7PM5K4oSd0oMJ7zaN3Ssr0bEo8GFIT7m9eY/3leG/76McPKO5uDHji8zpWUnfNyv2L315RVXc+usYuwf/v81PvHlz3Vt/49PTFNILy04pjQv788culLEi1edk2amaH5zTfBvN407aP4i6NzPwi98O5nac/cHbZLzDEw4iXjpHWsuWZPzJhyNF3myQb3SlQ7AQA= sha256-c03bbdd63fa8832266a2cf0d9fbcd2867692d9ba7e09d31bc77d15dd9903e36f-4: owGbwMvMwMEoOU9/4l9n2UDGtYwpSWLxRQW5xZnpukWphbpZ+ZXhZuF6SZl5abcZJKuVkosySzKTE3OUrBSqlTJzE9NTwayU/OTs1CLd3MS8zLTU4hLdlMx0IAWUUirOSDQyNbNKNjBOSkpJMTNOS7SwMDYyMjNLNEpOM0ixTEtKTjGyMDM3szRKsUxKNE81sEwxNkxKNjdPMTRNSbG0NDBONTZLU6rVUVAqqSwAWaeUWJKfm5mskJyfV5KYmZdapAB0bV5iSWlRqhJQVWZKal5JZkklssOKUtNSi1LzksHaC0sTK/Uy8/XzC1LzijMy00qA0jmpicWpuimpZfr5yQUwvpWJnqGpnrGhboWFWbyZiVItyBH5BSWZ+XnQEEguSgU6pghkalBqioJHYomCP9DUYJCpCsFAV2XmpSs4lpZk5APDrVLBQM9AzxBoTCeTDAsDIwcDGysTKFgZuDgFYFHwQYP/r7TdX8MJrlqz/3tPL+rjsZNXsNwX8Vxgc++2GI5dkt4r1r1nmrfdcGVn8tVJMtzTrf7m6F+9v5m7uK54b18F3+1JS5ziwtOfTpSpZs1u4z41o2QHo3HJmQNum0OK5ywoMtB4s8Mh+YVo7FSN7Vpr8/fdkHDPmr/plNTxw5EByZreMicnzhWx1TX94bxkYf9X1gehhstDj5Vu+7G6VTv49O9yx+xah4XC4ccvGyj4y374ql1TcsZwscHEagvz1eeFey97Lkj6nX2y+MyjY3yvJMRbxEvZ/iS9W/+b4+zOGZmHdm6pymfO9104VY3JVeO2V3JvfvKi9KKmXh8xyf/lQlprjI52nomwOOSZfIpBLv7Ezf/r9wQ4Lt81dfuJlfO50uc5p5ybIMD3L6ZywY3EA1yvIkNllmkwCTgc9RDwf7hnqrpoxNeLP75tcY7ekplU3FymE1z7YMIli8Trp3c0VFTFHuibLcGn13Rvu0roraAZBpvXV7vL7mExXjJHaoJlenxeOIvZ85ksH29fe3Cp2lVCp8Kh1KjUeyZ7w8PJX/W0Ppp96TTwUPuXNi/ZxXSpxxy19trJysLbLi5In8sytTB08vRLarfc0hiVXgs7m6f0P7xyYpbzVPbZrPYHnRjfCS9ljFNamXL50KzN6T46hww81YT1W84kzvMNZd/M0B+auvfe758FLnyRM3zfrJ43n2tbF1P3Ph7tqngA kind: ConfigMap metadata: labels: release.openshift.io/verification-signatures: "" namespace: openshift-config-managed
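For readability, the ImageSetConfiguration embedded in the description above, re-indented:

apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.16
      minVersion: 4.16.0
      maxVersion: 4.16.0
    - name: stable-4.15
      minVersion: 4.15.0
      maxVersion: 4.15.0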
Version-Release number of selected component (if applicable):
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-298-ga5a32fa", GitCommit:"a5a32fa3", GitTreeState:"clean", BuildDate:"2024-09-25T08:22:44Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. clone oc-mirror repo, cd oc-mirror, run make build 2. Now use the imageSetConfig.yaml present above and run mirror2disk & disk2mirror commands 3. oc-mirror -c /tmp/clid-232.yaml file://CLID-232 --v2 ; oc-mirror -c /tmp/clid-232.yaml --from file://CLID-232 docker://localhost:5000/clid-232 --dest-tls-verify=false --v2
Actual results:
1. See that the signatures directory contains more releases than expected, as shown in the description. 2. Also see that binaryData is duplicated in the signatureconfigmap.yaml.
Expected results:
1. Should only see the releases that are defined in the imageSetConfig.yaml in the signatures directory 2. Should not see any duplication of binaryData in the signatureconfigmap.yaml file.
Additional info:
The duplication of controllers for hostedcontrolplane v2 has caused some technical debt.
The new controllers are now out of sync with their v1 counterparts.
For example:
control-plane-operator/controllers/hostedcontrolplane/v2/cloud_controller_manager/openstack/config.go is missing a feature that was merged after the v2 controller was merged, so it's out of sync.
Description of problem:
Inspection is failing on hosts which special characters found in serial number of block devices: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: 2024-07-03 09:16:11.325 1 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory'....'error': "The following errors were encountered:\n* collector logs failed: 'utf-8' codec can't decode byte 0xff in position 12: invalid start byte"} call_inspector /usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py:128 Serial found: "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff" Interesting stacktrace error: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Full stack trace: ~~~ Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: 2024-07-03 09:16:11.628 1 DEBUG oslo_concurrency.processutils [-] CMD "lsblk -bia --json -oKNAME,MODEL,SIZE,ROTA,TYPE,UUID,PARTUUID,SERIAL" returned: 0 in 0.006s e xecute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422 Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/bin/ironic-python-agent", line 10, in <module> Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: sys.exit(run()) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: agent.IronicPythonAgent(CONF.api_url, Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 485, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.process_lookup_data(content) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 400, in process_lookup_data Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: hardware.cache_node(self.node) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3179, in cache_node Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: dispatch_to_managers('wait_for_disks') Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: return 
getattr(manager, method)(*args, **kwargs) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 997, in wait_for_disks Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.get_os_install_device() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1518, in get_os_install_device Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices_check_skip_list( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1495, in list_block_devices_check_skip_list Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1460, in list_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = list_all_block_devices() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 526, in list_all_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: report = il_utils.execute('lsblk', '-bia', '--json', Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 111, in execute Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: _log(result[0], result[1]) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 99, in _log Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: LOG.debug('Command stdout is: "%s"', stdout) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Message: 'Command stdout is: "%s"' Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Arguments: ('{\n "blockdevices": [\n {\n "kname": "loop0",\n "model": null,\n "size": 67467313152,\n "rota": false,\n "type": "loop",\n "uuid": "28f5ff52-7f5b-4e5a-bcf2-59813e5aef5a",\n "partuuid": null,\n "serial": null\n },{\n "kname": "loop1",\n "model": null,\n "size": 1027846144,\n "rota": false,\n "type": "loop",\n "uuid": null,\n "partuuid": null,\n "serial": null\n },{\n "kname": "sda",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdb",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdc",\n "model": "External",\n "size": 0,\n "rota": true,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"\n }\n ]\n}\n',) ~~~
Version-Release number of selected component (if applicable):
OCP 4.14.28
How reproducible:
Always
Steps to Reproduce:
1. Add a BMH with a bad utf-8 characters in serial 2. 3.
Actual results:
Inspection fail
Expected results:
Inspection works
Additional info:
Description of problem:
Selecting Add from the event modal in topology redirects to the Add page, but the event modal to add a trigger for a broker persists
Version-Release number of selected component (if applicable):
How reproducible:
Everytime
Steps to Reproduce:
1. Enable event option in config map of knative-eventing namespace 2. Create a broker and associate an event to it 3. In topology select add trigger for the broker 4. Since no service is created it will ask to go to Add page to create a service so select Add from the modal
Actual results:
The modal persists
Expected results:
The modal should be closed after the user is redirected to the Add page
Additional info:
Adding video of the issue
https://drive.google.com/file/d/16hMbtBj0GeqUOLnUdCTMeYR3exY84oEn/view?usp=sharing
Description of problem:
Rotating the root certificates (root CA) requires multiple certificates during the rotation process to prevent downtime as the server and client certificates are updated in the control and data planes. Currently, the HostedClusterConfigOperator uses the cluster-signer-ca from the control plane to create a kubelet-serving-ca on the data plane. The cluster-signer-ca contains only a single certificate that is used for signing certificates for the kube-controller-manager. During a rotation, the kubelet-serving-ca will be updated with the new CA which triggers the metrics-server pod to restart and use the new CA. This will lead to an error in the metrics-server where it cannot scrape metrics as the kubelet has yet to pick up the new certificate. E0808 16:57:09.829746 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.240.0.29:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="pres-cqogb7a10b7up68kvlvg-rkcpsms0805-default-00000130" rkc@rmac ~> kubectl get pods -n openshift-monitoring NAME READY STATUS RESTARTS AGE metrics-server-594cd99645-g8bj7 0/1 Running 0 2d20h metrics-server-594cd99645-jmjhj 1/1 Running 0 46h The HostedClusterConfigOperator should likely be using the KubeletClientCABundle from the control plane for the kubelet-serving-ca in the data plane. This CA bundle will contain both the new and old CA such that all data plane components can remain up during the rotation process.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The section is: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-arm-tested-machine-types_installing-aws-vpc
All tested ARM instances for 4.14+: c6g.* c7g.* m6g.* m7g.* r8g.*
We need to ensure all relevant sections include the "Tested instance types for AWS on 64-bit ARM infrastructures" section and that it has been updated for 4.14+.
Additional info:
In 4.17 the openshift installer will have the `create config iso` functionality (see epic). IBIO should stop implementing this logic; instead it should extract the openshift installer from the release image (already part of the ICI CR) and use it to create the configuration ISO.
Please review the following PR: https://github.com/openshift/route-controller-manager/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The console crashes when the user selects SSH as the Authentication type for the git server under add secret in the start pipeline form
Version-Release number of selected component (if applicable):
How reproducible:
Every time. Only in the Developer perspective and only if the Pipelines dynamic plugin is enabled.
Steps to Reproduce:
1. Create a pipeline through add flow and open start pipeline page 2. Under show credentials select add secret 3. In the secret form select `Access to ` as Git server and `Authentication type` as SSH key
Actual results:
Console crashes
Expected results:
UI should work as expected
Additional info:
Attaching console log screenshot
https://drive.google.com/file/d/1bGndbq_WLQ-4XxG5ylU7VuZWZU15ywTI/view?usp=sharing
Please review the following PR: https://github.com/openshift/csi-operator/pull/227
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Dualstack jobs beyond 4.13 (presumably when we added cluster-data.json) are miscategorized as NetworkStack = ipv4 because the code doesn't know how to detect dualstack: https://github.com/openshift/origin/blob/11f7ac3e64e6ee719558fc18d753d4ce1303d815/pkg/monitortestlibrary/platformidentification/types.go#L88
We have the ability to NOT override a variant calculated from jobname if cluster-data disagrees: https://github.com/openshift/sippy/blob/master/pkg/variantregistry/ocp.go#L181
We should fix origin, but we don't want to backport to five releases, so we should also update the variant registry to ignore this field in cluster data if the release is <= 4.18 (assuming that's where we fix this).
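As a rough illustration of the registry-side guard (not sippy's actual code; the type, field names, and release-parsing helper below are assumptions), the skip could look something like this:

~~~
// Minimal sketch: keep the job-name-derived NetworkStack variant for releases
// whose cluster-data.json cannot distinguish dualstack from ipv4.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// clusterData stands in for the parsed cluster-data.json payload.
type clusterData struct {
	NetworkStack string // "ipv4", "ipv6", or "dual" once origin is fixed
}

// releaseAtMost reports whether release (e.g. "4.17") is <= major.minor.
func releaseAtMost(release string, major, minor int) bool {
	parts := strings.SplitN(release, ".", 2)
	if len(parts) != 2 {
		return false
	}
	maj, errMaj := strconv.Atoi(parts[0])
	mnr, errMnr := strconv.Atoi(parts[1])
	if errMaj != nil || errMnr != nil {
		return false
	}
	return maj < major || (maj == major && mnr <= minor)
}

// networkStackVariant ignores cluster-data for releases where origin is known
// to report dualstack clusters as ipv4, and falls back to the job name.
func networkStackVariant(release, fromJobName string, cd clusterData) string {
	if releaseAtMost(release, 4, 18) {
		return fromJobName // ignore cluster-data until the origin fix lands
	}
	if cd.NetworkStack != "" {
		return cd.NetworkStack
	}
	return fromJobName
}

func main() {
	fmt.Println(networkStackVariant("4.16", "dual", clusterData{NetworkStack: "ipv4"}))
}
~~~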
It is worth supporting PATCH requests in the curl_assisted_service func.
E.g. for the appliance flow: https://github.com/openshift/appliance/blob/1c405b5cc722b29edcf4bb6bbe14e44d21a4c066/data/scripts/bin/update-hosts.sh.template#L29-L30
Description of the problem:
Looks like the nmstate service is enabled on the ARM machine.
ARM machine: (Run on CI job)
nvd-srv-17.nvidia.eng.rdu2.redhat.com
[root@worker-0-0 core]# cd /etc/nmstate/ [root@worker-0-0 nmstate]# cat cat catchsegv [root@worker-0-0 nmstate]# ls -l total 8 -rw-r--r--. 1 root root 95 Aug 1 2022 README -rw-------. 1 root root 804 Sep 24 12:36 ymlFile2.yml [root@worker-0-0 nmstate]# cat ymlFile2.yml capture: iface0: interfaces.mac-address == "52:54:00:82:6B:E0" desiredState: dns-resolver: config: server: - 192.168.200.1 interfaces: - ipv4: address: - ip: 192.168.200.53 prefix-length: 24 dhcp: false enabled: true name: "{{ capture.iface0.interfaces.0.name }}" type: ethernet state: up ipv6: address: - ip: fd2e:6f44:5dd8::39 prefix-length: 64 dhcp: false enabled: true routes: config: - destination: 0.0.0.0/0 next-hop-address: 192.168.200.1 next-hop-interface: "{{ capture.iface0.interfaces.0.name }}" table-id: 254 - destination: ::/0 next-hop-address: fd2e:6f44:5dd8::1 next-hop-interface: "{{ capture.iface0.interfaces.0.name }}" table-id: 254[root@worker-0-0 nmstate]#
[root@worker-0-0 nmstate]# systemctl status nmstate.service ● nmstate.service - Apply nmstate on-disk state Loaded: loaded (/usr/lib/systemd/system/nmstate.service; enabled; preset: enabled) Active: active (exited) since Tue 2024-09-24 12:40:05 UTC; 20min ago Docs: man:nmstate.service(8) https://www.nmstate.io Process: 3427 ExecStart=/usr/bin/nmstatectl service (code=exited, status=0/SUCCESS) Main PID: 3427 (code=exited, status=0/SUCCESS) CPU: 36ms Sep 24 12:40:03 worker-0-0 systemd[1]: Starting Apply nmstate on-disk state... Sep 24 12:40:03 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:03Z INFO nmstatectl] Nmstate version: 2.2.27 Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::ip] Static addresses fd2e:6f44:5dd8::39/64 defined when dynamic IP is enabled Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::ip] Static addresses fd2e:6f44:5dd8::39/64 defined when dynamic IP is enabled Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::query_apply::net_state] Created checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::query_apply::net_state] Rollbacked to checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z ERROR nmstatectl::service] Failed to apply state file /etc/nmstate/ymlFile2.yml: NmstateError: NotImplementedError: Autoconf without DHCP is not supported yet Sep 24 12:40:05 worker-0-0 systemd[1]: Finished Apply nmstate on-disk state. [root@worker-0-0 nmstate]# more /usr/lib/systemd/system/nmstate.service [Unit] Description=Apply nmstate on-disk state Documentation=man:nmstate.service(8) https://www.nmstate.io After=NetworkManager.service Before=network-online.target Requires=NetworkManager.service [Service] Type=oneshot ExecStart=/usr/bin/nmstatectl service RemainAfterExit=yes [Install] WantedBy=NetworkManager.service [root@worker-0-0 nmstate]#
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Gather the nodenetworkconfigurationpolicy.nmstate.io/v1 and nodenetworkstate.nmstate.io/v1beta1 cluster-scoped resources in the Insights data. These CRs are introduced by the NMState operator.
Description of problem:
A new 'Architecture' chart is added on the Metrics page for some resources, e.g. Deployments, StatefulSets, DaemonSets, and so on. The chart shows 'No datapoints found', which is not correct. The reported issue/question is: Q1. Should the 'Architecture' chart be listed on the Metrics page for those resources? Q2. If yes, it should not show 'No datapoints found'
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-08-075347
How reproducible:
Always
Steps to Reproduce:
1. Navigate to a resource details page, such as a StatefulSet or Deployment details page, and go to the Metrics tab, e.g. k8s/ns/openshift-monitoring/statefulsets/alertmanager-main/metrics 2. Check the new 'Architecture' chart 3.
Actual results:
A new 'Architecture' chart is listed on the Metrics page, and the chart returns 'No datapoints found'.
Expected results:
The 'Architecture' chart should not exist. If it is added by design, it should not return 'No datapoints found'.
Additional info:
For reference: I think the page is impacted by the PR https://github.com/openshift/console/pull/13718
Description of problem:
etcd-operator is using a JSON-based client for core object communication. Instead, it should use the protobuf version.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
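As a rough sketch of the kind of change implied here (not the operator's actual wiring), client-go can negotiate protobuf for built-in types by setting the content type on the rest.Config before building the clientset; protobuf must not be used for CRDs, which only support JSON:

~~~
// Minimal sketch: build a Kubernetes clientset that prefers protobuf for core
// (built-in) objects, falling back to JSON where protobuf is unsupported.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func protobufClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// Request protobuf responses and send protobuf bodies where possible.
	cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"
	cfg.ContentType = "application/vnd.kubernetes.protobuf"
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := protobufClient(); err != nil {
		fmt.Println("could not build client:", err)
	}
}
~~~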
Description of the problem:
When attempting to delete the agentserviceconfig, it gets stuck deleting on the `agentserviceconfig.agent-install.openshift.io/local-cluster-import-deprovision` finalizer.
The following error is reported by the infrastructure operator pod:
time="2024-09-03T12:57:17Z" level=info msg="AgentServiceConfig (LocalClusterImport) Reconcile started" time="2024-09-03T12:57:17Z" level=error msg="could not delete local cluster ClusterDeployment due to error failed to delete ClusterDeployment in namespace : resource name may not be empty" time="2024-09-03T12:57:17Z" level=error msg="failed to clean up local cluster CRs" error="failed to delete ClusterDeployment in namespace : resource name may not be empty" time="2024-09-03T12:57:17Z" level=info msg="AgentServiceConfig (LocalClusterImport) Reconcile ended" {"level":"error","ts":"2024-09-03T12:57:17Z","msg":"Reconciler error","controller":"agentserviceconfig","controllerGroup":"agent-install.openshift.io","controllerKind":"AgentServiceConfig","AgentServiceConfig":{"name":"agent"},"namespace":"","name":"agent","reconcileID":"470afd7d-ec86-4d45-818f-eb6ebb4caa3d","error":"failed to delete ClusterDeployment in namespace : resource name may not be empty","errorVerbose":"resource name may not be empty\nfailed to delete ClusterDeployment in namespace \ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).deleteClusterDeployment\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:250\ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).ensureLocalClusterCRsDeleted\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:333\ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).Reconcile\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1695","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
How reproducible:
100%
Steps to reproduce:
1. Delete AgentServiceConfig resource
Actual results:
The AgentServiceConfig isn't removed
Expected results:
The AgentServiceConfig is removed
Description of problem:
Using the latest main branch hypershift client to create a 4.15 hc, the capi provider crashed with the logs:
$ oc logs capi-provider-647f454bf-sqq9c Defaulted container "manager" out of: manager, token-minter, availability-prober (init) invalid argument "EKS=false,ROSA=false" for "--feature-gates" flag: unrecognized feature gate: ROSA Usage of /bin/cluster-api-provider-aws-controller-manager: invalid argument "EKS=false,ROSA=false" for "--feature-gates" flag: unrecognized feature gate: ROSA
Version-Release number of selected component (if applicable):
4.15 HC
How reproducible:
100%
Steps to Reproduce:
1. Just use the latest main CLI to create a public AWS 4.15 HC 2. 3.
Actual results:
capi-provider pod crashed
Expected results:
the 4.15 hc could be created successfully
Additional info:
probably related to
slack: https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1724249475037359
Description of problem:
Removing third party override of cloud-provider-vsphere's config package
Version-Release number of selected component (if applicable):
4.18, 4.17.z
How reproducible:
Always
Additional info:
The upstream package was overridden to fix logging confusion while we waited for an upstream fix. The fix is now ready, and the third-party override needs to be removed.
Description of problem:
After branching, main branch still publishes Konflux builds to mce-2.7
Version-Release number of selected component (if applicable):
mce-2.7
How reproducible:
100%
Steps to Reproduce:
1. Post a PR to main
2. Check the jobs that run
Actual results:
Both mce-2.7 and main Konflux builds get triggered
Expected results:
Only main branch Konflux builds get triggered
Additional info:
Description of problem:
After installing the MCE operator, I tried to create a MultiClusterEngine instance and it failed with the error: "error applying object Name: mce Kind: ConsolePlugin Error: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": service "webhook" not found". Checked in openshift-console-operator: there is no webhook service, and the deployment "console-conversion-webhook" is also missing.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-25-103421
How reproducible:
Always
Steps to Reproduce:
1. Check resources in openshift-console-operator, such as the deployment and service. 2. 3.
Actual results:
1. There is no webhook-related deployment, pod, or service.
Expected results:
1. The webhook-related resources should exist.
Additional info:
Description of problem:
Edit Deployment and Edit DeploymentConfig actions redirect user to project workloads page instead of resource details page
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-22-123921
How reproducible:
Always
Steps to Reproduce:
1. User uses the `Edit Deployment` or `Edit DeploymentConfig` action in either Form or YAML view and saves the changes
Actual results:
1. user will be redirected to project workloads page
Expected results:
1. user should be taken to resource details page
Additional info:
Description of problem:
Cancelling the file browser dialog after an initial file was previously uploaded causes a TypeError crash
Version-Release number of selected component (if applicable):
4.18.0-0.ci-2024-10-30-043000
How reproducible:
always
Steps to Reproduce:
1. User logs in to the console 2. Goes to Secrets -> Create Image pull secret; on the page - Secret name: test-secret - Authentication type: Upload configuration file - click on browse and upload some file. 3. Then we try to browse for another file, but instead of uploading another file we cancel the file chooser dialog; the console crashes with 'Cannot read properties of undefined (reading 'size')' error.
Actual results:
Console crashes with 'Cannot read properties of undefined (reading 'size')' error
Expected results:
Console should not crash.
Additional info:
Description of problem:
On one ingress details page, click "Edit" button for Labels, it opens annotation edit modal.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-10-133647 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. Go to one ingress details page, click the "Edit" button for Labels. 2. 3.
Actual results:
1. The "Edit annotations" modal is opened.
Expected results:
1. Should open "Edit labels" modal.
Additional info:
Description of problem:
Enabling the Shipwright tests in CI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In cluster-capi-operator, if the VsphereCluster object gets deleted, the controller attempts to recreate it and fails while trying to also recreate its corresponding vsphere credentials secret, which instead still exists. The failure is highlighted by the following logs in the controller: `resourceVersion should not be set on objects to be created`
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Delete VsphereCluster 2. Check the cluster-capi-operator logs 3.
Actual results:
VsphereCluster fails to be recreated as the reconciliation fails during ensuring the vsphere credentials secret
Expected results:
VsphereCluster gets recreated
Additional info:
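As a hedged sketch of the usual fix pattern for the "resourceVersion should not be set on objects to be created" error (not necessarily this operator's actual code), the server-populated metadata can be stripped from the cached copy before re-creating the object:

~~~
// Minimal sketch: re-create a Secret from an existing copy without carrying
// over server-populated metadata, which the Create call rejects. Written as a
// fragment meant to live inside a controller package.
package recreate

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recreateSecret clears the fields the API server owns and submits the copy.
func recreateSecret(ctx context.Context, c client.Client, existing *corev1.Secret) error {
	fresh := existing.DeepCopy()
	fresh.ResourceVersion = ""
	fresh.UID = ""
	fresh.CreationTimestamp = metav1.Time{}
	fresh.ManagedFields = nil
	return c.Create(ctx, fresh)
}
~~~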
Please review the following PR: https://github.com/openshift/csi-operator/pull/231
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
On 1.8.2024, the assisted-installer-agent job started failing the subsystem test "add_multiple_servers". We need to make sure it occurs only in tests, and the fix should be backported.
Description of problem:
There is a spelling error for the word `instal`; it should be `install`
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-03-211053
How reproducible:
Always
Steps to Reproduce:
1. normal user open Lightspeed hover button, check the messages 2. 3.
Actual results:
Must have administrator accessContact your administrator and ask them to instal Red Hat OpenShift Lightspeed.
Expected results:
word `instal` should be `install`
Additional info:
Description of problem:
When we enable OCB functionality and create a MC that configures an enforcing=0 kernel argument, the MCP is degraded, reporting this message: { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Version-Release number of selected component (if applicable):
IPI on AWS $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-05-30-021120 True False 97m Error while reconciling 4.16.0-0.nightly-2024-05-30-021120: the cluster operator olm is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview $ oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}' 2. Configure a MSOC resource to enable OCB functionality in the worker pool When we hit this problem we were using the mcoqe quay repository. A copy of the pull-secret for baseImagePullSecret and renderedImagePushSecret and no currentImagePullSecret configured. apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: worker spec: machineConfigPool: name: worker # buildOutputs: # currentImagePullSecret: # name: "" buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: pull-copy renderedImagePushSecret: name: pull-copy renderedImagePushspec: "quay.io/mcoqe/layering:latest" 3. Create a MC to use enforing=0 kernel argument { "kind": "List", "apiVersion": "v1", "metadata": {}, "items": [ { "apiVersion": "machineconfiguration.openshift.io/v1", "kind": "MachineConfig", "metadata": { "labels": { "machineconfiguration.openshift.io/role": "worker" }, "name": "change-worker-kernel-selinux-gvr393x2" }, "spec": { "config": { "ignition": { "version": "3.2.0" } }, "kernelArguments": [ "enforcing=0" ] } } ] }
Actual results:
The worker MCP is degraded reporting this message: oc get mcp worker -oyaml .... { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Expected results:
The MC should be applied without problems and selinux should be using enforcing=0
Additional info:
Description of problem:
In hostedcluster installations, when the OAuthServer service is configured without any hostname parameter, the oauth route is created in the management cluster with the standard hostname, following the pattern of the ingresscontroller wildcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>):

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
~~~

On the other hand, if any custom hostname parameter is configured, the oauth route is created in the management cluster with the following labels:

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      route:
        hostname: oauth.<custom-domain>
      type: Route

$ oc get routes -n hcp-ns --show-labels
NAME    HOST/PORT               LABELS
oauth   oauth.<custom-domain>   hypershift.openshift.io/hosted-control-plane=hcp-ns <---
~~~

The configured label causes the ingresscontroller not to admit the route, as the following configuration is added by the hypershift operator to the default ingresscontroller resource:

~~~
$ oc get ingresscontroller -n openshift-ingress-default default -oyaml
  routeSelector:
    matchExpressions:
    - key: hypershift.openshift.io/hosted-control-plane <---
      operator: DoesNotExist <---
~~~

This configuration should be allowed, as there are use cases where the route should have a customized hostname. Currently the HCP platform is not allowing this configuration and the oauth route does not work.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Easily
Steps to Reproduce:
1. Install HCP cluster 2. Configure OAuthServer with type Route 3. Add a custom hostname different than default wildcard ingress URL from management cluster
Actual results:
Oauth route is not admitted
Expected results:
Oauth route should be admitted by Ingresscontroller
Additional info:
Version of components:
OCP version
4.16.0-0.nightly-2024-11-05-003735
Operator bundle: quay.io/rhobs/observability-operator-bundle:0.4.3-241105092032
Description of issue:
When Tracing UI plugin instance is created. The distributed-tracing-* pod shows the following errors and the Tracing UI is not available in the OCP web console.
% oc logs distributed-tracing-745f655d84-2jk6b time="2024-11-05T13:08:37Z" level=info msg="enabled features: []\n" module=main time="2024-11-05T13:08:37Z" level=error msg="cannot read base manifest file" error="open web/dist/plugin-manifest.json: no such file or directory" module=manifest time="2024-11-05T13:08:37Z" level=info msg="listening on https://:9443" module=server I1105 13:08:37.620932 1 tlsconfig.go:240] "Starting DynamicServingCertificateController" 10.128.0.109 - - [05/Nov/2024:13:08:54 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62 10.128.0.109 - - [05/Nov/2024:13:08:54 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62 10.128.0.109 - - [05/Nov/2024:13:09:10 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62 10.128.0.109 - - [05/Nov/2024:13:09:25 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62
Steps to reproduce the issue:
* Install the latest operator bundle.
quay.io/rhobs/observability-operator-bundle:0.4.3-241105092032
* Set the -openshift.enabled flag in the CSV.
* Create the Tracing UI plugin instance and check the UI plugin pod logs.
Description of problem: If a customer applies ethtool configuration to the interface used in br-ex, that configuration will be dropped when br-ex is created. We need to read and apply the configuration from the interface to the phys0 connection profile, as described in https://issues.redhat.com/browse/RHEL-56741?focusedId=25465040&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25465040
Version-Release number of selected component (if applicable): 4.16
How reproducible: Always
Steps to Reproduce:
1. Deploy a cluster with an NMState config that sets the ethtool.feature.esp-tx-csum-hw-offload field to "off"
2.
3.
Actual results: The ethtool setting is only applied to the interface profile which is disabled after configure-ovs runs
Expected results: The ethtool setting is present on the configure-ovs-created profile
Additional info:
Affected Platforms: VSphere. Probably baremetal too and possibly others.
Description of problem:
The whereabouts kubeconfig is known to expire. If the cluster credentials and the kubernetes secret change, the whereabouts kubeconfig (which is stored on disk) is not updated to reflect the credential change.
Version-Release number of selected component (if applicable):
>= 4.8.z (all OCP versions which ship Whereabouts)
How reproducible:
With time.
Steps to Reproduce:
1. Wait for cluster credentials to expire (which may take a year depending on cluster configuration) (currently unaware of a technique to force a credentials change to the serviceaccount secret token)
Actual results:
Kubeconfig is out of date and Whereabouts cannot properly authenticate with API server
Expected results:
Kubeconfig is updated and Whereabouts can authenticate with API server
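As a hedged sketch of one possible direction (not the Whereabouts implementation; the output path and refresh interval are assumptions), the on-disk kubeconfig could be re-rendered periodically from the pod's projected service account credentials so a rotated token or CA is picked up:

~~~
// Minimal sketch: periodically rewrite a kubeconfig from the projected
// service account token and CA so credential rotation is reflected on disk.
package main

import (
	"os"
	"time"

	"k8s.io/client-go/tools/clientcmd"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

const (
	tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"
	caPath    = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
	// outPath is assumed for illustration only.
	outPath = "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
)

func writeKubeconfig(apiServer string) error {
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return err
	}
	cfg := clientcmdapi.NewConfig()
	cfg.Clusters["local"] = &clientcmdapi.Cluster{Server: apiServer, CertificateAuthority: caPath}
	cfg.AuthInfos["whereabouts"] = &clientcmdapi.AuthInfo{Token: string(token)}
	cfg.Contexts["whereabouts"] = &clientcmdapi.Context{Cluster: "local", AuthInfo: "whereabouts"}
	cfg.CurrentContext = "whereabouts"
	return clientcmd.WriteToFile(*cfg, outPath)
}

func main() {
	// Re-render on an interval; the kubelet rotates the projected token.
	for {
		_ = writeKubeconfig("https://kubernetes.default.svc")
		time.Sleep(5 * time.Minute)
	}
}
~~~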
Description of the problem:
Trying to create a cluster (multi-node; operators: mtv + cnv + lvms) with minimal requirements,
according to the preflight response (attached below):
We should need 5 vCPU cores as the minimal requirement for a worker;
however, when creating the cluster it is asking for 6 instead of 5.
The tooltip says:
Require at least 6 CPU cores for worker role, found only 5.
{"ocp":{"master":{"qualitative":null,"quantitative":{"cpu_cores":4,"disk_size_gb":20,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":100,"packet_loss_percentage":0,"ram_mib":16384}},"worker":{"qualitative":null,"quantitative":{"cpu_cores":2,"disk_size_gb":20,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":1000,"packet_loss_percentage":10,"ram_mib":8192}}},"operators":[{"dependencies":[],"operator_name":"lso","requirements":{"master":{"qualitative":null,"quantitative":{}},"worker":{"qualitative":null,"quantitative":{}}}},{"dependencies":["lso"],"operator_name":"odf","requirements":{"master":{"qualitative":["Requirements apply only for master-only clusters","At least 3 hosts","At least 1 non-boot SSD or HDD disk on 3 hosts"],"quantitative":{"cpu_cores":6,"ram_mib":19456}},"worker":{"qualitative":["Requirements apply only for clusters with workers","5 GiB of additional RAM for each non-boot disk","2 additional CPUs for each non-boot disk","At least 3 workers","At least 1 non-boot SSD or HDD disk on 3 workers"],"quantitative":{"cpu_cores":8,"ram_mib":19456}}}},{"dependencies":["lso"],"operator_name":"cnv","requirements":{"master":{"qualitative":["Additional 1GiB of RAM per each supported GPU","Additional 1GiB of RAM per each supported SR-IOV NIC","CPU has virtualization flag (vmx or svm)"],"quantitative":{"cpu_cores":4,"ram_mib":150}},"worker":{"qualitative":["Additional 1GiB of RAM per each supported GPU","Additional 1GiB of RAM per each supported SR-IOV NIC","CPU has virtualization flag (vmx or svm)"],"quantitative":{"cpu_cores":2,"ram_mib":360}}}},{"dependencies":[],"operator_name":"lvm","requirements":{"master":{"qualitative":["At least 1 non-boot disk per host","100 MiB of additional RAM","1 additional CPUs for each non-boot disk"],"quantitative":{"cpu_cores":1,"ram_mib":100}},"worker":{"qualitative":null,"quantitative":{}}}},{"dependencies":[],"operator_name":"mce","requirements":{"master":{"qualitative":[],"quantitative":{"cpu_cores":4,"ram_mib":16384}},"worker":{"qualitative":[],"quantitative":{"cpu_cores":4,"ram_mib":16384}}}},{"dependencies":["cnv"],"operator_name":"mtv","requirements":{"master":{"qualitative":["1024 MiB of additional RAM","1 additional CPUs"],"quantitative":{"cpu_cores":1,"ram_mib":1024}},"worker":{"qualitative":["1024 MiB of additional RAM","1 additional CPUs"],"quantitative":{"cpu_cores":1,"ram_mib":1024}}}}]}
How reproducible:
100%
Steps to reproduce:
1. create a multi cluster
2. select mtv + lvms + cnv
3. Add a worker node with 5 CPU cores
Actual results:
Unable to continue the installation process; the cluster is asking for an extra CPU core.
Expected results:
Should be able to install the cluster; 5 CPUs should be enough.
Description of problem:
Trying to install AWS EFS Driver 4.15 in 4.16 OCP. And driver pods get stuck with the below error: $ oc get pods NAME READY STATUS RESTARTS AGE aws-ebs-csi-driver-controller-5f85b66c6-5gw8n 11/11 Running 0 80m aws-ebs-csi-driver-controller-5f85b66c6-r5lzm 11/11 Running 0 80m aws-ebs-csi-driver-node-4mcjp 3/3 Running 0 76m aws-ebs-csi-driver-node-82hmk 3/3 Running 0 76m aws-ebs-csi-driver-node-p7g8j 3/3 Running 0 80m aws-ebs-csi-driver-node-q9bnd 3/3 Running 0 75m aws-ebs-csi-driver-node-vddmg 3/3 Running 0 80m aws-ebs-csi-driver-node-x8cwl 3/3 Running 0 80m aws-ebs-csi-driver-operator-5c77fbb9fd-dc94m 1/1 Running 0 80m aws-efs-csi-driver-controller-6c4c6f8c8c-725f4 4/4 Running 0 11m aws-efs-csi-driver-controller-6c4c6f8c8c-nvtl7 4/4 Running 0 12m aws-efs-csi-driver-node-2frs7 0/3 Pending 0 6m29s aws-efs-csi-driver-node-5cpb8 0/3 Pending 0 6m26s aws-efs-csi-driver-node-bchg5 0/3 Pending 0 6m28s aws-efs-csi-driver-node-brndb 0/3 Pending 0 6m27s aws-efs-csi-driver-node-qcc4m 0/3 Pending 0 6m27s aws-efs-csi-driver-node-wpk5d 0/3 Pending 0 6m27s aws-efs-csi-driver-operator-6b54c78484-gvxrt 1/1 Running 0 13m Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 6m58s default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector. Warning FailedScheduling 3m42s (x2 over 4m24s) default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
all the time
Steps to Reproduce:
1. Install AWS EFS CSI driver 4.15 in 4.16 OCP 2. 3.
Actual results:
EFS CSI drive node pods are stuck in pending state
Expected results:
All pod should be running.
Additional info:
More info on the initial debug here: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1715757611210639
Description of problem:
In a 4.18 Azure Stack Hub cluster, the Azure-Disk CSI Driver doesn't work, failing with the following error when provisioning a volume:
E1024 05:36:01.335536 1 utils.go:110] GRPC error: rpc error: code = Internal desc = PUT https://management.mtcazs.wwtatc.com/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/ci-op-wv5kxjrl-cc5c6/providers/Microsoft.Compute/disks/pvc-854653a6-6107-44ff-95e3-a6d588864420 -------------------------------------------------------------------------------- RESPONSE 400: 400 Bad Request ERROR CODE: NoRegisteredProviderFound -------------------------------------------------------------------------------- { "error": { "code": "NoRegisteredProviderFound", "message": "No registered resource provider found for location 'mtcazs' and API version '2023-10-02' for type 'disks'. The supported api-versions are '2017-03-30, 2018-04-01, 2018-06-01, 2018-09-30, 2019-03-01, 2019-07-01, 2019-11-01'. The supported locations are 'mtcazs'." } } --------------------------------------------------------------------------------
Version-Release number of selected component (if applicable):
OCP:4.18.0-0.nightly-2024-10-23-112324 AzureDisk CSI Driver: v1.30.4
How reproducible:
Always
Steps to Reproduce:
1. Create cluster on Azure Stack Hub with prometheus pvc configurated 2. Volume provisioning failed due to "NoRegisteredProviderFound"
Actual results:
Volume provisioning failed
Expected results:
Volume provisioning should succeed
Additional info:
Summary
Duplicate issue of https://issues.redhat.com/browse/OU-258.
To pass the CI/CD requirements of the openshift/console each PR needs to have a issue in a OCP own Jira board.
This issue migrates the rendering of the Developer Perspective > Observe > Metrics page from openshift/console to openshift/monitoring-plugin.
openshift/console PR#4187: Removes the Metrics Page.
openshift/monitoring-plugin PR#138: Adds the Metrics Page & consolidates the code to use the same components as the Administrative > Observe > Metrics Page.
—
Testing
Both openshift/console PR#4187 & openshift/monitoring-plugin PR#138 need to be launched to see the full feature. After launching both the PRs you should see a page like the screenshot attached below.
—
Excerpt from OU-258 : https://issues.redhat.com/browse/OU-258 :
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
The UX of the two pages differs somewhat, so we will need to decide whether we can change the dev console to use the same UX as the admin page or whether we need to keep some differences. This is an opportunity to bring the improved PromQL editing UX from the admin console to the dev console.
OCPBUGS-36283 introduced the ability to switch on TLS between the BMC and Metal3's httpd server. It is currently off by default to make the change backportable without a high risk of regressions. We need to turn it on for 4.18+ for consistency with CBO-deployed Metal3.
Description of problem:
The kubeconfigs for the DNS Operator and the Ingress Operator are managed by Hypershift and they should only be managed by the cloud service provider. This can lead to the kubeconfig/certificate being invalid in the cases where the cloud service provider further manages the kubeconfig (for example ca-rotation).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Circular dependencies in the OCP Console prevent the migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
Description of problem:
The OpenShift Pipelines operator automatically installs a OpenShift console plugin. The console plugin metrics reports this as unknown after the plugin was renamed from "pipeline-console-plugin" to "pipelines-console-plugin".
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
Always
Steps to Reproduce:
Actual results:
It shows an "unknown" plugin in the metrics.
Expected results:
It should shows a "pipelines" plugin in the metrics.
Additional info:
None
Description of problem:
We are in a live migration scenario.
If a project has a networkpolicy to allow from the host network (more concretely, to allow from the ingress controllers and the ingress controllers are in the host network), traffic doesn't work during the live migration between any ingress controller node (either migrated or not migrated) and an already migrated application node.
I'll expand later in the description and internal comments, but the TL;DR is that the IPs of the tun0 of not migrated source nodes and the IPs of the ovn-k8s-mp0 from migrated source nodes are not added to the address sets related to the networkpolicy ACL in the target OVN-Kubernetes node, so that traffic is not allowed.
Version-Release number of selected component (if applicable):
4.16.13
How reproducible:
Always
Steps to Reproduce:
1. Before the migration: have a project with a networkpolicy that allows from the ingress controller and the ingress controller in the host network. Everything must work properly at this point.
2. Start the migration
3. During the migration, check connectivity from the host network of either a migrated node or a non-migrated node. Both will fail (checking from the same node doesn't fail)
Actual results:
Pod on the worker node is not reachable from the host network of the ingress controller node (unless the pod is in the same node than the ingress controller), which causes the ingress controller routes to throw 503 error.
Expected results:
Pod on the worker node to be reachable from the ingress controller node, even when the ingress controller node has not migrated yet and the application node has.
Additional info:
This is not a duplicate of OCPBUGS-42578. This bug refers to the host-to-pod communication path while the other one doesn't.
This is a customer issue. More details to be included in private comments for privacy.
Workaround: Creating a networkpolicy that explicitly allows traffic from tun0 and ovn-k8s-mp0 interfaces. However, note that the workaround can be problematic for clusters with hundreds or thousands of projects. Another possible workaround is to temporarily delete all the networkpolicies of the projects. But again, this may be problematic (and a security risk).
"operator conditions kube-apiserver" is showing as regressed in 4.17 (and 4.18) for metal and vsphere.
Stephen Benjamin noted there is one line of JQ used to create the tests and has offered to try to stabilize that code some. Ultimately TRT-1764 is intended to build out a smarter framework. This bug is to see what can be done in the short term.
Description of problem:
Shipwright operator installation through CLI is failing - Failure: # Shipwright build details page.Shipwright build details page Shipwright tab should be default on first open if the operator is installed (ODC-7623): SWB-01-TC01 Error: Failed to install Shipwright Operator - Pod timeout
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In the Secret details view, if one of the data properties from the Secret contains a tab character, it is considered "unprintable" and the content cannot be viewed in the console. This is not correct. Tab characters can be printed and should not prevent content from being viewed. We have a dependency, "istextorbinary", that will determine if a buffer contains binary. We should use it here.
Version-Release number of selected component (if applicable): 4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Download [this file](https://gist.github.com/TheRealJon/eb1e2eaf80c923938072f8a997fed3cd/raw/04b7307d31a825ae686affd9da0c0914d490abd3/pull-secret-with-tabs.json) 2. Run this command: oc create secret generic test -n default --from-file=.dockerconfigjson=<path-to-file-from-step-1> --type=kubernetes.io/dockerconfigjson 3. In the console, navigate to Workloads -> Secrets and make sure that the "default" project is selected from the project dropdown. 4. Select the Secret named "test" 5. Scroll to the bottom to view the data content of the Secret
Actual results:
The "Save this file" option is shown, and user is unable to reveal the contents of the Secret
Expected results:
The "Save this file" option should not be shown, the obfuscated content should be rendered, and the reveal/hide button should show and hide the content from the pull secret.
Additional info:
There is logic in this view that prevents us from trying to render binary data by detecting "unprintable characters". The regex for this includes the Tab character, which is incorrect, since that character is printable.
Refactor the name to Dockerfile.ocp as a better, version-independent alternative.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
[sig-arch] events should not repeat pathologically for ns/openshift-machine-api
The machine-api resource seems to not be responding to the `/healthz` requests from kubelet, causing an increase in probe error events. The pod does seem to be up, and a preliminary look at Loki shows that the `/healthz` endpoint does seem to be up, but it loses the leader lease in between, before starting the health probe again.
(read from bottom up)
I1016 19:51:31.418815 1 server.go:191] "Starting webhook server" logger="controller-runtime.webhook" I1016 19:51:31.418764 1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false I1016 19:51:31.418703 1 server.go:83] "starting server" name="health probe" addr="[::]:9441" I1016 19:51:31.418650 1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics" 2024/10/16 19:51:31 Starting the Cmd. ... 2024/10/16 19:50:44 leader election lost I1016 19:50:44.406280 1 leaderelection.go:297] failed to renew lease openshift-machine-api/cluster-api-provider-machineset-leader: timed out waiting for the condition error E1016 19:50:44.406230 1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-machineset-leader": context deadline exceeded error E1016 19:50:37.430054 1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path error E1016 19:50:04.423920 1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io cluster-api-provider-machineset-leader) error E1016 19:49:04.422237 1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path .... I1016 19:46:21.358989 1 server.go:83] "starting server" name="health probe" addr="[::]:9441" I1016 19:46:21.358891 1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false I1016 19:46:21.358682 1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics" 2024/10/16 19:46:21 Starting the Cmd.
Description of problem:
Circular dependencies in the OCP Console prevent the migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
Description of problem:
The image ecosystem testsuite sometimes fails due to timeouts in samples smoke tests in origin - the tests starting with "[sig-devex][Feature:ImageEcosystem][Slow] openshift sample application repositories". These can be caused by either the build taking too long (for example the rails application tends to take quite a while to build) or the application actually can start quite slowly. There is no bullet proof solution here but to try and increase the timeouts to a value that both provides enough time and doesn't stall the testsuite for too long.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run the image-ecosystem testsuite 2. 3.
Actual results:
sometime the testsuite fails because of timeouts
Expected results:
no timeouts
Additional info:
Description of problem:
ConsolePlugin example YAML lacks required data
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-30-231249
How reproducible:
Always
Steps to Reproduce:
1. goes to ConsolePlugins list page /k8s/cluster/customresourcedefinitions/consoleplugins.console.openshift.io/instances or /k8s/cluster/console.openshift.io~v1~ConsolePlugin 2. Click on 'Create ConsolePlugin' button
Actual results:
The example YAML is quite simple and lacks required data; the user will get various errors when trying to use the example YAML:
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: example
spec: {}
Expected results:
We should add complete YAML as an example or create a default Sample
Additional info:
Description of problem:
Add two new props to VirtualizedTable in order to make the header checkbox work: allRowsSelected and canSelectAll. allRowsSelected will check the checkbox, and canSelectAll will control whether the header checkbox is shown or hidden.
Description of problem:
When the vSphere CSI driver is removed (using managementState: Removed), it leaves all existing conditions in the ClusterCSIDriver. IMO it should delete all of them and keep something like "Disabled: true", as we use for the Manila CSI driver operator.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-09-031511
How reproducible: always
Steps to Reproduce:
Actual results: All Deployment + DaemonSet conditions are present
Expected results: The conditions are pruned.
Description of problem:
OpenShift automatically installs the OpenShift networking plugin, but the console plugin metrics reports this as "unknown".
Version-Release number of selected component (if applicable):
4.17+ ???
How reproducible:
Always
Steps to Reproduce:
Actual results:
It shows an "unknown" plugin in the metrics.
Expected results:
It should shows a "networking" plugin in the metrics.
Additional info:
None
Description of problem:
While working on the readiness probes we have discovered that the single member health check always allocates a new client. Since this is an expensive operation, we can make use of the pooled client (that already has a connection open) and change the endpoints for a brief period of time to the single member we want to check. This should reduce CEO's and etcd CPU consumption.
Version-Release number of selected component (if applicable):
any supported version
How reproducible:
always, but technical detail
Steps to Reproduce:
na
Actual results:
CEO creates a new etcd client when it is checking a single member health
Expected results:
CEO should use the existing pooled client to check for single member health
Additional info:
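As a hedged sketch of the idea (not CEO's actual code), the etcd v3 client's maintenance API already lets an existing, pooled client probe one member by dialing the given endpoint directly, which even sidesteps the brief endpoint swap described above; the endpoint value below is a placeholder:

~~~
// Minimal sketch: check a single etcd member's health with an already-pooled
// client instead of constructing a new client per check.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// checkMember probes one member endpoint with the maintenance API; Status
// dials that endpoint directly, so no per-check client is needed.
func checkMember(ctx context.Context, c *clientv3.Client, endpoint string) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	resp, err := c.Status(ctx, endpoint)
	if err != nil {
		return fmt.Errorf("member %s unhealthy: %w", endpoint, err)
	}
	if len(resp.Errors) > 0 {
		return fmt.Errorf("member %s reports errors: %v", endpoint, resp.Errors)
	}
	return nil
}

func main() {
	// Placeholder endpoint; in CEO it would come from the member list.
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"https://10.0.0.5:2379"}})
	if err != nil {
		fmt.Println(err)
		return
	}
	defer cli.Close()
	fmt.Println(checkMember(context.Background(), cli, "https://10.0.0.5:2379"))
}
~~~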
Description of problem:
HyperShift currently runs 3 replicas of active/passive HA deployments such as kube-controller-manager, kube-scheduler, etc. In order to reduce the overhead of running a HyperShift control plane, we should be able to run these deployments with 2 replicas. In a 3 zone environment with 2 replicas, we can still use a rolling update strategy, and set the maxSurge value to 1, as the new pod would schedule into the unoccupied zone.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
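As a hedged sketch of what the described rollout looks like in Deployment terms (illustrative helper, not HyperShift's actual code): two replicas with a rolling update of maxSurge=1 and maxUnavailable=0, so the surge pod can schedule into the third, unoccupied zone.

~~~
// Minimal sketch: replica count and rolling-update strategy for a 2-replica
// active/passive HA deployment spread across 3 zones.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func haStrategy() (int32, appsv1.DeploymentStrategy) {
	replicas := int32(2)
	maxSurge := intstr.FromInt32(1)
	maxUnavailable := intstr.FromInt32(0)
	return replicas, appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}

func main() {
	replicas, strategy := haStrategy()
	fmt.Println(replicas, strategy.Type)
}
~~~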
Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/172
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/193
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
openshift install fails with "failed to lease wait: Invalid configuration for device '0'. generated yaml below: additionalTrustBundlePolicy: Proxyonly apiVersion: v1 baseDomain: XXX compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: vsphere: coresPerSocket: 2 cpus: 8 memoryMB: 40960 osDisk: diskSizeGB: 150 zones: - generated-failure-domain replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: vsphere: coresPerSocket: 2 cpus: 4 memoryMB: 32768 osDisk: diskSizeGB: 150 zones: - generated-failure-domain replicas: 3 metadata: creationTimestamp: null name: dc3 networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 machineNetwork: - cidr: 10.0.0.0/16 networkType: OVNKubernetes serviceNetwork: - 172.30.0.0/16 platform: vsphere: apiVIP: 172.21.0.20 apiVIPs: - 172.21.0.20 cluster: SA-LAB datacenter: OVH-SA defaultDatastore: DatastoreOCP failureDomains: - name: generated-failure-domain region: generated-region server: XXX topology: computeCluster: /OVH-SA/host/SA-LAB datacenter: OVH-SA datastore: /OVH-SA/datastore/DatastoreOCP networks: - ocpdemo resourcePool: /OVH-SA/host/SA-LAB/Resources zone: generated-zone ingressVIP: 172.21.0.21 ingressVIPs: - 172.21.0.21 network: ocpdemo ~~~ Truncated~~~
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1.openshift-install create cluster 2.choose Vsphere 3.
Actual results:
Error
Expected results:
Cluster creation
Additional info:
Description of problem:
A regular user can update route spec.tls.certificate/key without extra permissions, but if the user tries to edit/patch spec.tls.externalCertificate, it reports the error: spec.tls.externalCertificate: Forbidden: user does not have update permission on custom-host
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-21-221942
How reproducible:
100%
Steps to Reproduce:
1. Log in as a regular user and create a namespace, pod, svc and edge route $ oc create route edge myedge --service service-unsecure --cert tls.crt --key tls.key $ oc get route myedge -oyaml 2. Edit the route and remove one certificate from spec.tls.certificate $ oc edit route myedge $ oc get route myedge 3. Edit the route and restore the original spec.tls.certificate 4. Edit the route with spec.tls.externalCertificate
Actual results:
1. edge route is admitted and works well $ oc get route myedge -oyaml <......> spec: host: myedge-test3.apps.hongli-techprev.qe.azure.devcluster.openshift.com port: targetPort: http tls: certificate: | -----BEGIN CERTIFICATE----- XXXXXXXXXXXXXXXXXXXXXXXXXXX -----END CERTIFICATE----- -----BEGIN CERTIFICATE----- XXXXXXXXXXXXXXXXXXXXXXXX -----END CERTIFICATE----- key: | -----BEGIN RSA PRIVATE KEY----- <......> 2. route is failed validation since "private key does not match public key" $ oc get route myedge NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD myedge ExtendedValidationFailed service-unsecure http edge None 3. route is admitted again after the spec.tls.certificate is restored 4. reports error when updating spec.tls.externalCertificate spec.tls.externalCertificate: Forbidden: user does not have update permission on custom-host
Expected results:
The user should have the same permission to update both spec.tls.certificate and spec.tls.externalCertificate
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/161
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc-mirror produces image signature config maps in JSON format, inconsistent with other manifests, which are normally in YAML. That breaks some automation, especially the Multicloud Operators Subscription controller, which expects manifests in YAML only.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Perform release payload mirroring as documented 2. Check 'release-signatures' directory
Actual results:
There is a mix of YAML and JSON files with kubernetes manifests.
Expected results:
Manifests are stored in one format, either YAML or JSON
Additional info:
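For context on how small the conversion is, a hedged sketch (not oc-mirror's code; the ConfigMap content below is illustrative) using sigs.k8s.io/yaml, which understands Kubernetes JSON tags:

~~~
// Minimal sketch: convert a JSON manifest (such as a signature ConfigMap) to
// YAML so all generated manifests share one format.
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	jsonManifest := []byte(`{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"sha256-example","namespace":"openshift-config-managed"},"binaryData":{}}`)

	// JSONToYAML preserves field names and values while changing only the
	// serialization format.
	out, err := yaml.JSONToYAML(jsonManifest)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
~~~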
Description of problem:
An unexpected validation failure occurs when creating the agent ISO image if the RendezvousIP is a substring of the next-hop-address set for a worker node.
For example this configuration snippet in agent-config.yaml:
apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: agent-config
rendezvousIP: 7.162.6.1
hosts:
  ...
  - hostname: worker-0
    role: worker
    networkConfig:
      interfaces:
        - name: eth0
          type: Ethernet
          state: up
          ipv4:
            enabled: true
            address:
              - ip: 7.162.6.4
                prefix-length: 25
            dhcp: false
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 7.162.6.126
            next-hop-interface: eth0
            table-id: 254
Will result in the validation failure when creating the image:
FATAL failed to fetch Agent Installer ISO: failed to fetch dependency of "Agent Installer ISO": failed to fetch dependency of "Agent Installer Artifacts": failed to fetch dependency of "Agent Installer Ignition": failed to fetch dependency of "Agent Manifests": failed to fetch dependency of "NMState Config": failed to generate asset "Agent Hosts": invalid Hosts configuration: [Hosts[3].Host: Forbidden: Host worker-0 has role 'worker' and has the rendezvousIP assigned to it. The rendezvousIP must be assigned to a control plane host.
The problem is this check here https://github.com/openshift/installer/pull/6716/files#diff-fa305fe33630f77b65bd21cc9473b620f67cfd9ce35f7ddf24d03b26ec2ccfffR293
It's checking for the IP in the raw nmConfig. The problem is that the routes stanza is also included in the nmConfig, and the route is
next-hop-address: 7.162.6.126
So when the rendezvousIP is 7.162.6.1, that strings.Contains() check returns true and the validation fails.
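A hedged sketch of the kind of fix (not the installer's actual code; the struct models only the fields this check needs): parse the interface addresses out of the nmstate document and compare IPs instead of substring-matching the raw YAML, which also contains route hops.

~~~
// Minimal sketch: decide whether a host owns the rendezvous IP by comparing
// parsed interface addresses rather than substring-matching the raw config.
package main

import (
	"fmt"
	"net"

	"sigs.k8s.io/yaml"
)

// nmState models only the fields this check needs (IPv4 only, for brevity).
type nmState struct {
	Interfaces []struct {
		IPv4 struct {
			Address []struct {
				IP string `json:"ip"`
			} `json:"address"`
		} `json:"ipv4"`
	} `json:"interfaces"`
}

func hostHasIP(rawNMConfig []byte, rendezvousIP string) (bool, error) {
	target := net.ParseIP(rendezvousIP)
	if target == nil {
		return false, fmt.Errorf("invalid rendezvousIP %q", rendezvousIP)
	}
	var state nmState
	if err := yaml.Unmarshal(rawNMConfig, &state); err != nil {
		return false, err
	}
	for _, iface := range state.Interfaces {
		for _, addr := range iface.IPv4.Address {
			if ip := net.ParseIP(addr.IP); ip != nil && ip.Equal(target) {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	raw := []byte(`
interfaces:
  - name: eth0
    ipv4:
      address:
        - ip: 7.162.6.4
          prefix-length: 25
routes:
  config:
    - destination: 0.0.0.0/0
      next-hop-address: 7.162.6.126
`)
	// false: the 7.162.6.126 next hop no longer causes a false match.
	fmt.Println(hostHasIP(raw, "7.162.6.1"))
}
~~~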
Sometimes users want to make modifications while installing IBI, like creating new partitions on the disk. In order to save them and not have them overridden by the coreos-installer command, we need a way to provide params to the coreos-installer command.
Description of problem:
The e2e test, TestMetrics, is repeatedly failing with the following failure message: === RUN TestMetrics utils.go:135: Setting up pool metrics utils.go:636: Applied label "node-role.kubernetes.io/metrics" to node ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:722: Created MachineConfigPool "metrics" utils.go:140: Target Node: ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:124: No MachineConfig provided, will wait for pool "metrics" to include MachineConfig "00-worker" utils.go:252: Pool metrics has rendered configs [00-worker] with rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 6.039157947s) utils.go:286: Pool metrics has completed rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 1m14.043792995s) utils.go:145: Error Trace: /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:145 /go/src/github.com/openshift/machine-config-operator/test/e2e/mco_test.go:149 Error: Expected nil, but got: &fmt.wrapError{msg:"node config change did not occur (waited 37.479869ms): nodes \"ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q\" not found", err:(*errors.StatusError)(0xc00071a8c0)} Test: TestMetrics
Version-Release number of selected component (if applicable):
How reproducible:
Sporadically, but could potentially block e2e.
Steps to Reproduce:
Run the e2e-gcp-op test
Actual results:
=== RUN TestMetrics utils.go:135: Setting up pool metrics utils.go:636: Applied label "node-role.kubernetes.io/metrics" to node ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:722: Created MachineConfigPool "metrics" utils.go:140: Target Node: ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:124: No MachineConfig provided, will wait for pool "metrics" to include MachineConfig "00-worker" utils.go:252: Pool metrics has rendered configs [00-worker] with rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 6.039157947s) utils.go:286: Pool metrics has completed rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 1m14.043792995s) utils.go:145: Error Trace: /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:145 /go/src/github.com/openshift/machine-config-operator/test/e2e/mco_test.go:149 Error: Expected nil, but got: &fmt.wrapError{msg:"node config change did not occur (waited 37.479869ms): nodes \"ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q\" not found", err:(*errors.StatusError)(0xc00071a8c0)} Test: TestMetrics
Expected results:
The test should pass
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/8960
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Following up from OCPBUGS-16357, we should enable health check of stale registration sockets in our operators.
We will need - https://github.com/kubernetes-csi/node-driver-registrar/pull/322 and we will have to enable healthcheck for registration sockets - https://github.com/kubernetes-csi/node-driver-registrar#example
Description of problem:
EncryptionAtHost and DiskEncryptionSets are two features which should not be tightly coupled. They should be able to be enabled / disabled independently. Currently EncryptionAtHost is only enabled if DiskEncryptionSetID is a valid disk encryption set resource ID. https://github.com/openshift/hypershift/blob/0cc82f7b102dcdf6e5d057255be1bdb1593d1203/hypershift-operator/controllers/nodepool/azure.go#L81-L88
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1.See comments
Actual results:
EncryptionAtHost is only set if DiskEncryptionSetID is set.
Expected results:
EncryptionAtHost and DiskEncryptionSetID should be independently settable.
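A minimal sketch of the decoupled behavior expected above; the input type and field names are hypothetical placeholders, not the actual HyperShift NodePool API:
package azure
// diskEncryptionInput is a hypothetical stand-in for the relevant NodePool
// platform fields; it is not the real HyperShift API type.
type diskEncryptionInput struct {
	EncryptionAtHost    string // "Enabled" or "Disabled"
	DiskEncryptionSetID string // optional Azure resource ID
}
// applyEncryptionSettings evaluates the two knobs independently: host
// encryption no longer depends on a DiskEncryptionSetID being present,
// and a DiskEncryptionSetID can be used without host encryption.
func applyEncryptionSettings(in diskEncryptionInput) (encryptionAtHost bool, diskEncryptionSetID string) {
	encryptionAtHost = in.EncryptionAtHost == "Enabled"
	if in.DiskEncryptionSetID != "" {
		diskEncryptionSetID = in.DiskEncryptionSetID
	}
	return encryptionAtHost, diskEncryptionSetID
}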
Additional info:
https://redhat-external.slack.com/archives/C075PHEFZKQ/p1724772123804009
The customer's cloud credentials operator generates millions of the below messages per day in the GCP cluster.
And they want to reduce/stop these logs as they are consuming more disk space. Also, their "cloud credentials" operator runs in manual mode.
time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds time="2024-06-21T08:37:42Z" level=error msg="error creating GCP client" error="Secret \"gcp-credentials\" not found" time="2024-06-21T08:37:42Z" level=error msg="error determining whether a credentials update is needed" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm error="unable to check whether credentialsRequest needs update" time="2024-06-21T08:37:42Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=info msg="reconciling clusteroperator status" time="2024-06-21T08:37:42Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds
Description of problem:
When the user selects a shared vpc install, the created control plane service account is left over. To verify, after the destruction of the cluster check the principals in the host project for a remaining name XXX-m@some-service-account.com
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
No principal remaining
Additional info:
There were remaining issues from the original issue. A new bug has been opened to address this. This is a clone of issue OCPBUGS-32947. The following is the description of the original issue:
—
Description of problem:
[vSphere] network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-23-032717
How reproducible:
Always
Steps to Reproduce:
1.Install a vSphere 4.16 cluster, we use automated template: ipi-on-vsphere/versioned-installer liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-04-23-032717 True False 24m Cluster version is 4.16.0-0.nightly-2024-04-23-032717 2.Check the controlplanemachineset, you can see network.devices, template and workspace have value. liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Active 51m liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T02:52:11Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl name: cluster namespace: openshift-machine-api resourceVersion: "18273" uid: f340d9b4-cf57-4122-b4d4-0f45f20e4d79 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Active strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: - networkName: devqe-segment-221 numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: huliu-vs425c-f5tfl-rhcos-generated-region-generated-zone userDataSecret: name: master-user-data workspace: datacenter: DEVQEdatacenter datastore: /DEVQEdatacenter/datastore/vsanDatastore folder: /DEVQEdatacenter/vm/huliu-vs425c-f5tfl resourcePool: /DEVQEdatacenter/host/DEVQEcluster/Resources server: vcenter.devqe.ibmc.devcluster.openshift.com status: conditions: - lastTransitionTime: "2024-04-25T02:59:37Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:01:04Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 3.Delete the controlplanemachineset, it will recreate a new one, but those three fields that had values before are now cleared. 
liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster controlplanemachineset.machine.openshift.io "cluster" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Inactive 6s liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T03:45:51Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 name: cluster namespace: openshift-machine-api resourceVersion: "46172" uid: 45d966c9-ec95-42e1-b8b0-c4945ea58566 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Inactive strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: null numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: "" userDataSecret: name: master-user-data workspace: {} status: conditions: - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 4.I active the controlplanemachineset and it does not trigger an update, I continue to add these field values back and it does not trigger an update, I continue to edit these fields to add a second network device and it still does not trigger an update. network: devices: - networkName: devqe-segment-221 - networkName: devqe-segment-222 By the way, I can create worker machines with other network device or two network devices. huliu-vs425c-f5tfl-worker-0a-ldbkh Running 81m huliu-vs425c-f5tfl-worker-0aa-r8q4d Running 70m
Actual results:
network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Expected results:
The field values should not be changed when deleting the controlplanemachineset, and updating these fields should trigger an update. Alternatively, if these fields are not meant to be modified, then modifying them on the controlplanemachineset should have no effect; the current inconsistency is confusing.
Additional info:
Must gather: https://drive.google.com/file/d/1mHR31m8gaNohVMSFqYovkkY__t8-E30s/view?usp=sharing
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/194
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In a vSphere cluster, change clustercsidrivers.managementState from "Managed" to "Removed"; the VSphereProblemDetector check becomes less frequent (once in 24 hours), see log: Scheduled the next check in 24h0m0. That is as expected. Then change clustercsidrivers.managementState back from "Removed" to "Managed"; the VSphereProblemDetector check frequency is still 24 hours.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-01-175607
How reproducible:
Always
Steps to Reproduce:
See Description
Actual results:
The VSphereProblemDetector check frequency is once in 24 hours
Expected results:
The VSphereProblemDetector check frequency should return to once per hour
Additional info:
Component Readiness has found a potential regression in the following test:
[sig-mco][OCPFeatureGate:ManagedBootImages][Serial] Should degrade on a MachineSet with an OwnerReference [apigroup:machineconfiguration.openshift.io] [Suite:openshift/conformance/serial]
A new feature went live that ensures new tests in a release have at least a 95% pass rate. This test showed up immediately with a couple of bad runs in the last 20 attempts. The failures look similar, which indicates the test probably has a problem that could be fixed.
We suspect a timeout issue: the test takes about 25s on average with a 30s timeout.
Test has a 91.67% pass rate, but 95.00% is required.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-10T00:00:00Z
End Time: 2024-10-17T23:59:59Z
Success Rate: 91.67%
Successes: 22
Failures: 2
Flakes: 0
Insufficient pass rate
Description of problem:
Starting from version 4.16, the installer no longer supports creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled.
Version-Release number of selected component (if applicable):
How reproducible:
The installation procedure fails systematically when using a predefined VPC
Steps to Reproduce:
1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC 2. Run `openshift-install create cluster ...' 3. The procedure fails: `failed to create load balancer`
Actual results:
The installation procedure fails.
Expected results:
An OCP cluster to be provisioned in AWS, with public subnets only.
Additional info:
The on-prem-resolv-prepender.path unit is enabled in UPI setups when it should only run for IPI
Description of problem:
The 'Clear all filters' button is counted as part of the resource type count
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-19-002129
How reproducible:
Always
Steps to Reproduce:
1. navigate to Home -> Events page, choose 3 resource types, check what's shown on page 2. navigate to Home -> Search page, choose 3 resource types, check what's shown on page. Choose 4 resource types and check what's shown
Actual results:
1. It shows `1 more`, but only the 'Clear all filters' button is shown if we click the `1 more` button 2. The `1 more` button is only displayed when 4 resource types are selected; this part works as expected
Expected results:
1. The 'Clear all filters' button should not be counted as part of the resource count; the 'N more' label should reflect the correct number of resource types
Additional info:
Description of problem:
cluster-capi-operator's manifests-gen tool would generate CAPI providers transport configmaps with missing metadata details
Version-Release number of selected component (if applicable):
4.17, 4.18
How reproducible:
Not impacting payload, only a tooling bug
Description of problem:
On CI, all the OpenStack and Ansible related software is taken from pip and ansible-galaxy instead of the OS repositories.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Cluster's global address "<infra id>-apiserver" not deleted during "destroy cluster"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-08-15-212448
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then optionally insert interested settings (see [1]) 2. "create cluster", and make sure the cluster turns healthy finally (see [2]) 3. check the cluster's addresses on GCP (see [3]) 4. "destroy cluster", and make sure everything of the cluster getting deleted (see [4])
Actual results:
The global address "<infra id>-apiserver" is not deleted during "destroy cluster".
Expected results:
Everything belonging to the cluster should get deleted during "destroy cluster".
Additional info:
FYI we had a 4.16 bug once, see https://issues.redhat.com/browse/OCPBUGS-32306
Description of problem:
It is difficult to decide which component this bug should be reported against. The description is the following. Today we can install Red Hat operators either into one specific namespace or into all namespaces, which installs the operator in the "openshift-operators" namespace. If such an operator creates a ServiceMonitor that should be scraped by platform Prometheus, that ServiceMonitor has token authentication and security configured in its definition. But if the operator is installed in the "openshift-operators" namespace, it is user workload monitoring that tries to scrape it, since that namespace does not have the label required for platform monitoring to scrape it, and we don't want to add that label because community operators can also be installed there. The result is that user workload monitoring scrapes this namespace and the ServiceMonitors are skipped, since they are configured with security aimed at platform monitoring and UWM cannot handle this. A possible workaround is to run: oc label namespace openshift-operators openshift.io/user-monitoring=false, losing functionality since some Red Hat operators will not be monitored if installed in openshift-operators.
Version-Release number of selected component (if applicable):
4.16
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/376
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The installation of compact and HA clusters is failing in the vSphere environment. During the cluster setup, two master nodes were observed to be in a "Not Ready" state, and the rendezvous host failed to join the cluster.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-25-131159
How reproducible:
100%
Actual results:
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected level=info msg=Use the following commands to gather logs from the cluster level=info msg=openshift-install gather bootstrap --help level=error msg=Bootstrap failed to complete: : bootstrap process timed out: context deadline exceeded ERROR: Bootstrap failed. Aborting execution.
Expected results:
Installation should be successful.
Additional info:
Description of problem:
Sometimes the cluster-capi-operator pod gets stuck in CrashLoopBackOff on OSP
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-01-213905
How reproducible:
Sometimes
Steps to Reproduce:
1.Create an osp cluster with TechPreviewNoUpgrade 2.Check cluster-capi-operator pod 3.
Actual results:
cluster-capi-operator pod in CrashLoopBackOff status $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 0/1 CrashLoopBackOff 6 (2m54s ago) 41m $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 1/1 Running 7 (7m52s ago) 46m $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 0/1 CrashLoopBackOff 7 (2m24s ago) 50m E0806 03:44:00.584669 1 kind.go:66] "kind must be registered to the Scheme" err="no kind is registered for the type v1alpha7.OpenStackCluster in scheme \"github.com/openshift/cluster-capi-operator/cmd/cluster-capi-operator/main.go:86\"" logger="controller-runtime.source.EventHandler" E0806 03:44:00.685539 1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for clusteroperator caches to sync: timed out waiting for cache to be synced for Kind *v1alpha7.OpenStackCluster" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" I0806 03:44:00.685610 1 internal.go:516] "Stopping and waiting for non leader election runnables" I0806 03:44:00.685620 1 internal.go:520] "Stopping and waiting for leader election runnables" I0806 03:44:00.685646 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685706 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" I0806 03:44:00.685712 1 controller.go:242] "All workers finished" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" I0806 03:44:00.685717 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685722 1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685718 1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685720 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" I0806 03:44:00.685823 1 recorder_in_memory.go:80] &Event{ObjectMeta:{dummy.17e906d425f7b2e1 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:CustomResourceDefinitionUpdateFailed,Message:Failed to update CustomResourceDefinition.apiextensions.k8s.io/openstackclusters.infrastructure.cluster.x-k8s.io: Put "https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/openstackclusters.infrastructure.cluster.x-k8s.io": context canceled,Source:EventSource{Component:cluster-capi-operator-capi-installer-apply-client,Host:,},FirstTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,LastTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,} I0806 03:44:00.719743 1 capi_installer_controller.go:309] "CAPI Installer Controller is Degraded" logger="CapiInstallerController" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc" E0806 
03:44:00.719942 1 controller.go:329] "Reconciler error" err="error during reconcile: failed to set conditions for CAPI Installer controller: failed to sync status: failed to update cluster operator status: client rate limiter Wait returned an error: context canceled" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc"
Expected results:
cluster-capi-operator pod is always Running
Additional info:
Please review the following PR: https://github.com/openshift/bond-cni/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running /bin/bridge and trying to access localhost:9000 while the frontend is still starting, the bridge crashes as it cannot find frontend/public/dist/index.html
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
1. Build the OpenShift Console backend and run /bin/bridge 2. Try to access localhost:9000 while it is still starting
Actual results:
Bridge crash
Expected results:
No crash, either return HTTP 404/500 to the browser or serve a fallback page
Additional info:
This is just a minor dev annoyance
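A minimal sketch of the expected graceful behavior: serve a temporary 503 while dist/index.html does not exist yet instead of crashing. This is illustrative only, not the actual bridge code; the path is taken from the report above:
package main
import (
	"net/http"
	"os"
)
const indexPath = "frontend/public/dist/index.html"
func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if _, err := os.Stat(indexPath); err != nil {
			// The frontend build has not produced index.html yet:
			// answer with 503 instead of crashing the process.
			http.Error(w, "console frontend is still building, retry shortly", http.StatusServiceUnavailable)
			return
		}
		http.ServeFile(w, r, indexPath)
	})
	http.ListenAndServe(":9000", nil)
}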
Description of problem:
When a user tries to create a Re-encrypt route, there is no place to upload the 'Destination CA certificate'
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. create Secure route, TLS termination: Re-encrypt 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Information on the Lightspeed modal is not as clear as it could be for users to understand what to do next. Users should also have a very clear way to disable it, and those options are not obvious.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
ci/prow/security is failing on google.golang.org/grpc/metadata
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. Run the ci/prow/security job on a 4.15 PR 2. 3.
Actual results:
Medium severity vulnerability found in google.golang.org/grpc/metadata
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
The MAC mapping validation added in MGMT-17618 caused a regression on ABI.
To avoid this regression, the validation should be relaxed to validate only non-predictable interface names.
We should still make sure at least one MAC address exists in the MAC map, to be able to detect the relevant host.
slack discussion.
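A minimal sketch of the relaxed check described above, assuming legacy ethN-style names are the "non-predictable" ones that still require an explicit MAC mapping; the heuristic, map shape, and function are assumptions, not the assisted-installer code:
package validation
import (
	"fmt"
	"regexp"
)
// Legacy kernel-assigned names (eth0, eth1, ...) are not predictable across
// boots, so only they are required to appear in the MAC-interface map.
var legacyName = regexp.MustCompile(`^eth\d+$`)
// validateMACMap requires at least one mapping (to identify the host) and a
// mapping for every non-predictable interface name; predictable names are
// accepted without one.
func validateMACMap(macByInterface map[string]string, interfaceNames []string) error {
	if len(macByInterface) == 0 {
		return fmt.Errorf("at least one mac-interface mapping is required to detect the relevant host")
	}
	for _, name := range interfaceNames {
		if legacyName.MatchString(name) {
			if _, ok := macByInterface[name]; !ok {
				return fmt.Errorf("mac-interface mapping for interface %s is missing", name)
			}
		}
	}
	return nil
}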
How reproducible:
100%
Steps to reproduce:
Actual results:
error 'mac-interface mapping for interface xxxx is missing'
Expected results:
Installation succeeds and the interfaces are correctly configured.
Description of problem:
When configuring an OpenID IDP that can only be accessed via the data plane, if the hostname of the provider can only be resolved by the data plane, reconciliation of the IDP fails.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Configure an OpenID idp on a HostedCluster with a URL that points to a service in the dataplane (like https://keycloak.keycloak.svc)
Actual results:
The oauth server fails to be reconciled
Expected results:
The oauth server reconciles and functions properly
Additional info:
Follow up to OCPBUGS-37753
kube rebase broke TechPreview hypershift on 4.18 with resource.k8s.io group going to v1alpha3
KAS fails to start with
E1010 19:05:25.175819 1 run.go:72] "command failed" err="group version resource.k8s.io/v1alpha2 that has not been registered"
KASO addressed it here
https://github.com/openshift/cluster-kube-apiserver-operator/pull/1731
Description of problem:
There are two enhancements we could have for cns-migration:
1. We can print an error message when the target datastore is not found; currently it exits as if nothing happened:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source vsanDatastore -destination invalid -volume-file /tmp/pv.txt KubeConfig is: /tmp/kubeconfig I0806 07:59:34.884908 131 logger.go:28] logging successfully to vcenter I0806 07:59:36.078911 131 logger.go:28] ----------- Migration Summary ------------ I0806 07:59:36.078944 131 logger.go:28] Migrated 0 volumes I0806 07:59:36.078960 131 logger.go:28] Failed to migrate 0 volumes I0806 07:59:36.078968 131 logger.go:28] Volumes not found 0
See the source datastore checking:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source invalid -destination Datastorenfsdevqe -volume-file /tmp/pv.txt KubeConfig is: /tmp/kubeconfig I0806 08:02:08.719657 138 logger.go:28] logging successfully to vcenter E0806 08:02:08.749709 138 logger.go:10] error listing cns volumes: error finding datastore invalid in datacenter DEVQEdatacenter
2. If the volume-file has an invalid PV name that is not found, for example at the beginning of the list, the tool exits immediately and all the remaining PVs are skipped; it should continue checking the other PVs.
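A minimal sketch of both enhancements, with hypothetical hooks standing in for the real vCenter calls (this is not the actual cns-migration code):
package migration
import "log"
// migrateVolumes validates the destination datastore up front and then keeps
// going when an individual PV cannot be migrated, instead of aborting the run.
func migrateVolumes(findDatastore func(name string) error, migrateVolume func(pv string) error, destination string, pvs []string) {
	// Enhancement 1: report a clear error when the destination datastore is
	// missing, mirroring the existing source datastore check.
	if err := findDatastore(destination); err != nil {
		log.Fatalf("error finding destination datastore %s: %v", destination, err)
	}
	migrated, skipped := 0, 0
	for _, pv := range pvs {
		// Enhancement 2: a PV that cannot be migrated is logged and skipped,
		// not treated as fatal for the whole run.
		if err := migrateVolume(pv); err != nil {
			log.Printf("skipping volume %s: %v", pv, err)
			skipped++
			continue
		}
		migrated++
	}
	log.Printf("Migrated %d volumes, skipped %d volumes", migrated, skipped)
}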
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
See Description
Description of problem:
A normal user (project admin) visiting the Routes Metrics tab gets only an empty page
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-21-014704
How reproducible:
Always
Steps to Reproduce:
1. normal user has a project and a route 2. visit Networking -> Routes -> Metrics tab 3.
Actual results:
empty page returned
Expected results:
- We may not want to expose the Metrics tab for normal users (compared with the 4.16 behavior) - If the Metrics tab is supposed to be exposed to normal users, then we should return the correct content instead of an empty page
Additional info:
Description of problem:
The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.
Version-Release number of selected component (if applicable):
4.15.z and later
How reproducible:
Always when AlertmanagerConfig is enabled
Steps to Reproduce:
1. Enable UWM with AlertmanagerConfig enableUserWorkload: true alertmanagerMain: enableUserAlertmanagerConfig: true 2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file) 3. Wait for a couple of minutes.
Actual results:
Monitoring ClusterOperator goes Degraded=True.
Expected results:
No error
Additional info:
The Prometheus operator logs show that it doesn't understand the proxy_from_environment field. The newer proxy fields are only supported since Alertmanager v0.26.0, which corresponds to OCP 4.15 and above.
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/753
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If a release does not contain the kubevirt coreos container image and the kubeVirtContainer flag is set to true, oc-mirror fails to continue.
Version-Release number of selected component (if applicable):
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-280-g8a42369", GitCommit:"8a423691", GitTreeState:"clean", BuildDate:"2024-08-03T08:02:06Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. use imageSetConfig.yaml as shown below 2. Run command oc-mirror -c clid-179.yaml file://clid-179 --v2 3.
Actual results:
fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/clid-99.yaml file://CLID-412 --v2 2024/08/03 09:24:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/08/03 09:24:38 [INFO] : 👋 Hello, welcome to oc-mirror 2024/08/03 09:24:38 [INFO] : ⚙️ setting up the environment for you... 2024/08/03 09:24:38 [INFO] : 🔀 workflow mode: mirrorToDisk 2024/08/03 09:24:38 [INFO] : 🕵️ going to discover the necessary images... 2024/08/03 09:24:38 [INFO] : 🔍 collecting release images... 2024/08/03 09:24:44 [INFO] : kubeVirtContainer set to true [ including : ] 2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty 2024/08/03 09:24:44 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty
Expected results:
If the kubevirt coreos container does not exist in a release, oc-mirror should skip it and continue mirroring other operators, but should not fail.
Additional info:
[fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-99.yaml apiVersion: mirror.openshift.io/v2alpha1 kind: ImageSetConfiguration mirror: platform: channels: - name: stable-4.12 minVersion: 4.12.61 maxVersion: 4.12.61 kubeVirtContainer: true operators: - catalog: oci:///test/ibm-catalog - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: devworkspace-operator minVersion: "0.26.0" - name: nfd maxVersion: "4.15.0-202402210006" - name: cluster-logging minVersion: 5.8.3 maxVersion: 5.8.4 - name: quay-bridge-operator channels: - name: stable-3.9 minVersion: 3.9.5 - name: quay-operator channels: - name: stable-3.9 maxVersion: "3.9.1" - name: odf-operator channels: - name: stable-4.14 minVersion: "4.14.5-rhodf" maxVersion: "4.14.5-rhodf" additionalImages: - name: registry.redhat.io/ubi8/ubi:latest - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27 - name: quay.io/openshifttest/scratch@sha256:b045c6ba28db13704c5cbf51aff3935dbed9a692d508603cc80591d89ab26308
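For the expected skip-and-continue behavior described above, a minimal sketch; the function and its surroundings are hypothetical and do not reflect oc-mirror's actual collector code:
package release
import "log"
// kubeVirtImages returns the kubevirt coreos container reference to mirror,
// or nothing when the release payload does not ship one, so the rest of the
// mirror run can continue instead of failing.
func kubeVirtImages(ref string) []string {
	if ref == "" {
		log.Println("kubeVirtContainer is set to true, but this release has no kubevirt coreos container; skipping")
		return nil
	}
	return []string{ref}
}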
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/295
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The Azure Disk CSI driver operator runs a node DaemonSet that exposes CSI driver metrics on loopback, but there is no kube-rbac-proxy in front of it and no Service / ServiceMonitor for it. Therefore OCP doesn't collect these metrics.
Description of problem:
In 4.16 we can collapse and expand the "Getting started resources" section in the Administrator perspective. But in earlier versions we could directly remove this tab with [X], which is not possible in 4.16. Only expand and collapse are available; the option to remove the tab, as in previous versions, is missing.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Go to Web console. Click on the "Getting started resources." 2. Then you can expand and collapse this tab. 3. But there is no option to directly remove this tab.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/90
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
build openshift/ovn-kubernetes#2291
How reproducible:
Always
Steps to Reproduce:
1. Create a ns ns1
2. Create a CRD in ns1
% oc get UserDefinedNetwork -n ns1 -o yaml apiVersion: v1 items: - apiVersion: k8s.ovn.org/v1 kind: UserDefinedNetwork metadata: creationTimestamp: "2024-09-09T08:34:49Z" finalizers: - k8s.ovn.org/user-defined-network-protection generation: 1 name: udn-network namespace: ns1 resourceVersion: "73943" uid: c923b0b1-05b4-4889-b076-c6a28f7353de spec: layer3: role: Primary subnets: - cidr: 10.200.0.0/16 hostSubnet: 24 topology: Layer3 status: conditions: - lastTransitionTime: "2024-09-09T08:34:49Z" message: NetworkAttachmentDefinition has been created reason: NetworkAttachmentDefinitionReady status: "True" type: NetworkReady kind: List metadata: resourceVersion: ""
3. Create a service and pods in ns1
% oc get svc -n ns1 NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE test-service ClusterIP 172.30.16.88 <none> 27017/TCP 5m32s % oc get pods -n ns1 NAME READY STATUS RESTARTS AGE test-rc-f54tl 1/1 Running 0 5m4s test-rc-lhnd7 1/1 Running 0 5m4s % oc exec -n ns1 test-rc-f54tl -- ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:1b brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.27/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:21b/64 scope link valid_lft forever preferred_lft forever 3: ovn-udn1@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:c8:03:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.200.3.3/24 brd 10.200.3.255 scope global ovn-udn1 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fec8:303/64 scope link valid_lft forever preferred_lft forever 4. Restart ovn pods {code:java} % oc delete pods --all -n openshift-ovn-kubernetes pod "ovnkube-control-plane-76fd6ddbf4-j69j8" deleted pod "ovnkube-control-plane-76fd6ddbf4-vnr2m" deleted pod "ovnkube-node-5pd5w" deleted pod "ovnkube-node-5r9mg" deleted pod "ovnkube-node-6bdtx" deleted pod "ovnkube-node-6v5d7" deleted pod "ovnkube-node-8pmpq" deleted pod "ovnkube-node-cffld" deleted
Actual results: {code:java} % oc get pods -n openshift-ovn-kubernetes NAME READY STATUS RESTARTS AGE ovnkube-control-plane-76fd6ddbf4-9cklv 2/2 Running 0 9m22s ovnkube-control-plane-76fd6ddbf4-gkmlg 2/2 Running 0 9m22s ovnkube-node-bztn5 7/8 CrashLoopBackOff 5 (21s ago) 9m19s ovnkube-node-qhjsw 7/8 Error 5 (2m45s ago) 9m18s ovnkube-node-t5f8p 7/8 Error 5 (2m32s ago) 9m20s ovnkube-node-t8kpp 7/8 Error 5 (2m34s ago) 9m19s ovnkube-node-whbvx 7/8 Error 5 (2m35s ago) 9m20s ovnkube-node-xlzlh 7/8 CrashLoopBackOff 5 (14s ago) 9m18s ovnkube-controller: Container ID: cri-o://977dd8c17320695b1098ea54996bfad69c14dc4219a91dfd4354c818ea433cac Image: registry.build05.ci.openshift.org/ci-ln-y1ypd82/stable@sha256:3110151b89e767644c01c8ce2cf3fec4f26f6d6e011262d0988c1d915d63355f Image ID: registry.build05.ci.openshift.org/ci-ln-y1ypd82/stable@sha256:3110151b89e767644c01c8ce2cf3fec4f26f6d6e011262d0988c1d915d63355f Port: 29105/TCP Host Port: 29105/TCP Command: /bin/bash -c set -xe . /ovnkube-lib/ovnkube-lib.sh || exit 1 start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: :205] Sending *v1.Node event handler 7 for removal I0909 08:45:58.537155 170668 factory.go:542] Stopping watch factory I0909 08:45:58.537167 170668 handler.go:219] Removed *v1.Node event handler 7 I0909 08:45:58.537185 170668 handler.go:219] Removed *v1.Namespace event handler 1 I0909 08:45:58.537198 170668 handler.go:219] Removed *v1.Namespace event handler 5 I0909 08:45:58.537206 170668 handler.go:219] Removed *v1.EgressIP event handler 8 I0909 08:45:58.537207 170668 handler.go:219] Removed *v1.EgressFirewall event handler 9 I0909 08:45:58.537187 170668 handler.go:219] Removed *v1.Node event handler 10 I0909 08:45:58.537219 170668 handler.go:219] Removed *v1.Node event handler 2 I0909 08:45:58.538642 170668 network_attach_def_controller.go:126] [network-controller-manager NAD controller]: shutting down I0909 08:45:58.538703 170668 secondary_layer3_network_controller.go:433] Stop secondary layer3 network controller of network ns1.udn-network I0909 08:45:58.538742 170668 services_controller.go:243] Shutting down controller ovn-lb-controller for network=ns1.udn-network I0909 08:45:58.538767 170668 obj_retry.go:432] Stop channel got triggered: will stop retrying failed objects of type *v1.Node I0909 08:45:58.538754 170668 obj_retry.go:432] Stop channel got triggered: will stop retrying failed objects of type *v1.Pod E0909 08:45:58.5 Exit Code: 1 Started: Mon, 09 Sep 2024 16:44:57 +0800 Finished: Mon, 09 Sep 2024 16:45:58 +0800 Ready: False Restart Count: 5 Requests: cpu: 10m memory: 600Mi
Expected results:
ovn pods should not crash
Additional info:
Description of problem:
Deploy a 4.18 cluster on a PowerVS zone where LoadBalancers are slow to create. We are called with InfraReady. We then create DNS records for the LBs. However, only the public LB exists. So the cluster fails to deploy. The internal LB does eventually complete.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Occasionally, on a zone with slow LB creation.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/125
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a kubevirt-csi pod runs on a worker node of a Guest cluster, the underlying PVC from the infra/host cluster is attached to the Virtual Machine that is the worker node of the Guest cluster. That works well, but only until the VM is rebooted. After the VM is power cycled for some reason, the volumeattachment on the Guest cluster is still there and shows as attached. [guest cluster]# oc get volumeattachment NAME ATTACHER PV NODE ATTACHED AGE csi-976b6b166ef7ea378de9a350c9ef427c23e8c072dc6e76a392241d273c3effdb csi.kubevirt.io pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b hostedcluster2-rlq9m-z2x88 true 39m But the VM does not have the hotplugged disk anymore (its not a persistent hotplug). Its not attached at all. It only has its rhcos disk and cloud-init after the reboot: [host cluster]# oc get vmi -n clusters-hostedcluster2 hostedcluster2-rlq9m-z2x88 -o yaml | yq '.status.volumeStatus' - name: cloudinitvolume size: 1048576 target: vdb - name: rhcos persistentVolumeClaimInfo: accessModes: - ReadWriteOnce capacity: storage: 32Gi claimName: hostedcluster2-rlq9m-z2x88-rhcos filesystemOverhead: "0" requests: storage: "34359738368" volumeMode: Block target: vda The result is all workloads with PVCs now fail to start, as the hotplug is not triggered again. The worker node VM cannot find the disk: 26s Warning FailedMount pod/mypod MountVolume.MountDevice failed for volume "pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b" : rpc error: code = Unknown desc = couldn't find device by serial id So workload pods cannot start.
Version-Release number of selected component (if applicable):
OCP 4.17.3 CNV 4.17.0 MCE 2.7.0
How reproducible:
Always
Steps to Reproduce:
1. Have a pod running with a PV from kubevirt-csi in the guest cluster 2. Shutdown the Worker VM running the Pod and start it again
Actual results:
Workloads fail to start after VM reboot
Expected results:
Hotplug the disk again and let workloads start
Additional info:
Description of problem:
When running the 4.17 installer QE full function test, the following amd64 instance types were detected and tested successfully, so append them to the installer doc[1]: * standardBasv2Family * StandardNGADSV620v1Family * standardMDSHighMemoryv3Family * standardMIDSHighMemoryv3Family * standardMISHighMemoryv3Family * standardMSHighMemoryv3Family [1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
To summarize, when we meet the following three conditions, baremetal nodes cannot boot due to a hostname resolution failure.
According to the following update, the provisioning service checks the BMC address scheme on the target and provides a matching URL for the installation media:
When we create a BMH resource, spec.bmc.address will be an URL of the BMC.
However, when we put a hostname instead of an IP address in the spec.bmc.address like the following example,
<Example BMH definition>
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
:
spec:
bmc:
address: redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1
we observe the following error.
$ oc logs -n openshift-machine-api metal3-baremetal-operator-6779dff98c-9djz7 {"level":"info","ts":1721660334.9622784,"logger":"provisioner.ironic","msg":"Failed to look up the IP address for BMC hostname","host":"myenv~mybmh","hostname":"redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1"}
Because of name resolution failure, baremetal-operator cannot determine if the BMC is IPv4 or IPv6.
Therefore, the IP scheme falls back to IPv4 and ISO images are exposed via an IPv4 address even if the BMC is IPv6 single stack.
In this case, the IPv6 BMC cannot access the ISO image over IPv4; we observe error messages like the following example, and the baremetal host cannot boot from the ISO.
<Error message on iDRAC> Unable to locate the ISO or IMG image file or folder in the network share location because the file or folder path or the user credentials entered are incorrect
The issue is caused by the following implementation.
The following line passes `p.bmcAddress`, which is the whole URL; that is why the name resolution fails.
I think we should pass `parsedURL.Hostname()` instead, which is the hostname part of the URL.
https://github.com/metal3-io/baremetal-operator/blob/main/pkg/provisioner/ironic/ironic.go#L657
ips, err := net.LookupIP(p.bmcAddress)
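A minimal sketch of the suggested fix: parse the BMC address first and resolve only its host portion. This is illustrative, not the exact baremetal-operator patch:
package ironic
import (
	"net"
	"net/url"
)
// bmcIsIPv6 resolves only the hostname part of the BMC address, so a URL such
// as redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1 no longer
// causes the lookup itself to fail.
func bmcIsIPv6(bmcAddress string) (bool, error) {
	parsedURL, err := url.Parse(bmcAddress)
	if err != nil {
		return false, err
	}
	ips, err := net.LookupIP(parsedURL.Hostname())
	if err != nil {
		return false, err
	}
	for _, ip := range ips {
		if ip.To4() == nil {
			return true, nil
		}
	}
	return false, nil
}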
Version-Release number of selected component (if applicable):
We observe this issue on OCP 4.14 and 4.15. But I think this issue occurs even in the latest releases.
How reproducible:
Steps to Reproduce:
Actual results:
Name resolution fails and the baremetal host cannot boot
Expected results:
Name resolution works and the baremetal host can boot
Additional info:
Description of problem:
HyperShift doesn't allow configuring the Failure Domains for node pools, which would help place machines into the desired availability zone.
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/91
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-machine-api-provider-gcp-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Creating and destroying transit gateways (TG) during CI testing is costing an abnormal amount of money. Since the monetary cost of creating a TG is high, provide support for a user-created TG when creating an OpenShift cluster.
Version-Release number of selected component (if applicable):
all
How reproducible:
always
Description of problem:
https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
For conditional updates, status.conditionalUpdates.release is also a Release type (https://github.com/openshift/console/blob/master/frontend/public/module/k8s/types.ts#L812-L815), which will also trigger an Admission Webhook Warning
Version-Release number of selected component (if applicable):
4.18.0-ec.2
How reproducible:
Always
Steps to Reproduce:
1.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/80
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating an OCP cluster on AWS and selecting "publish: Internal," the ingress operator may create external LB mappings to external subnets. This can occur if public subnets were specified in the install-config during installation. https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-private.html#private-clusters-about-aws_installing-aws-private A configuration validation should be added to the installer.
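A minimal sketch of the kind of install-config validation suggested above; the types and function are hypothetical placeholders, not the installer's actual API:
package validation
import "fmt"
// subnet is a hypothetical, simplified description of a discovered subnet.
type subnet struct {
	ID     string
	Public bool
}
// validateInternalPublish rejects configurations that declare publish:
// Internal while also supplying public subnets, which would otherwise let the
// ingress operator map load balancers onto them.
func validateInternalPublish(publish string, subnets []subnet) error {
	if publish != "Internal" {
		return nil
	}
	for _, s := range subnets {
		if s.Public {
			return fmt.Errorf("publish is Internal but subnet %s is public; remove public subnets from the install-config", s.ID)
		}
	}
	return nil
}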
Version-Release number of selected component (if applicable):
4.14+ probably older versions as well.
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1714986876688959
Description of problem:
https://issues.redhat.com//browse/OCPBUGS-31919 partially fixed an issue consuming the test image from a custom registry. The fix is about consuming in the test binary the pull-secret of the cluster under tests. To complete it we have to do the same trusting custom CA as the cluster under test. Without that, if the test image is exposed by a registry where the TLS cert is signed by a custom CA, the same tests will fail as for: { fail [github.com/openshift/origin/test/extended/operators/certs.go:120]: Unexpected error: <*errors.errorString | 0xc0023105c0>: unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342: StdOut> error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority StdErr> error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority exit status 1 { s: "unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342:\nStdOut>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nStdErr>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nexit status 1\n", } occurred Ginkgo exit error 1: exit with code 1}
Version-Release number of selected component (if applicable):
release-4.16, release-4.17 and master branches in origin.
How reproducible:
Always
Steps to Reproduce:
1. try to run the test suite against a cluster where the OCP release (and the test image) comes from a private registry with a cert signed by a custom CA 2. 3.
Actual results:
3 failing tests: : [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] expand_more : [sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel] expand_more : [sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel] expand_more
Expected results:
No failing tests
Additional info:
OCPBUGS-31919 partially fixed it by having the test binary download the pull secret from the cluster under test. But in order to have it fully working, we also have to trust the custom CAs trusted by the cluster under test.
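A minimal sketch of how the cluster's additional trusted CA bundle could be extracted so the test binary can trust it too (the configmap name is whatever the cluster's Proxy object references; the output path is illustrative):
$ CA_CM=$(oc get proxy/cluster -o jsonpath='{.spec.trustedCA.name}')
$ oc get configmap "$CA_CM" -n openshift-config -o jsonpath='{.data.ca-bundle\.crt}' > /tmp/cluster-trusted-ca.crt
# the bundle could then be added to the CA pool used when running 'oc adm release info' against the private registry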
Description of problem:
node-joiner pod does not honour the cluster-wide proxy settings
Version-Release number of selected component (if applicable):
OCP 4.16.6
How reproducible:
Always
Steps to Reproduce:
1. Configure an OpenShift cluster-wide proxy according to https://docs.openshift.com/container-platform/4.16/networking/enable-cluster-wide-proxy.html and add Red Hat URLs (quay.io et al.) to the proxy allow list.
2. Add a node to a cluster using a node-joiner pod, following https://github.com/openshift/installer/blob/master/docs/user/agent/add-node/add-nodes.md
Actual results:
Error retrieving the images on quay.io time=2024-08-22T08:39:02Z level=error msg=Release Image arch could not be found: command '[oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd -o=go-template={{if and .metadata.metadata (index . "metadata" "metadata" "release.openshift.io/architecture")}}{{index . "metadata" "metadata" "release.openshift.io/architecture"}}{{else}}{{.config.architecture}}{{end}} --insecure=true --registry-config=/tmp/registry-config1164077466]' exited with non-zero exit code 1:time=2024-08-22T08:39:02Z level=error msg=error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd: Get "http://quay.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Expected results:
node-joiner is able to download the images using the proxy
Additional info:
By allowing full direct internet access, without a proxy, the node-joiner pod is able to download the image from quay.io.
So there is a strong suspicion that the HTTP timeout error above comes from the pod not being able to use the proxy.
Restricted environments where external internet access is only allowed through a proxy allow list are quite common in corporate environments.
Please consider honouring the OpenShift proxy configuration.
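As a quick check, the cluster-wide proxy settings the node-joiner pod would be expected to pick up can be read from the Proxy object; a minimal sketch:
$ oc get proxy/cluster -o jsonpath='{.status.httpProxy}{"\n"}{.status.httpsProxy}{"\n"}{.status.noProxy}{"\n"}'
# the node-joiner pod would be expected to export these as HTTP_PROXY / HTTPS_PROXY / NO_PROXY for its image pulls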
Description of problem:
Circular dependencies in the OCP Console codebase prevent migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.
When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:
However, the Machine manifests for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).
When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.
This results in the cluster coming up with: three control plane Nodes, with master-0 and master-2 having no backing Machines; three control plane Machines, with only master-1 having a Node link and the other two listed in Provisioned state but with no Nodes; and 5 GCP VMs for these control plane nodes:
This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.
4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.
4.17.0-rc3 - 4.17.0-rc6 have all been identified as having this issue.
100%
I'm unsure how to replicate this in a vanilla cluster install, but via OSD:
Example:
$ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp
Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce.
Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.
A standard 3 control-plane-node cluster is created.
We're unsure what it is about the two reported Zones or the difference between the primary OSD GCP project and customer-supplied Projects that has an effect.
The only thing we've noticed is the install-config has the order backwards for compute nodes, but not for control plane nodes:
{ "controlPlane": [ "us-west2-a", "us-west2-b", "us-west2-c" ], "compute": [ "us-west2-c", <--- inverted order. Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow? "us-west2-b", "us-west2-a" ], "platform": { "defaultMachinePlatform": { <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here "osDisk": { "DiskSizeGB": 0, "diskType": "" }, "secureBoot": "Enabled", "type": "" }, "projectID": "anishpatel", "region": "us-west2" } }
Since we see the divergence at the asset/manifest level, we should be able to reproduce with just an openshift-install create manifests, followed by grep -r zones: or something, without having to wait for an actual install attempt to come up and fail.
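A minimal sketch of that manifest-level reproduction (the directory name is arbitrary and the manifest file names are assumptions based on the installer's usual openshift/ layout):
$ openshift-install create manifests --dir ./gcp-zone-check
$ grep -r 'zone' ./gcp-zone-check/openshift/99_openshift-cluster-api_master-machines-*.yaml
# compare the zone of each master-N Machine manifest against the zones listed in install-config.yaml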
Description of problem:
azure-disk-csi-driver doesn't use registryOverrides
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Set registry override on the CPO
2. Watch that azure-disk-csi-driver continues to use the default registry
Actual results:
azure-disk-csi-driver uses default registry
Expected results:
azure-disk-csi-driver uses the mirrored registry
Additional info:
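One way to observe the behaviour is to compare the images used by the driver against the configured override; a hedged sketch (the deployment name and hosted control plane namespace are assumptions):
$ oc -n <hcp-namespace> get deployment azure-disk-csi-driver-controller \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}'
# with a registry override set on the CPO, these images would be expected to point at the mirror, not the default registry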
See this comment for background:
https://github.com/openshift/origin/blob/6b07170abad135bc7c5b22c78b2079ceecfc8b51/test/extended/etcd/vertical_scaling.go#L75-L86
The current vertical scaling test triggers the CPMSO to create a new machine by first deleting an existing machine. In that test we can't validate that the new machine is scaled up before the old one is removed.
Another test we could add is to first disable the CPMSO, then delete an existing machine and manually create a new one, like we did before the CPMSO existed.
That way we can validate that the scale-down does not happen before the scale-up event.
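A hedged sketch of the manual flow the proposed test would script (machine names are placeholders):
$ oc -n openshift-machine-api delete controlplanemachineset.machine.openshift.io cluster
# deleting the CPMS deactivates it; it is regenerated in an Inactive state
$ oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master
# copy an existing master Machine manifest, rename it, apply it, and verify the new machine scales up;
# only then delete the old Machine, to check that scale-down never precedes scale-up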
Description of problem:
An error is thrown by the broker form view for a pre-populated application name. The error reads: formData.application.selectedKey must be a `string` type, but the final value was: `null`. If "null" is intended as an empty value be sure to mark the schema as `.nullable()`
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Install serverless operator 2. Create any application in a namespace 3. Now open broker in form view
Actual results:
You have to explicitly select "no application" or another application for the form view to work
Expected results:
Error should not be thrown for the appropriate value
Additional info:
Attaching a video of the error
https://drive.google.com/file/d/1WRp2ftMPlCG0ZiHZwC0QfleES3iVHObq/view?usp=sharing
Note: also notify the Hive team we're doing these bumps.
Maxim Patlasov pointed this out in STOR-1453 but still somehow we missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.
It is possible to set a custom TLSSecurityProfile without minTLSversion:
$ oc edit apiserver cluster
...
spec:
tlsSecurityProfile:
type: Custom
custom:
ciphers:
- ECDHE-ECDSA-CHACHA20-POLY1305
- ECDHE-ECDSA-AES128-GCM-SHA256
This causes the controller to crash loop:
$ oc get pods -n openshift-cluster-csi-drivers
NAME READY STATUS RESTARTS AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2 6/11 CrashLoopBackOff 10 (18s ago) 37s
...
because the `${TLS_MIN_VERSION}` placeholder is never replaced:
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
The observed config in the ClusterCSIDriver shows an empty string:
$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
"targetcsiconfig": {
"servingInfo":
}
}
which means minTLSVersion is empty when we get to this line, and the string replacement is not done:
So it seems we have a couple of options:
1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
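Until one of those options lands, a hedged workaround sketch is to always include minTLSVersion in the custom profile so the placeholder gets substituted:
$ oc patch apiserver/cluster --type=merge -p '{"spec":{"tlsSecurityProfile":{"type":"Custom","custom":{"minTLSVersion":"VersionTLS12","ciphers":["ECDHE-ECDSA-CHACHA20-POLY1305","ECDHE-ECDSA-AES128-GCM-SHA256"]}}}}'
# the CSI driver controller pods should then stop crash looping once the operator re-renders the deployment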
Description of problem:
Azure HostedClusters are failing in OCP 4.17 due to issues with the cluster-storage-operator.
- lastTransitionTime: "2024-05-29T19:58:39Z"
  message: 'Unable to apply 4.17.0-0.nightly-multi-2024-05-29-121923: the cluster operator storage is not available'
  observedGeneration: 2
  reason: ClusterOperatorNotAvailable
  status: "True"
  type: ClusterVersionProgressing
I0529 20:05:21.547544 1 status_controller.go:218] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2024-05-29T20:02:00Z","message":"AzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: \"node_service.yaml\" (string): namespaces \"clusters-test-case4\" not found\nAzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: ","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverGuestStaticResourcesController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2024-05-29T20:04:15Z","message":"AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"True","type":"Progressing"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"False","type":"Available"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"},{"lastTransitionTime":"2024-05-29T19:59:00Z","reason":"NoData","status":"Unknown","type":"EvaluationConditionsDetected"}]}} I0529 20:05:21.566215 1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"azure-cloud-controller-manager", UID:"205a4307-67e4-481e-9fee-975b2c5c40fb", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/storage changed: Progressing message changed from "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nAzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods" to "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"
On the HostedCluster itself, these errors with the csi pods not coming up are:
% k describe pod/azure-disk-csi-driver-node-5hb24 -n openshift-cluster-csi-drivers | grep fail
  Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
  Liveness: http-get http://:rhealthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
  Warning FailedMount 2m (x28 over 42m) kubelet MountVolume.SetUp failed for volume "metrics-serving-cert" : secret "azure-disk-csi-driver-node-metrics-serving-cert" not found
There was an error with the CO as well:
storage 4.17.0-0.nightly-multi-2024-05-29-121923 False True True 49m AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Every time
Steps to Reproduce:
1. Create a HC with a 4.17 nightly
Actual results:
Azure HC does not complete; nodes do join NodePool though
Expected results:
Azure HC should complete
Additional info:
Refactor the name to Dockerfile.ocp as a better, version-independent alternative
Description of problem:
Console dynamic plugins may declare their extensions using TypeScript, e.g. Kubevirt plugin-extensions.ts module.
The EncodedExtension type should be exposed directly via Console plugin SDK, instead of plugins having to import this type from the dependent OpenShift plugin SDK packages.
Description of problem:
IPI Baremetal - BootstrapVM machineNetwork interface restart impacts pulling image and causes ironic service to fail
Version-Release number of selected component (if applicable):
4.16.Z but also seen this in 4.15 and 4.17
How reproducible:
50% of our jobs fail because of this.
Steps to Reproduce:
1. Prepare an IPI baremetal deployment (we have the provisioning network disabled, we are using Virtual Media)
2. Start a deployment, wait for the bootstrapVM to start running and log in via SSH
3. Run the command: journalctl -f | grep "Dependency failed for Ironic baremetal deployment service"
4. If the command above returns something, then print around 70 lines before and check for the NetworkManager entries in the log about the interface in the baremetal network getting restarted, and an error about pulling an image because DNS is not reachable.
Actual results:
Deployments fail 50% of the time, bootstrapVM is not able to pull an image because main machineNetwork interface is getting restarted and DNS resolution fails.
Expected results:
Deployments work 100% of the time, bootstrapVM is able to pull any image because machineNetwork interface is NOT restarted while images are getting pulled.
Additional info:
We have a CI system to test OCP 4.12 through 4.17 deployments and this issue started to occur a few weeks ago, mainly in 4.15, 4.16, and 4.17.
In this log extract of a deployment with OCP 4.16.0-0.nightly-2024-07-07-171226 you can see the image pull error because the registry name cannot be resolved; in the lines before and after you can see that the machineNetwork interface is getting restarted, causing the lack of DNS resolution.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Build Ironic environment. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Extract Machine OS Images... Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Provisioning interface... Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1... Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.3899] audit: op="connection-update" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" name="Wired connection 1" args="ipv4.addresses" pid=3812 uid=0 result="success" Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <warn> [1720480515.4008] keyfile: load: "/etc/NetworkManager/system-connections/nmconnection": failed to load connection: invalid connection: connection.type: property is missing Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4018] audit: op="connections-reload" pid=3817 uid=0 result="success" Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4159] agent-manager: agent[543677841603162b,:1.67/nmcli-connect/0]: agent registered Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4164] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4170] manager: NetworkManager state is now CONNECTED_LOCAL Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4172] device (ens3): disconnecting for new activation request. 
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4172] audit: op="connection-activate" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" name="Wired connection 1" pid=3821 uid=0 result="success" Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4200] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4214] dhcp4 (ens3): canceled DHCP transaction Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4215] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4215] dhcp4 (ens3): state changed no lease Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4216] dhcp6 (ens3): canceled DHCP transaction Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4216] dhcp6 (ens3): activation: beginning transaction (timeout in 45 seconds) Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4216] dhcp6 (ens3): state changed no lease Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1: (Mirrors also failed: [registry.dfwt5g.lab:4443/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226@sha256:1370c041f0ecf4f6590c1 2f3e1b49111aa35052140e7fdd79964c32db47074c1: Get "https://registry.dfwt5g.lab:4443/v2/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226/manifests/sha256:1370c041f0ecf4f6590c12f3e1b49111a a35052140e7fdd79964c32db47074c1": dial tcp 192.168.5.9:4443: connect: network is unreachable]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b491 11aa35052140e7fdd79964c32db47074c1: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 192.168.32.8:53: dial udp 192.168.32.8:53: connect: network is unreachable Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Main process exited, code=exited, status=125/n/a Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2607:b500:410:7700::1 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4309] policy: set-hostname: set hostname to 'localhost.localdomain' (no hostname found) Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4309] device (ens3): Activation: starting connection 'Wired connection 1' (bf7e41e3-f1ea-3eed-98fd-c3d021e35d11) Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2001:470:f1c4:1::42 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4315] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 
2603:c020:0:8369::feeb:dab offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4317] manager: NetworkManager state is now CONNECTING Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2600:3c01:e000:7e6::123 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4317] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 192.168.32.8 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4322] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4326] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4347] dhcp4 (ens3): state changed new lease, address=192.168.32.28 Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4350] policy: set 'Wired connection 1' (ens3) as default for IPv4 routing and DNS Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4385] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Removed source 192.168.32.8 Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Failed with result 'exit-code'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Failed to start Extract Machine OS Images. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Customized Machine OS Image Server. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Ironic baremetal deployment service. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: ironic.service: Job ironic.service/start failed with result 'dependency'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Metal3 deployment service. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: metal3-baremetal-operator.service: Job metal3-baremetal-operator.service/start failed with result 'dependency'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: image-customization.service: Job image-customization.service/start failed with result 'dependency'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Ironic ramdisk logger... Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Update master BareMetalHosts with introspection data... 
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: NM local-dns-prepender triggered by ens3 dhcp4-change. Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: <13>Jul 8 23:15:15 root: NM local-dns-prepender triggered by ens3 dhcp4-change. Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3901]: NM resolv-prepender: Checking for nameservers in /var/run/NetworkManager/resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3903]: nameserver 192.168.32.8 Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3905]: Failed to get unit file state for systemd-resolved.service: No such file or directory Mon 2024-07-08 23:15:15 UTC localhost.localdomain root[3911]: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3911]: <13>Jul 8 23:15:15 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: <13>Jul 8 23:15:15 root: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5372] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain provisioning-interface.service[3821]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveMon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5375] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5377] manager: NetworkManager state is now CONNECTED_SITE Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5379] device (ens3): Activation: successful, device activated. Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5383] manager: NetworkManager state is now CONNECTED_GLOBAL Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Provisioning interface.
Please review the following PR: https://github.com/openshift/router/pull/624
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.
According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json)
> "Aug 16 16:43:17.672 - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"
The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:
> 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
...
I0816 16:43:17.650083 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
...
The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had already replaced the other Pod (Metrics API backend runs on 2 Pods), some minutes before, was available.
According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time:
> I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:
{10250 [10.128.2.31 10.131.0.12] []}]
I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"
AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods.
For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40
Bad decisions may be made if the services and/or endpoints caches are stale.
Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:
> $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out
2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created
2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update
We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23", an UPDATE was received/processed, which may have helped the apiserver sync its endpoints cache and/or choose a healthy Pod.
Also, no update was triggered when the second Pod was deleted at "16:43:17", which may explain the stale 10.128.2.31 endpoints entry on the apiserver side.
To summarize, I can see two problems here (maybe one is the consequence of the other):
A Pod was deleted and an Endpoint pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues https://github.com/kubernetes/kubernetes/issues/125638
The apiserver resolver had an endpoints cache with one stale and one fresh entry, but it kept trying to reach the stale entry 4-5 times in a row, OR
The endpoints were updated (at around 16:39:XX, when the first Pod was rolled out, see above), but the apiserver resolver cache missed that, ended up with 2 stale entries, and had to wait until around 16:43:23 (when the new Pod that replaced the second one was created, see above) to sync and replace them with 2 fresh entries.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. See "Description of problem" 2. 3.
Actual results:
Expected results:
The kube-aggregator should detect stale APIService endpoints.
Additional info:
The kube-aggregator proxies requests to a stale Endpoints entry/Pod, which makes Metrics API requests falsely fail.
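For anyone triaging similar disruptions, a minimal sketch of comparing what the aggregator should resolve to against the live backends at a given moment (the metrics-server pod label selector is an assumption):
$ oc get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.spec.service.namespace}/{.spec.service.name}{"\n"}'
$ oc -n openshift-monitoring get endpoints metrics-server -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'
$ oc -n openshift-monitoring get pods -l app.kubernetes.io/name=metrics-server -o wide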
Description of problem:
While running batches of 500 managedclusters upgrading via Image-Based Upgrades (IBU) via RHACM and TALM, frequently the haproxy load balancer configured by default for a bare metal cluster in the openshift-kni-infra namespace would run out of connections despite being tuned for 20,000 connections.
Version-Release number of selected component (if applicable):
Hub OCP - 4.16.3 Spoke Clusters - Originally deployed 4.14.31 then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3 ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48 TALM - 4.16.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
While monitoring the current connections during a CGU batch of 500 SNOs doing an IBU to a new OCP version, I would observe the oc cli returning "net/http: TLS handshake timeout". Monitoring the current connections via rsh into the active haproxy pod:
# oc -n openshift-kni-infra rsh haproxy-d16-h10-000-r650
Defaulted container "haproxy" out of: haproxy, haproxy-monitor, verify-api-int-resolvable (init)
sh-5.1$ echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock | grep CurrConns
CurrConns: 20000
sh-5.1$
While capturing this value every 10 or 15 seconds I would observe a high fluctuation of the number of connections, such as:
Thu Aug 8 17:51:57 UTC 2024  CurrConns: 17747
Thu Aug 8 17:52:02 UTC 2024  CurrConns: 18413
Thu Aug 8 17:52:07 UTC 2024  CurrConns: 19147
Thu Aug 8 17:52:12 UTC 2024  CurrConns: 19785
Thu Aug 8 17:52:18 UTC 2024  CurrConns: 20000
Thu Aug 8 17:52:23 UTC 2024  CurrConns: 20000
Thu Aug 8 17:52:28 UTC 2024  CurrConns: 20000
Thu Aug 8 17:52:33 UTC 2024  CurrConns: 20000
A brand new hub cluster without any spoke clusters and without ACM installed runs between 53-56 connections; after installing ACM I would see the connection count rise to 56-60 connections. In a smaller environment with only 297 managedclusters I observed between 1410-1695 connections. I do not have a measurement of approximately how many connections we need in the large environment, however it clearly fluctuates, and the initiation of the IBU upgrades seems to spike it to the current default limit, triggering the timeout error message.
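A minimal sketch of scripting that sampling from outside the pod (the pod name is taken from the example above; the interval is arbitrary):
$ while true; do
    echo -n "$(date -u)  "
    oc -n openshift-kni-infra exec haproxy-d16-h10-000-r650 -c haproxy -- \
      sh -c 'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' | grep CurrConns
    sleep 10
  done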
The story is to track i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
CSS overrides in the OpenShift console are applied to ACM dropdown menu
Version-Release number of selected component (if applicable):
4.14, 4.15
How reproducible:
Always
Steps to Reproduce:
1. In ACM, view Governance > Policies. 2. Open the Actions dropdown
Actual results:
Actions are indented and preceded by bullets
Expected results:
Dropdown menu style should not be affected
Additional info:
Description of problem:
while applying "oc adm upgrade --to-multi-arch" certain flags such as --to and --to-image are blocked with error message such as: error: --to-multi-arch may not be used with --to or --to-image however if one applies --force, or --to-latest, no error message is generated, only: Requested update to multi cluster architecture and the flags are omitted silently, applying .spec: desiredUpdate: architecture: Multi force: false <- --force silently have no effect here image: version: 4.13.0-ec.2 <- --to-latest omitted silently either
Version-Release number of selected component (if applicable):
4.13.0-ec.2 but seen elsewhere
How reproducible:
100%
Steps to Reproduce:
1. oc adm upgrade --to-multi-arch --force
2. oc adm upgrade --to-multi-arch --to-latest
3. oc adm upgrade --to-multi-arch --force --to-latest
Actual results:
The flags are silently ignored, as explained above
Expected results:
The flags should either be blocked with the same error as --to and --to-image, or, if there is a use case, have the desired effect instead of being silently ignored
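A quick way to confirm whether a flag was silently dropped is to inspect the resulting desiredUpdate after each invocation; a minimal sketch:
$ oc adm upgrade --to-multi-arch --force
$ oc get clusterversion/version -o jsonpath='{.spec.desiredUpdate}{"\n"}'
# with the current behaviour, force stays false and no target version/image from --to-latest shows up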
Please review the following PR: https://github.com/openshift/machine-os-images/pull/40
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/143
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/console-operator/pull/929
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The label data for networking Services is inverted: it should be shown as "key=value", but it's currently shown as "value=key"
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947 4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Networking -> Services page and create a sample Service with labels, e.g.:
apiVersion: v1
kind: Service
metadata:
  name: exampleasd
  namespace: default
  labels:
    testkey1: testvalue1
    testkey2: testvalue2
spec:
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
2. Check the Labels on the Service details page
3. Check the Labels column on the Networking -> Services page
Actual results:
the data is shown as 'testvalue1=testkey1' and 'testvalue2=testkey2'
Expected results:
it should be shown as 'testkey1=testvalue1' and 'testkey2=testvalue2'
Additional info:
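The same labels can be double-checked from the CLI, which displays them correctly; a minimal sketch using the sample Service above:
$ oc get service exampleasd -n default --show-labels
$ oc get service exampleasd -n default -o jsonpath='{.metadata.labels}{"\n"}'
# both print testkey1=testvalue1 and testkey2=testvalue2, i.e. key=value - the console column should match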
https://download-01.beak-001.prod.iad2.dc.redhat.com/brewroot/work/tasks/7701/65007701/x86_64.log
we need --no-build-isolation in the download command too
this has been verified with ART
Description of problem:
Creating a faulty configmap for UWM results in cluster_operator_up=0 with the reason InvalidConfiguration. With https://issues.redhat.com/browse/MON-3421 we're expecting the reason to match UserWorkload.*
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
100%
Steps to Reproduce:
Apply the following CM to a cluster with UWM enabled:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    hah helo! :)
Actual results:
cluster_operator_up=0 with reason InvalidConfiguration
Expected results:
cluster_operator_up=0 with reason matching pattern UserWorkload.*
Additional info:
https://issues.redhat.com/browse/MON-3421 streamlined reasons to allow separation between UWM and cluster monitoring. The above is a leftover that should be updated to match the same pattern.
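A hedged sketch of checking the reason the operator currently reports after applying the faulty configmap:
$ oc get clusteroperator monitoring -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.reason}{"\n"}{end}'
# the Degraded/Available conditions currently carry reason InvalidConfiguration; per MON-3421 a UserWorkload* reason is expected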
Description of problem:
This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing. LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue.
Version-Release number of selected component (if applicable):
4.15.11
How reproducible:
Steps to Reproduce:
(From the customer)
1. Configure LDAP IDP
2. Configure Proxy
3. LDAP IDP communication from the control plane oauth pod goes through the proxy instead of going to the ldap endpoint directly
Actual results:
LDAP IDP communication from the control plane oauth pod goes through proxy
Expected results:
LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings
Additional info:
For more information, see linked tickets.
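For reference, a minimal sketch of the workaround customers currently apply (the LDAP hostname is a placeholder), which should not be necessary once LDAP traffic bypasses the proxy:
$ oc patch proxy/cluster --type=merge -p '{"spec":{"noProxy":"existing-entries,ldap.example.com"}}'
# note: this replaces spec.noProxy, so any existing entries must be preserved in the new comma-separated value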
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/421
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The dev console seems to require setting a Project/namespace. However, in the CLI, RoleBinding objects can be created without a namespace with no issues.
$ oc describe rolebinding.rbac.authorization.k8s.io/monitor
Name: monitor
Labels: <none>
Annotations: <none>
Role:
Kind: ClusterRole
Name: view
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount monitor
This is inconsistent with the dev console, causing confusion for developers and administrators and making things cumbersome.
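For comparison, a minimal sketch of the CLI path that succeeds without a subject namespace (the RoleBinding itself lands in the current project):
$ cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- kind: ServiceAccount
  name: monitor        # no namespace set, mirroring the describe output above
EOF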
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Log in to the web console as Developer.
2. Select Project on the left.
3. Select the 'Project Access' tab.
4. Add access -> select Service Account in the dropdown
Actual results:
Save button is not active when no project is selected
Expected results:
The Save button should be enabled even when no Project is selected, so that the RoleBinding can be created just as it is handled in the CLI.
Additional info:
Description of problem:
openshift-install create cluster leads to an error when the target is a vSphere standard port group: ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. openshift-install create cluster
2. Choose vSphere
3. Fill in the blanks
4. Have a standard port group
Actual results:
error
Expected results:
cluster creation
Additional info:
Description of problem:
The single-page docs are missing the "oc adm policy add-cluster-role-to-*" and "remove-cluster-role-from-*" commands. These options exist in these docs: https://docs.openshift.com/container-platform/4.14/authentication/using-rbac.html but not in these docs: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#oc-adm-policy-add-role-to-user
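For reference, the commands whose documentation is missing from the single-page CLI docs (user and group names are placeholders):
$ oc adm policy add-cluster-role-to-user cluster-reader alice
$ oc adm policy add-cluster-role-to-group cluster-reader my-group
$ oc adm policy remove-cluster-role-from-user cluster-reader alice
$ oc adm policy remove-cluster-role-from-group cluster-reader my-group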
Description of problem:
VirtualizedTable, which is exposed to dynamic plugins, is missing the onRowsRendered prop, which is available in VirtualTableBody of the @patternfly/react-virtualized-extension package
Version-Release number of selected component (if applicable):
4.15.z
Actual results:
onRowsRendered prop is not available in VirtualizedTable component
Expected results:
onRowsRendered prop should be available in VirtualizedTable component
Additional info:
Description of problem:
Necessary security group rules are not created when using installer created VPC.
Version-Release number of selected component (if applicable):
4.17.2
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy a Power VS cluster and have the installer create the VPC, or remove required rules from a VPC you're bringing.
2. Control plane nodes fail to bootstrap.
3. Fail
Actual results:
Install fails
Expected results:
Install succeeds
Additional info:
Fix identified
Description of problem:
In OpenShift 4.13-4.15, when a "rendered" MachineConfig in use is deleted, it is automatically recreated. In OpenShift 4.16, it is not recreated, and the nodes and MCP become degraded due to the "rendered" not found error.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create a MC to deploy any file in the worker MCP
2. Get the name of the new rendered MC, for example "rendered-worker-bf829671270609af06e077311a39363e"
3. When the first node starts updating, delete the new rendered MC: oc delete mc rendered-worker-bf829671270609af06e077311a39363e
Actual results:
Node degraded with "rendered" not found error
Expected results:
In OCP 4.13 to 4.15, the "rendered" MC is automatically re-created, and the node continues updating to the MC content without issues. It should be the same in 4.16.
Additional info:
The behavior in 4.12 and older is the same as now in 4.16. In 4.13-4.15, the "rendered" MC is re-created and no issues with the nodes/MCPs are shown.
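A quick way to observe the degradation after deleting the in-use rendered MC; a minimal sketch:
$ oc get mcp worker -o jsonpath='{range .status.conditions[?(@.type=="NodeDegraded")]}{.status}{" "}{.message}{"\n"}{end}'
$ oc get mcp worker -o jsonpath='{.status.degradedMachineCount}{"\n"}'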
Description of problem:
Azure-File volume mount failed, it happens on an arm cluster with the multi payload.
$ oc describe pod
Warning FailedMount 6m28s (x2 over 95m) kubelet MountVolume.MountDevice failed for volume "pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2" : rpc error: code = InvalidArgument desc = GetAccountInfo(wduan-0319b-bkp2k-rg#clusterjzrlh#pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2###wduan) failed with error: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wduan-0319b-bkp2k-rg/providers/Microsoft.Storage/storageAccounts/clusterjzrlh/listKeys?api-version=2021-02-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post "https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token": dial tcp 20.190.190.193:443: i/o timeout'
The node log reports: W0319 09:41:30.745936 1 azurefile.go:806] GetStorageAccountFromSecret(azure-storage-account-clusterjzrlh-secret, wduan) failed with error: could not get secret(azure-storage-account-clusterjzrlh-secret): secrets "azure-storage-account-clusterjzrlh-secret" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:azure-file-csi-driver-node-sa" cannot get resource "secrets" in API group "" in the namespace "wduan"
Checked the role, it looks good, at least the same as previously:
$ oc get clusterrole azure-file-privileged-role -o yaml
...
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-03-13-031451
How reproducible:
2/2
Steps to Reproduce:
1. Checked in CI, azure-file cases failed due to this
2. Create one cluster with the same config and payload, create an azure-file pvc and pod
Actual results:
Pod could not be running
Expected results:
Pod should be running
Additional info:
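A hedged sketch of checking whether the node service account can actually read the storage-account secret in the user namespace, which is what the driver log complains about:
$ oc auth can-i get secrets -n wduan \
    --as=system:serviceaccount:openshift-cluster-csi-drivers:azure-file-csi-driver-node-sa
# a 'no' here would confirm the RBAC gap the mount error points at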
Description of problem:
In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open /settings/cluster using Firefox with Dark mode selected 2. 3.
Actual results:
The version numbers under Update status are black
Expected results:
The version numbers under Update status are white
Additional info:
Description of problem:
There are 2 problematic tests in the ImageEcosystem test suite: the rails sample and the s2i perl test. This issue tries to fix them both at once so that we can get a passing image ecosystem test.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Run the imageecosystem testsuite 2. observe the {[Feature:ImageEcosystem][ruby]} and {[Feature:ImageEcosystem][perl]} test fail
Actual results:
The two tests fail
Expected results:
No test failures
Additional info:
Description of the problem:
After multiple re-installations on the exact same baremetal host, re-using the exact same parameters (such as Agent ID, Cluster name, domain, etc.), the eventsURL hits a limit, so there is no direct way to check the progress, even though the postgres database does save the latest entries.
How reproducible:
Steps to reproduce:
1. Install an SNO cluster in a Host
2. Fully wipe out all the resources in RHACM, including SNO project
3. Re-install exact same SNO in the same Host
4. Repeat steps 1-3 multiple times
Actual results:
Last ManagedCluster installed is from 09/09 and the postgres database contains its last installation logs:
installer=> SELECT * FROM events WHERE host_id LIKE 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201' ORDER BY event_time DESC; id | created_at | updated_at | deleted_at | category | cluster_id | event_time | host_id | infra_env_id | message | name | props | request_id | severity --------+-------------------------------+-------------------------------+------------+----------+--------------------------------------+----------------------------+ --------------------------------------+--------------------------------------+--------------------------------------------------------------------------------------- --------------+-------+--------------------------------------+---------- 213102 | 2024-09-09 10:15:54.440757+00 | 2024-09-09 10:15:54.440757+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:15:54.439+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host sno1: validation 'api-int-domain-name-resolved-correctly' that used to succeed is now failing | host_validation_failed | | b7785748-9f73-46e8-a11a-afefe2bfeb59 | warning 213088 | 2024-09-09 10:06:16.021777+00 | 2024-09-09 10:06:16.021777+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:06:16.021+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Done | host_install_progress_updated | | a711f06b-870f-4f5f-886a-882ed6ea4665 | info 213087 | 2024-09-09 10:06:16.019012+00 | 2024-09-09 10:06:16.019012+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:06:16.018+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host sno1: updated status from installing-in-progress to installed (Done) | host_status_updated | | a711f06b-870f-4f5f-886a-882ed6ea4665 | info 213086 | 2024-09-09 10:05:16.029495+00 | 2024-09-09 10:05:16.029495+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:05:16.029+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Joined | host_install_progress_updated | | 2a8028c1-a0d0-4145-92cf-ea32e6b3f7e6 | info 213085 | 2024-09-09 10:03:32.06692+00 | 2024-09-09 10:03:32.06692+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:32.066+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Rebooting: Ironic will reboot the node shortly | host_install_progress_updated | | fced0438-2f03-415f-913e-62da2d43431b | info 213084 | 2024-09-09 10:03:31.998935+00 | 2024-09-09 10:03:31.998935+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:31.998+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Uploaded logs for host sno1 cluster c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | host_logs_uploaded | | df3bc18a-d56a-4a20-84cb-d179fe3040f6 | info 213083 | 2024-09-09 10:03:12.621342+00 | 2024-09-09 10:03:12.621342+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:12.621+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Writing image to disk: 100% | host_install_progress_updated | | 69cad5b4-b606-406c-921e-4f7b0ababfb6 | info 213082 | 2024-09-09 10:03:12.158359+00 | 2024-09-09 10:03:12.158359+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:12.158+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation 
stage Writing image to disk: 97%
But opening the Agent eventsURL (from 09/09 installation):
apiVersion: agent-install.openshift.io/v1beta1 kind: Agent metadata: annotations: inventory.agent-install.openshift.io/version: "0.1" creationTimestamp: "2024-09-09T09:55:46Z" finalizers: - agent.agent-install.openshift.io/ai-deprovision generation: 2 labels: agent-install.openshift.io/bmh: sno1 agent-install.openshift.io/clusterdeployment-namespace: sno1 infraenvs.agent-install.openshift.io: sno1 inventory.agent-install.openshift.io/cpu-architecture: x86_64 inventory.agent-install.openshift.io/cpu-virtenabled: "true" inventory.agent-install.openshift.io/host-isvirtual: "true" inventory.agent-install.openshift.io/host-manufacturer: RedHat inventory.agent-install.openshift.io/host-productname: KVM inventory.agent-install.openshift.io/storage-hasnonrotationaldisk: "false" name: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 namespace: sno1 ... ... debugInfo: eventsURL: https://assisted-service-multicluster-engine.apps.hub-sno.nokia-test.lab/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiJjZDBkZGRjMy1lODc5LTRjNzItOWU5ZC0zZDk4YmI3ODEzYmIifQ.eMlGvHeR69CoEA6OhtZX0uBZFeQOSRGOhYsqd1b0W3M78cGo1a2kbIKTz1eU80GUb70cU3v3pxKmxd19kpFaQA&host_id=aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 state: installed stateInfo: Done
Clicking on the eventsURL shows the latest event as one from 25 July, which means it is still showing past installations of the host and not the latest one:
{ "cluster_id": "4df40e8d-b28e-4cad-88d3-fa5c37a81939", "event_time": "2024-07-25T00:37:15.538Z", "host_id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201", "infra_env_id": "f6564380-9d04-47e3-afe9-b348204cf521", "message": "Host sno1: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)", "name": "host_status_updated", "severity": "info" }
Trying to replicate the behavior on the Postgres database, it looks as if only around 50,000 entries at most are considered and the last one of those is returned, something like:
installer=> SELECT * FROM (SELECT * FROM events WHERE host_id LIKE 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201' LIMIT 50000) AS A ORDER BY event_time DESC LIMIT 1; id | created_at | updated_at | deleted_at | category | cluster_id | event_time | host_id | infra_env_id | message | name | props | request_id | severity --------+-----------------------------+-----------------------------+------------+----------+--------------------------------------+----------------------------+---- ----------------------------------+--------------------------------------+------------------------------------------------------------------------------------------- ----------------------------------+---------------------+-------+--------------------------------------+---------- 170052 | 2024-07-29 04:41:53.4572+00 | 2024-07-29 04:41:53.4572+00 | | user | 4df40e8d-b28e-4cad-88d3-fa5c37a81939 | 2024-07-29 04:41:53.457+00 | aaa aaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | f6564380-9d04-47e3-afe9-b348204cf521 | Host sno1: updated status from known to preparing-for-installation (Host finished successf ully to prepare for installation) | host_status_updated | | 872c267a-499e-4b91-8bbb-fdc7ff4521aa | info
Expected results:
The user can directly see the latest events in the eventsURL; in this scenario, they would all be from the 09/09 installation and not from July.
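For illustration only, a minimal Go sketch (not the assisted-service implementation) of how the events endpoint is expected to behave: select the newest events for the host explicitly, ordering by event_time descending before applying any limit. Table and column names follow the psql output above; the driver, DSN and limit value are assumptions.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed Postgres driver for this sketch
)

// latestEvents prints the newest events for a host, newest first.
func latestEvents(db *sql.DB, hostID string, limit int) error {
	rows, err := db.Query(
		`SELECT event_time, message
		   FROM events
		  WHERE host_id = $1
		  ORDER BY event_time DESC
		  LIMIT $2`, hostID, limit)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var eventTime time.Time
		var message string
		if err := rows.Scan(&eventTime, &message); err != nil {
			return err
		}
		fmt.Println(eventTime.Format(time.RFC3339), message)
	}
	return rows.Err()
}

func main() {
	// Placeholder DSN; adjust for the real installer database.
	db, err := sql.Open("postgres", "postgres://installer@localhost/installer?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := latestEvents(db, "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201", 25); err != nil {
		log.Fatal(err)
	}
}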
Description of problem:
[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS vt1*/g4* instance types
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-16-033047
How reproducible:
Always
Steps to Reproduce:
1. Use instance type "vt1.3xlarge"/"g4ad.xlarge"/"g4dn.xlarge" install Openshift cluster on AWS 2. Check the csinode allocatable volumes count $ oc get csinode ip-10-0-53-225.ec2.internal -ojsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}' 26 g4ad.xlarge # 25 g4dn.xlarge # 25 vt1.3xlarge # 26 $ oc get no/ip-10-0-53-225.ec2.internal -oyaml| grep 'instance-type' beta.kubernetes.io/instance-type: vt1.3xlarge node.kubernetes.io/instance-type: vt1.3xlarge 3. Create statefulset with pvc(which use the ebs csi storageclass), nodeAnffinity to the same node and set the replicas to the max volumesallocatable count to verify the the csinode allocatable volumes count is correct and all the pods should become Running # Test data apiVersion: apps/v1 kind: StatefulSet metadata: name: statefulset-vol-limit spec: serviceName: "my-svc" replicas: 26 selector: matchLabels: app: my-svc template: metadata: labels: app: my-svc spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - ip-10-0-53-225.ec2.internal # Make all volume attach to the same node containers: - name: openshifttest image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339 volumeMounts: - name: data mountPath: /mnt/storage tolerations: - key: "node-role.kubernetes.io/master" effect: "NoSchedule" volumeClaimTemplates: - metadata: name: data spec: accessModes: [ "ReadWriteOnce" ] #storageClassName: gp3-csi resources: requests: storage: 1Gi
Actual results:
In step 3 some pods are stuck in "ContainerCreating" status because their volumes are stuck in attaching state and cannot be attached to the node.
Expected results:
In step 3 all the pods with a PVC should become "Running", and in step 2 the csinode allocatable volumes count should be correct:
-> g4ad.xlarge allocatable count should be 24
-> g4dn.xlarge allocatable count should be 24
-> vt1.3xlarge allocatable count should be 24
Additional info:
... attach or mount volumes: unmounted volumes=[data12 data6], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition 06-25 17:51:23.680 Warning FailedAttachVolume 4m1s (x13 over 14m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-d08d4133-f589-4aa3-bbef-f988058c419a" : rpc error: code = Internal desc = Could not attach volume "vol-0aa138f453d414ec3" to node "i-09d532f5155b3c05d": attachment of disk "vol-0aa138f453d414ec3" failed, expected device to be attached but was attaching 06-25 17:51:23.681 Warning FailedMount 3m40s (x3 over 10m) kubelet Unable to attach or mount volumes: unmounted volumes=[data6 data12], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition ...
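The expected counts above come from attachment slots being shared on these instance types; a rough Go sketch of that accounting follows. All numbers and the exact set of devices that consume slots are assumptions here, not the aws-ebs-csi-driver's actual limit tables.

package main

import "fmt"

// allocatableVolumes subtracts everything already occupying an attachment slot
// on the node (e.g. the root EBS volume, ENIs and any accelerator or
// instance-store devices on instance types where they share the limit) from
// the instance's attachment limit.
func allocatableVolumes(instanceAttachmentLimit, reservedSlots int) int {
	return instanceAttachmentLimit - reservedSlots
}

func main() {
	// Purely illustrative: with a shared limit of 28 and 4 reserved slots the
	// node would advertise 24 allocatable volumes, matching the expectation
	// above for the g4*/vt1* instance types.
	fmt.Println(allocatableVolumes(28, 4))
}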
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when they are ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
Numbers input into NumberSpinnerField that are above 2147483647 are not accepted as integers
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enter a number larger than 2147483647 into any NumberSpinnerField
Actual results:
Number is not accepted as an integer
Expected results:
There should be a separate validation error stating the number should be less than 2147483647
Additional info:
See https://github.com/openshift/console/pull/14084
Description of problem:
On the NetworkPolicies page, select MultiNetworkPolicies and create a policy; the created policy is not a MultiNetworkPolicy but a NetworkPolicy.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create a MultiNetworkPolicy 2. 3.
Actual results:
The policy is a NetworkPolicy, not a MultiNetworkPolicy
Expected results:
It is a MultiNetworkPolicy
Additional info:
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly over the 90th percentile, we picked it up at P95 where it shows a consistent 5-8s more than we'd expect given the data in 4.16 ga.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants, meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator going degraded is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Tracker issue for bootimage bump in 4.18. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-41259.
Description of problem:
The message of the co olm Upgradeable condition is not correct if a ClusterExtension (without olm.maxOpenShiftVersion) is installed.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-06-223232
How reproducible:
always
Steps to Reproduce:
1.create ClusterCatalog apiVersion: olm.operatorframework.io/v1alpha1 kind: ClusterCatalog metadata: name: catalog-1 labels: example.com/support: "true" provider: olm-1 spec: priority: 1000 source: type: image image: ref: quay.io/openshifttest/nginxolm-operator-index:nginxolm74108 2. create ns and sa 3. create ClusterExtension apiVersion: olm.operatorframework.io/v1alpha1 kind: ClusterExtension metadata: name: test-74108 spec: source: sourceType: Catalog catalog: packageName: nginx74108 channels: - candidate-v1.1 install: serviceAccount: name: sa-74108 namespace: test-74108 4. check co olm status status: conditions: - lastTransitionTime: "2024-10-08T11:51:01Z" message: 'OLMIncompatibleOperatorControllerDegraded: error with cluster extension test-74108: error in bundle nginx74108.v1.1.0: could not convert olm.properties: failed to unmarshal properties annotation: unexpected end of JSON input' reason: OLMIncompatibleOperatorController_SyncError status: "True" type: Degraded - lastTransitionTime: "2024-10-08T02:16:36Z" message: All is well reason: AsExpected status: "False" type: Progressing - lastTransitionTime: "2024-10-08T02:16:36Z" message: All is well reason: AsExpected status: "True" type: Available - lastTransitionTime: "2024-10-08T11:48:26Z" message: 'InstalledOLMOperatorsUpgradeable: error with cluster extension test-74108: error in bundle nginx74108.v1.1.0: could not convert olm.properties: failed to unmarshal properties annotation: unexpected end of JSON input' reason: InstalledOLMOperators_FailureGettingExtensionMetadata status: "False" type: Upgradeable - lastTransitionTime: "2024-10-08T02:09:59Z" reason: NoData status: Unknown type: EvaluationConditionsDetected
Actual results:
co olm is Degraded
Expected results:
co olm is OK
Additional info:
The following CSV annotation is not configured: olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.8"}]'
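A minimal Go sketch of the direction the expected behaviour implies (not the cluster-olm-operator code): treat a missing or empty olm.properties annotation as "no properties" instead of returning an unmarshal error that degrades the co. The type and function names are assumptions of this sketch.

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// property mirrors the olm.properties entries shown above.
type property struct {
	Type  string          `json:"type"`
	Value json.RawMessage `json:"value"`
}

// parseProperties returns no properties when the annotation is absent or
// empty, and only errors on genuinely malformed JSON.
func parseProperties(annotation string) ([]property, error) {
	annotation = strings.TrimSpace(annotation)
	if annotation == "" {
		return nil, nil
	}
	var props []property
	if err := json.Unmarshal([]byte(annotation), &props); err != nil {
		return nil, fmt.Errorf("could not parse olm.properties annotation: %w", err)
	}
	return props, nil
}

func main() {
	props, err := parseProperties(`[{"type": "olm.maxOpenShiftVersion", "value": "4.8"}]`)
	fmt.Println(props, err)

	// The failing case from the bug: no annotation at all should not be an error.
	props, err = parseProperties("")
	fmt.Println(props, err)
}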
Description of problem:
When the console is loaded there are errors in the browser's console about failing to fetch networking-console-plugin locales.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The issue is also affecting console CI.
Description of problem:
4.18 EFS controller and node pods are left behind after uninstalling the driver
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-08-075347
How reproducible:
Always
Steps to Reproduce:
1. Install the 4.18 EFS operator and driver on the cluster and check that the EFS pods are all up and Running
2. Uninstall the EFS driver and check whether the controller and node pods get deleted
Execution on 4.16 and 4.18 clusters
4.16 cluster oc create -f og-sub.yaml oc create -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-b8858785-72tp9 4/4 Running 0 4s aws-efs-csi-driver-controller-b8858785-gvk4b 4/4 Running 0 6s aws-efs-csi-driver-node-2flqr 3/3 Running 0 9s aws-efs-csi-driver-node-5hsfp 3/3 Running 0 9s aws-efs-csi-driver-node-kxnlv 3/3 Running 0 9s aws-efs-csi-driver-node-qdshm 3/3 Running 0 9s aws-efs-csi-driver-node-ss28h 3/3 Running 0 9s aws-efs-csi-driver-node-v9zwx 3/3 Running 0 9s aws-efs-csi-driver-operator-65b55bf877-4png9 1/1 Running 0 2m53s oc get clustercsidrivers | grep "efs" efs.csi.aws.com 2m26s oc delete -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-operator-65b55bf877-4png9 1/1 Running 0 4m40s 4.18 cluster oc create -f og-sub.yaml oc create -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-56d68dc976-847lr 5/5 Running 0 9s aws-efs-csi-driver-controller-56d68dc976-9vklk 5/5 Running 0 11s aws-efs-csi-driver-node-46tsq 3/3 Running 0 18s aws-efs-csi-driver-node-7vpcd 3/3 Running 0 18s aws-efs-csi-driver-node-bm86c 3/3 Running 0 18s aws-efs-csi-driver-node-gz69w 3/3 Running 0 18s aws-efs-csi-driver-node-l986w 3/3 Running 0 18s aws-efs-csi-driver-node-vgwpc 3/3 Running 0 18s aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv 1/1 Running 0 2m55s oc get clustercsidrivers efs.csi.aws.com 2m19s oc delete -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-56d68dc976-847lr 5/5 Running 0 4m58s aws-efs-csi-driver-controller-56d68dc976-9vklk 5/5 Running 0 5m aws-efs-csi-driver-node-46tsq 3/3 Running 0 5m7s aws-efs-csi-driver-node-7vpcd 3/3 Running 0 5m7s aws-efs-csi-driver-node-bm86c 3/3 Running 0 5m7s aws-efs-csi-driver-node-gz69w 3/3 Running 0 5m7s aws-efs-csi-driver-node-l986w 3/3 Running 0 5m7s aws-efs-csi-driver-node-vgwpc 3/3 Running 0 5m7s aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv 1/1 Running 0 7m44s oc get clustercsidrivers | grep "efs" => Nothing is there
Actual results:
The EFS controller and node pods are left behind
Expected results:
After uninstalling the driver, the EFS controller and node pods should get deleted
Additional info:
On the 4.16 cluster this is working fine.
EFS Operator logs:
oc logs aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv
E1009 07:13:41.460469 1 base_controller.go:266] "LoggingSyncer" controller failed to sync "key", err: clustercsidrivers.operator.openshift.io "efs.csi.aws.com" not found
Discussion: https://redhat-internal.slack.com/archives/C02221SB07R/p1728456279493399
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/249
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Go to the NetworkPolicies page and make sure there are policies in each tab. Go to the MultiNetworkPolicies tab and create a filter, then move to the first tab (NetworkPolicies tab); it does not show the policies any more.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Have policies on NetworkPolicies tab and MultiNetworkPolicies tab 2. Create a filter on MultiNetworkPolicies tab 3. Go to NetworkPolicies tab
Actual results:
It shows "Not found"
Expected results:
the list of networkpolicies shows up
Additional info:
Description of problem:
External network ID should be an optional CLI option, but when it is not given, the HyperShift Operator crashes with a nil pointer error.
Version-Release number of selected component (if applicable):
4.18 and 4.17
Description of problem:
Creation of a pipeline through import from git using a devfile repo does not work
Version-Release number of selected component (if applicable):
How reproducible:
Everytime
Steps to Reproduce:
1. Create a pipeline from import from git form using devfile repo `https://github.com/nodeshift-starters/devfile-sample.git` 2. Check pipelines page 3.
Actual results:
No pipeline is created; instead a build config is created for it
Expected results:
If the pipeline option is shown in the import from git form for a repo, the pipeline should be generated
Additional info:
Description of problem:
Completions column values need to be marked for translation.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Steps to Reproduce:
1. Navigate to Workloads - Jobs 2. Values under Completions column are in English 3.
Actual results:
Content is in English
Expected results:
Content should be in target language
Additional info:
screenshot provided
arm64 has been dev preview for CNV since 4.14. The installer shouldn't block installing it.
Just make sure it is shown in the UI as dev preview.
In all releases tested, in particular 4.16.0-0.okd-scos-2024-08-21-155613, the Samples operator uses incorrect templates, resulting in the following alert:
Samples operator is detecting problems with imagestream image imports. You can look at the "openshift-samples" ClusterOperator object for details. Most likely there are issues with the external image registry hosting the images that needs to be investigated. Or you can consider marking samples operator Removed if you do not care about having sample imagestreams available. The list of ImageStreams for which samples operator is retrying imports: fuse7-eap-openshift fuse7-eap-openshift-java11 fuse7-java-openshift fuse7-java11-openshift fuse7-karaf-openshift-jdk11 golang httpd java jboss-datagrid73-openshift jboss-eap-xp3-openjdk11-openshift jboss-eap-xp3-openjdk11-runtime-openshift jboss-eap-xp4-openjdk11-openshift jboss-eap-xp4-openjdk11-runtime-openshift jboss-eap74-openjdk11-openshift jboss-eap74-openjdk11-runtime-openshift jboss-eap74-openjdk8-openshift jboss-eap74-openjdk8-runtime-openshift jboss-webserver57-openjdk8-tomcat9-openshift-ubi8 jenkins jenkins-agent-base mariadb mysql nginx nodejs perl php postgresql13-for-sso75-openshift-rhel8 postgresql13-for-sso76-openshift-rhel8 python redis ruby sso75-openshift-rhel8 sso76-openshift-rhel8 fuse7-karaf-openshift jboss-webserver57-openjdk11-tomcat9-openshift-ubi8 postgresql
For example, the sample image for Mysql 8.0 is being pulled from registry.redhat.io/rhscl/mysql-80-rhel7:latest (and cannot be found using the dummy pull secret).
Works correctly on OKD FCOS builds.
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/76
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/216
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In GetMirrorFromRelease() https://github.com/openshift/installer/blob/master/pkg/asset/agent/mirror/registriesconf.go#L313-L328, the agent installer sets the mirror for the release image based on the source url.
This setting is then used in assisted-service to extract images etc. https://github.com/openshift/assisted-service/blob/master/internal/oc/release.go#L328-L340 in conjunction with the icsp file.
The problem is that GetMirrorFromRelease() returns just the first entry in registries.conf, so it is not really the actual mirror in the case where a source has multiple mirrors. A better way to handle this would be to not set the env variable OPENSHIFT_INSTALL_RELEASE_IMAGE_MIRROR and just let the resolving of the mirror be handled by the icsp file. It is currently using the icsp file, but since the source has been changed to the mirror it might not use those mirrors if, for example, the first mirror does not have the manifest file. See the sketch after the mirror config below.
We've had an internal report of a failure when using mirroring:
Oct 01 10:06:16 master-0 agent-register-cluster[7671]: time="2024-10-01T14:06:16Z" level=fatal msg="Failed to register cluster with assisted-service: command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=true --icsp-file=/tmp/icsp-file2810072099 registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b --registry-config=/tmp/registry-config204889789' exited with non-zero exit code 1: \nFlag --icsp-file has been deprecated, support for it will be removed in a future release. Use --idms-file instead.\nerror: image \"registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b\" not found: manifest unknown: manifest unknown\n"
When using the mirror config:
[[registry]]
  location = "quay.io/openshift-release-dev/ocp-release"
  mirror-by-digest-only = true
  prefix = ""

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev"

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release"

[[registry]]
  location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
  mirror-by-digest-only = true
  prefix = ""

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev"

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release"
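To make the failure mode concrete, here is an illustrative Go sketch (not the installer's code) showing that a single registries.conf source can carry several mirrors, which is why returning only the first entry is lossy. The struct and function names and the example registry host are assumptions.

package main

import (
	"fmt"

	"github.com/BurntSushi/toml"
)

// registriesConf models just the fields used below; the field names mirror the
// TOML keys in the config above.
type registriesConf struct {
	Registries []struct {
		Location string `toml:"location"`
		Mirrors  []struct {
			Location string `toml:"location"`
		} `toml:"mirror"`
	} `toml:"registry"`
}

// mirrorsForSource returns every mirror configured for a source, not just the
// first one.
func mirrorsForSource(confData, source string) ([]string, error) {
	var conf registriesConf
	if _, err := toml.Decode(confData, &conf); err != nil {
		return nil, err
	}
	var mirrors []string
	for _, reg := range conf.Registries {
		if reg.Location != source {
			continue
		}
		for _, m := range reg.Mirrors {
			mirrors = append(mirrors, m.Location)
		}
	}
	return mirrors, nil
}

func main() {
	conf := `
[[registry]]
  location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
  [[registry.mirror]]
    location = "registry.example.com:5000/openshift-release-dev/ocp-v4.0-art-dev"
  [[registry.mirror]]
    location = "registry.example.com:5000/openshift-release-dev/ocp-release"
`
	mirrors, err := mirrorsForSource(conf, "quay.io/openshift-release-dev/ocp-v4.0-art-dev")
	fmt.Println(mirrors, err)
}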
Description of problem:
The library-sync.sh script may leave some files of the unsupported samples in the checkout. In particular, the files that have been renamed are not deleted even though they should have been.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run library-sync.sh
Actual results:
A couple of files under assets/operator/ocp-x86_64/fis are present.
Expected results:
The directory should not be present at all, because it is not supported.
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/268
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/network-tools/pull/133
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/586
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/search?q=repo%3Aopenshift%2Fconsole+name+%3D%3D%3D+%27%7Enew%27&type=code shows a number of instances in Console code where there is a check for a resource name with a value of "~new". This check is not valid as a resource name cannot include "~". We should remove these invalid checks.
Component Readiness has found a potential regression in the following test:
[bz-Routing] clusteroperator/ingress should not change condition/Available
Probability of significant regression: 97.63%
Sample (being evaluated) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-09-09T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 67
Failures: 0
Flakes: 0
It is worth mentioning that in two of the three failures, ingress operator went available=false at the same time image registry went available=false. This is one example.
The team can investigate, and if a legitimate reason exists, please create an exception with origin and address it at the proper time: https://github.com/openshift/origin/blob/4557bdcecc10d9fa84188c1e9a36b1d7d162c393/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L90
Since this is appearing on component readiness dashboard and the management depends on a green dashboard to make release decisions, please give the initial investigation a high priority. If an exception is needed, please contact TRT team to triage the issue.
We are aiming to find containers that are restarting more than 3 times during an e2e test. Critical pods like metal3-static-ip-set should not be restarting more than 3 times during a test.
Can your team investigate this and aim to fix it?
For now, we will exclude our test from failing.
for an example of how often this container restarts during a test.
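A minimal Go sketch of the kind of check involved (an assumption, not the actual origin monitor test code): walk the pod container statuses and flag anything that restarted more than three times.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// excessiveRestarts flags containers that restarted more than limit times.
func excessiveRestarts(pods []corev1.Pod, limit int32) []string {
	var offenders []string
	for _, pod := range pods {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.RestartCount > limit {
				offenders = append(offenders,
					fmt.Sprintf("%s/%s container %s restarted %d times",
						pod.Namespace, pod.Name, cs.Name, cs.RestartCount))
			}
		}
	}
	return offenders
}

func main() {
	// Illustrative data only.
	pod := corev1.Pod{}
	pod.Namespace = "openshift-machine-api"
	pod.Name = "metal3-example"
	pod.Status.ContainerStatuses = []corev1.ContainerStatus{
		{Name: "metal3-static-ip-set", RestartCount: 5},
	}
	fmt.Println(excessiveRestarts([]corev1.Pod{pod}, 3))
}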
Description of problem:
See attached screenshots. Different operator versions have different descriptions, but OperatorHub still shows the same description regardless of which operator version is selected.
Version-Release number of selected component (if applicable):
OCP 4.16
How reproducible:
Always
Steps to Reproduce:
1. Open OperatorHub and find the Sail operator 2. Select the Sail Operator 3. Choose different versions and channels
Actual results:
The description is always the same even though the actual description for the given version is different.
Expected results:
Expected behavior - when selecting different operator versions during installation, the description should be updated according to the selected operator version.
Additional info:
See attachments in original issue https://issues.redhat.com/browse/OPECO-3239
Description of problem:
After upgrading the cluster to 4.15, the Prometheus Operator's "Prometheus" tab does not show the Prometheus instances; they can still be viewed and accessed through the "All instances" tab.
Version-Release number of selected component (if applicable):
OCP v4.15
Steps to Reproduce:
1. Install the Prometheus operator from OperatorHub 2. Create a Prometheus instance 3. The instance will be visible under the All instances tab, not under the Prometheus tab
Actual results:
The Prometheus instance is visible in the All instances tab only.
Expected results:
The Prometheus instance should be visible in the All instances tab as well as the Prometheus tab.
Description of problem:
The Azure cloud node manager uses a service account with a cluster role attached that provides it with cluster-wide permissions to update Node objects. This means that, were the service account to become compromised, Node objects could be maliciously updated. To limit the blast radius of a leak, we should determine whether there is a way to limit the Azure cloud node manager to only be able to update the node on which it resides, or to move its functionality centrally within the cluster. Possible paths:
* Check upstream progress for any attempt to move the node manager role into the CCM
* See if we can re-use kubelet credentials, as these are already scoped to updating only the Node on which they reside
* See if there's another admission control method we can use to limit the updates (possibly https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/)
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In Azure Stack, the Azure-Disk CSI Driver node pod CrashLoopBackOff: openshift-cluster-csi-drivers azure-disk-csi-driver-node-57rxv 1/3 CrashLoopBackOff 33 (3m55s ago) 59m 10.0.1.5 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-m62cj <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-8wvqm 1/3 CrashLoopBackOff 35 (29s ago) 67m 10.0.0.6 ci-op-q8b6n4iv-904ed-kp5mv-master-1 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-97ww5 1/3 CrashLoopBackOff 33 (12s ago) 67m 10.0.0.7 ci-op-q8b6n4iv-904ed-kp5mv-master-2 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-9hzw9 1/3 CrashLoopBackOff 35 (108s ago) 59m 10.0.1.4 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-gjqmw <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-glgzr 1/3 CrashLoopBackOff 34 (69s ago) 67m 10.0.0.8 ci-op-q8b6n4iv-904ed-kp5mv-master-0 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-hktfb 2/3 CrashLoopBackOff 48 (63s ago) 60m 10.0.1.6 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-kdbpf <none> <none>
The CSI-Driver container log: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0xc8 pc=0x18ff5db] goroutine 228 [running]: sigs.k8s.io/cloud-provider-azure/pkg/provider.(*Cloud).GetZone(0xc00021ec00, {0xc0002d57d0?, 0xc00005e3e0?}) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_zones.go:182 +0x2db sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).NodeGetInfo(0xc000144000, {0x21ebbf0, 0xc0002d5470}, 0x273606a?) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/nodeserver.go:336 +0x13b github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler.func1({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320}) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7160 +0x72 sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320?}, 0xc0003b0340, 0xc00050ae10) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409 github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler({0x1ec2f40?, 0xc000144000}, {0x21ebbf0, 0xc0002d5470}, 0xc000054680, 0x20167a0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7162 +0x135 google.golang.org/grpc.(*Server).processUnaryRPC(0xc000530000, {0x21ebbf0, 0xc0002d53b0}, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40, 0xc00052c810, 0x30fa1c8, 0x0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1343 +0xe03 google.golang.org/grpc.(*Server).handleStream(0xc000530000, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1737 +0xc4c google.golang.org/grpc.(*Server).serveStreams.func1.1() /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:986 +0x86 created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 260 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:997 +0x145
The registrar container log: E0321 23:08:02.679727 1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = error reading from server: EOF, restarting registration container.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-21-152650
How reproducible:
Seen in the CI profile; a manual install failed earlier as well.
Steps to Reproduce:
See Description
Actual results:
Azure-Disk CSI Driver node pod CrashLoopBackOff
Expected results:
Azure-Disk CSI Driver node pod should be running
Additional info:
See gather-extra and must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-azure-stack-ipi-proxy-fips-f2/1770921405509013504/artifacts/azure-stack-ipi-proxy-fips-f2/
Please review the following PR: https://github.com/openshift/image-registry/pull/411
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
CI Disruption during node updates:
4.18 Minor and 4.17 micro upgrades started failing with the initial 4.17 payload 4.17.0-0.ci-2024-08-09-225819
4.18 Micro upgrade failures began with the initial payload 4.18.0-0.ci-2024-08-09-234503
CI Disruption in the -out-of-change jobs in the nightlies that start with
4.18.0-0.nightly-2024-08-10-011435 and
4.17.0-0.nightly-2024-08-09-223346
The common change in all of those scenarios appears to be:
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4437
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4518
Description of problem:
The openshift-apiserver, which sends traffic through the konnectivity proxy, is also sending traffic intended for the local audit-webhook service through it. The audit-webhook service should be included in the NO_PROXY env var of the openshift-apiserver container.
4.14.z, 4.15.z, 4.16.z
How reproducible:
Always
Steps to Reproduce:
1. Create a rosa hosted cluster 2. Obeserve logs of the konnectivity-proxy sidecar of openshift-apiserver 3.
Actual results:
Logs include requests to the audit-webhook local service
Expected results:
Logs do not include requests to audit-webhook
Additional info:
Slack thread asking apiserver team
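A minimal Go sketch of the expected fix direction (illustrative only, not the HyperShift control-plane-operator code): append the local audit-webhook service name to the NO_PROXY value of the openshift-apiserver container so audit traffic bypasses the konnectivity proxy. The service name "audit-webhook" follows the bug description; the existing NO_PROXY value is made up.

package main

import (
	"fmt"
	"strings"
)

// appendNoProxy joins additional hosts onto an existing NO_PROXY value.
func appendNoProxy(existing string, hosts ...string) string {
	parts := []string{}
	if existing != "" {
		parts = append(parts, existing)
	}
	parts = append(parts, hosts...)
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(appendNoProxy("kube-apiserver,.cluster.local", "audit-webhook"))
}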
We saw excess pathological events tests failing aggregated jobs in AWS and GCP for 4.18.0-0.ci-2024-09-26-062917 (Azure has them too and has now failed in 4.18.0-0.nightly-2024-09-26-093014). The events are in namespace/openshift-apiserver-operator and namespace/openshift-authentication-operator – reason/DeploymentUpdated Updated Deployment.apps/apiserver -n openshift-oauth-apiserver because it changed
Examples:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/127
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/24
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
After successful deployment, trying to delete spoke resources.
BMHs are not being removed and are stuck.
How reproducible:
Always
Steps to reproduce:
1. Deploy spoke node (tested in disconnected + IPV6 but CI also fails on ipv4)
2. Try to delete BMH (after deleting agents)
3.
Actual results:
BMH is still in provisioned state and not being deleted.
From assisted logs:
-------
time="2024-09-20T21:02:23Z" level=error msg="failed to delete BMH" func=github.com/openshift/assisted-service/internal/controller/controllers.removeSpokeResources file="/remote-source/assisted-service/app/internal/controller/controllers/agent_controller.go:450" agent=6df557e8-00af-4377-ac93-096b66c8e3c6 agent_namespace=spoke-0 error="failed to remove BMH openshift-machine-api/spoke-worker-0-1 finalizers: Internal error occurred: failed calling webhook \"baremetalhost.metal3.io\": failed to call webhook: Post \"https://baremetal-operatf557e8-00af-4377-ac93-096b66c8e3c6 agent_namespace=spoke-0 error="failed to remove BMH openshift-machine-api/spoke-worker-0-1 finalizers: Internal error occurred: failed calling webhook \"baremetalhost.metal3.io\": failed to call webhook: Post \"https://baremetal-operator-webhook-service.openshift-machine-api.svc:443/validate-metal3-io-v1alpha1-baremetalhost?timeout=10s\": no endpoints available for service \"baremetal-operator-webhook-service\"" go-id=393 hostname=spoke-worker-0-1 machine=spoke-0-f9w48-worker-0-x484f machine_namespace=openshift-machine-api machine_set=spoke-0-f9w48-worker-0 node=spoke-w
--------
Expected results:
BMH should be deleted
must-gather: https://drive.google.com/file/d/1JOeDGTzQNgDy9ZdjlJMcRi-hksB6Iz9h/view?usp=drive_link
Description of the problem:
[Staging] BE 2.35.0, UI 2.34.2 - BE allows LVMS and ODF to be enabled
How reproducible:
100%
Steps to reproduce:
1.
Actual results:
Expected results:
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/364
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1332
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The assisted service is throwing an error message stating that the Cloud Controller Manager (CCM) is not enabled, even though the CCM value is correctly set in the install-config file.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-19-045205
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config and agent-config for external OCI platform. Example of install-config configuration:
.......
.......
platform:
  external:
    platformName: oci
    cloudControllerManager: External
.......
.......
2. Create agent ISO for external OCI platform
3. Boot up nodes using created agent ISO
Actual results:
Oct 21 16:40:47 agent-sno.private.agenttest.oraclevcn.com service[2829]: time="2024-10-21T16:40:47Z" level=info msg="Register cluster: agenttest with id 2666753a-0485-420b-b968-e8732da6898c and params {\"api_vips\":[],\"base_dns_domain\":\"abitest.oci-rhelcert.edge-sro.rhecoeng.com\",\"cluster_networks\":[{\"cidr\":\"10.128.0.0/14\",\"host_prefix\":23}],\"cpu_architecture\":\"x86_64\",\"high_availability_mode\":\"None\",\"ingress_vips\":[],\"machine_networks\":[{\"cidr\":\"10.0.0.0/20\"}],\"name\":\"agenttest\",\"network_type\":\"OVNKubernetes\",\"olm_operators\":null,\"openshift_version\":\"4.18.0-0.nightly-2024-10-19-045205\",\"platform\":{\"external\":{\"cloud_controller_manager\":\"\",\"platform_name\":\"oci\"},\"type\":\"external\"},\"pull_secret\":\"***\",\"schedulable_masters\":false,\"service_networks\":[{\"cidr\":\"172.30.0.0/16\"}],\"ssh_public_key\":\"ssh-rsa XXXXXXXXXXXX\",\"user_managed_networking\":true,\"vip_dhcp_allocation\":false}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/src/internal/bminventory/inventory.go:515" cluster_id=2666753a-0485-420b-b968-e8732da6898c go-id=2110 pkg=Inventory request_id=82e83b31-1c1b-4dea-b435-f7316a1965e
Expected results:
The cluster installation should be successful.
Description of problem:
When doing mirror to mirror, the operator catalog image is counted twice:
✓ 70/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
✓ 71/81 : (3s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
✓ 72/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
2024/09/06 04:55:05 [INFO] : Mirroring is ongoing. No errors.
✓ 73/81 : (0s) oci:///test/ibm-catalog
✓ 74/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 75/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 76/81 : (3s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 77/81 : (1s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 78/81 : (3s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 79/81 : (2s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 80/81 : (2s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-unknown-7b0b3bf2", GitCommit:"7b0b3bf2", GitTreeState:"clean", BuildDate:"2024-09-06T01:32:29Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Do mirror2mirror with the following imagesetconfig:
cat config-136.yaml
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: windows-machine-config-operator
    - name: cluster-kube-descheduler-operator
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    packages:
    - name: servicemeshoperator
    - name: windows-machine-config-operator
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15
    packages:
    - name: nvidia-network-operator
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.15
    packages:
    - name: skupper-operator
    - name: reportportal-operator
  - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.15
    packages:
    - name: dynatrace-operator-rhmp
`oc-mirror -c config-136.yaml docker://localhost:5000/m2m06 --workspace file://m2m6 --v2 --dest-tls-verify=false`
Actual results:
The operator catalog images are counted twice:
✓ 70/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
✓ 71/81 : (3s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
✓ 72/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
2024/09/06 04:55:05 [INFO] : Mirroring is ongoing. No errors.
✓ 73/81 : (0s) oci:///test/ibm-catalog
✓ 74/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 75/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 76/81 : (3s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 77/81 : (1s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 78/81 : (3s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 79/81 : (2s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 80/81 : (2s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
Expected results:
Should count the operator catalog images correctly, i.e. each catalog only once.
Additional info:
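For illustration, a small Go sketch (not oc-mirror's collector code) of the de-duplication the expected results imply: collapse repeated catalog image references before counting and mirroring them.

package main

import "fmt"

// uniqueRefs de-duplicates catalog image references while preserving order,
// so each catalog is mirrored and counted once.
func uniqueRefs(refs []string) []string {
	seen := make(map[string]bool, len(refs))
	out := make([]string, 0, len(refs))
	for _, r := range refs {
		if seen[r] {
			continue
		}
		seen[r] = true
		out = append(out, r)
	}
	return out
}

func main() {
	refs := []string{
		"docker://registry.redhat.io/redhat/community-operator-index:v4.15",
		"docker://registry.redhat.io/redhat/community-operator-index:v4.15",
		"docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15",
	}
	fmt.Println(uniqueRefs(refs)) // each catalog listed once
}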
Description of problem:
The hypershift CLI has an implicit dependency on the az and jq commands, as it invokes them directly. As a result, the "hypershift-azure-create" chain will not work since it's based on the hypershift-operator image, which lacks these tools.
Expected results:
Refactor the hypershift CLI to handle these dependencies in a Go-native way, so that the CLI is self-contained.
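A minimal illustration of the "Go-native" direction (not the actual HyperShift change): instead of shelling out to jq to pull fields out of az CLI JSON output, decode the JSON with encoding/json. The payload and field names here are hypothetical.

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical JSON of the kind az would print.
	raw := []byte(`{"id": "/subscriptions/123/resourceGroups/example-rg", "name": "example-rg"}`)
	var group struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	}
	if err := json.Unmarshal(raw, &group); err != nil {
		panic(err)
	}
	// Previously something like: az group show -n example-rg | jq -r .id
	fmt.Println(group.ID)
}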
Description of problem:
Circular dependencies in OCP Console prevent migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
If the network to the bootstrap VM is slow, the extract-machine-os.service can time out (after 180s). If this happens, it will be restarted but services that depend on it (like ironic) will never be started even once it succeeds. systemd added support for Restart:on-failure for Type:oneshot services, but they still don't behave the same way as other types of services.
This can be simulated in dev-scripts by doing:
sudo tc qdisc add dev ostestbm root netem rate 33Mbit
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/445
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When deleting an AWS HostedCluster with endpoint access of type PublicAndPrivate or Private, the VPC endpoint for the HostedCluster is not always cleaned up when the HostedCluster is deleted.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Most of the time
Steps to Reproduce:
1. Create a HostedCluster on AWS with endpoint access PublicAndPrivate 2. Wait for the HostedCluster to finish deploying 3. Delete the HostedCluster by deleting the HostedCluster resource (oc delete hostedcluster/[name] -n clusters)
Actual results:
The VPC endpoint and/or the DNS entries in the hypershift.local hosted zone that correspond to the hosted cluster are not removed.
Expected results:
The vpc endpoint and DNS entries in the hypershift.local hosted zone are deleted when the hosted cluster is cleaned up.
Additional info:
With current code, the namespace is deleted before the control plane operator finishes cleanup of the VPC endpoint and related DNS entries.
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/92
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Pulling an image from GCP Artifact Registry fails
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create repo for gcp artifact registry: zhsun-repo1
2. Login to registry
   gcloud auth login
   gcloud auth configure-docker us-central1-docker.pkg.dev
3. Push image to registry
   $ docker pull openshift/hello-openshift
   $ docker tag openshift/hello-openshift:latest us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
   $ docker push us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
4. Create pod
   $ oc new-project hello-gcr
   $ oc new-app --name hello-gcr --allow-missing-images \
     --image us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
5. Check pod status
Actual results:
Pull image failed. must-gather: https://drive.google.com/file/d/1o9cyJB53vQtHNmL5EV_hIx9I_LzMTB0K/view?usp=sharing kubelet log: https://drive.google.com/file/d/1tL7HGc4fEOjH5_v6howBpx2NuhjGKsTp/view?usp=sharing $ oc get po NAME READY STATUS RESTARTS AGE hello-gcr-658f7f9869-76ssg 0/1 ImagePullBackOff 0 3h24m $ oc describe po hello-gcr-658f7f9869-76ssg Warning Failed 14s (x2 over 15s) kubelet Error: ImagePullBackOff Normal Pulling 2s (x2 over 16s) kubelet Pulling image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest" Warning Failed 1s (x2 over 16s) kubelet Failed to pull image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest": rpc error: code = Unknown desc = Requesting bearer token: invalid status code from registry 403 (Forbidden)
Expected results:
Pulling the image from Artifact Registry should succeed.
Additional info:
gcr.io works as expected. us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest doesn't work. $ oc get po -n hello-gcr NAME READY STATUS RESTARTS AGE hello-gcr-658f7f9869-76ssg 0/1 ImagePullBackOff 0 156m hello-gcr2-6d98c475ff-vjkt5 1/1 Running 0 163m $ oc get po -n hello-gcr -o yaml | grep image - image: us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest - image: gcr.io/openshift-qe/hello-gcr:latest
Revert https://issues.redhat.com//browse/CNV-39065
as we don't need this chart anymore
Description of problem:
On the MultiNetworkPolicies page, the "learn more" link does not work
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to Networking -> NetworkPolicies -> MultiNetworkPolicies 2. 3.
Actual results:
Expected results:
Additional info:
It has been observed that the esp_offload kernel module might be loaded by libreswan even if bond ESP offloads have been correctly turned off.
This might be because the ipsec service and configure-ovs run at the same time, so it is possible that the ipsec service starts when bond offloads are not yet turned off, tricking libreswan into thinking they should be used.
The potential fix would be to run ipsec service after configure-ovs.
Description of problem:
Re-enable the knative and A-04-TC01 tests that were disabled in the PR https://github.com/openshift/console/pull/13931
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus/pull/226
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Update the installer to use commit c6bcd313bce0fc9866e41bb9e3487d9f61c628a3 of cluster-api-provider-ibmcloud. This includes a couple of necessary Transit Gateway fixes.
In order to ease CI builds and Konflux integration, and to standardise with other observability plugins, we need to migrate away from yarn and use npm.
The monitoring plugin uses npm instead of yarn for development and in Dockerfiles
Please review the following PR: https://github.com/openshift/cluster-api/pull/222
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-api-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/80
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc-mirror should not panic when an invalid loglevel is specified
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1. Run command: `oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2 --loglevel -h`
Actual results:
The command panic with error: oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2 --loglevel -h 2024/07/31 05:22:41 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/07/31 05:22:41 [INFO] : 👋 Hello, welcome to oc-mirror 2024/07/31 05:22:41 [INFO] : ⚙️ setting up the environment for you... 2024/07/31 05:22:41 [INFO] : 🔀 workflow mode: diskToMirror 2024/07/31 05:22:41 [ERROR] : parsing config error parsing local storage configuration : invalid loglevel -h Must be one of [error, warn, info, debug] panic: StorageDriver not registered: goroutine 1 [running]:github.com/distribution/distribution/v3/registry/handlers.NewApp({0x5634e98, 0x76ea4a0}, 0xc000a7c388) /go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:126 +0x2374github.com/distribution/distribution/v3/registry.NewRegistry({0x5634e98?, 0x76ea4a0?}, 0xc000a7c388) /go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/registry.go:141 +0x56github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).setupLocalStorage(0xc000a78488) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:571 +0x3c6github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc00090f208, {0xc0007ae300, 0x1, 0x8}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:201 +0x27fgithub.com/spf13/cobra.(*Command).execute(0xc00090f208, {0xc0000520a0, 0x8, 0x8}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1github.com/spf13/cobra.(*Command).ExecuteC(0xc00090f208) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ffgithub.com/spf13/cobra.(*Command).Execute(0x74bc8d8?) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13main.main() /go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
Expected results:
Exit with an error; it should not panic
Please review the following PR: https://github.com/openshift/aws-encryption-provider/pull/21
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Update OWNERS subcomponents for Cluster API Providers
context: https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1725451236624749
https://github.com/openshift/origin/pull/28945 is permafailing on metal
https://github.com/openshift/api/pull/1988 may need to be reverted?
Currently, CMO only tests that the plugin Deployment is rolled out with the appropriate config https://github.com/openshift/cluster-monitoring-operator/blob/f7e92e869c43fa0455d656dcfc89045b60e5baa1/test/e2e/config_test.go#L730
The plugin Deployment does not set any readinessProbe; we're missing a check to ensure the plugin is ready to serve requests.
—
With the new plugin backend, a readiness probe can/will be added (see https://github.com/openshift/cluster-monitoring-operator/pull/2412#issuecomment-2315085438); that will help ensure minimal readiness on payload test flavors.
The CMO test can be more demanding and ask for /plugin-manifest.json
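A minimal sketch of what such a probe could look like on the monitoring-plugin container, assuming the plugin serves its manifest at /plugin-manifest.json over HTTPS on port 9443 (both are assumptions here); the timing values are illustrative, not the merged implementation:
~~~
# Hypothetical readinessProbe for the monitoring-plugin container.
# Path, port and timings are assumptions for illustration only.
readinessProbe:
  httpGet:
    scheme: HTTPS
    path: /plugin-manifest.json
    port: 9443
  initialDelaySeconds: 2
  periodSeconds: 10
  failureThreshold: 3
~~~
With a probe like this in place, the CMO e2e test could wait for the Deployment to report ready replicas instead of only checking that the rollout finished.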
Description of problem:
See:
event happened 183 times, something is wrong: node/ip-10-0-52-0.ec2.internal hmsg/9cff2a8527 - reason/ErrorUpdatingResource error creating gateway for node ip-10-0-52-0.ec2.internal: failed to configure the policy based routes for network "default": invalid host address: 10.0.52.0/18 (17:55:20Z) result=reject |
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
When the console-operator performs the health check for the active console route, the retry interval is 50ms, which is too short. It should be bumped to at least a couple of seconds to prevent a burst of requests that would likely return the same result and thus be misleading. We also need to add additional logging around the health check for better debugging.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The GCP environment parameters are missing in a GCP STS environment. Based on feature https://issues.redhat.com/browse/CONSOLE-4176: if the cluster is in GCP WIF mode and the operator claims support for it, the operator subscription page should provide 4 additional fields to configure, which will be set on the Subscription's spec.config.env field.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-23-112324
How reproducible:
Always
Steps to Reproduce:
1. Prepare a cluster with GCP WIF mode enabled 2. Navigate to the OperatorHub page and select 'Auth Token GCP' in the Infrastructure features section 3. Choose one operator and click the install button (eg: Web Terminal) 4. Check the Operator subscription page /operatorhub/subscribe?pkg=web-terminal&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=undefined&channel=fast&version=1.11.0&tokenizedAuth=null
Actual results:
The functionality for feature CONSOLE-4176 is missing
Expected results:
1. The WIF warning message should be shown on the subscription page 2. The user can set POOL_ID, PROVIDER_ID, SERVICE_ACCOUNT_EMAIL on the page
Additional info:
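For reference, a hedged sketch of the Subscription shape the feature is expected to produce; the operator name and placeholder values below are illustrative assumptions, and only the three env vars named above are shown:
~~~
# Hypothetical Subscription after the console sets the WIF fields.
# All names and values are placeholders for illustration.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: web-terminal
  namespace: openshift-operators
spec:
  channel: fast
  name: web-terminal
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: POOL_ID
      value: <workload-identity-pool-id>
    - name: PROVIDER_ID
      value: <workload-identity-provider-id>
    - name: SERVICE_ACCOUNT_EMAIL
      value: <service-account-email>
~~~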
Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Adding an ATB to the HC doesn't create a new user token and doesn't reconcile to the worker nodes
Version-Release number of selected component (if applicable):
4.18 nightly
How reproducible:
100%
Steps to Reproduce:
1. Create a 4.18 nightly HC 2. Add an ATB to the HC 3. Notice that no new user token is generated
Actual results:
no new user token generated so no new payload
Expected results:
new user token generated with new payload
Additional info:
Hello Team,
When we deploy a HyperShift cluster with OpenShift Virtualization and specify the NodePort strategy for services, the requests to ignition, oauth, konnectivity (for oc rsh, oc logs, oc exec) and the virt-launcher (HyperShift node pool) pods fail, because the following netpols get created automatically by default and restrict the traffic on all other ports.
$ oc get netpol NAME POD-SELECTOR AGE kas app=kube-apiserver 153m openshift-ingress <none> 153m openshift-monitoring <none> 153m same-namespace <none> 153m
I resolved this by manually creating the following network policies:
$ cat ingress-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress
spec:
  ingress:
  - ports:
    - port: 31032
      protocol: TCP
  podSelector:
    matchLabels:
      kubevirt.io: virt-launcher
  policyTypes:
  - Ingress
$ cat oauth-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: oauth
spec:
  ingress:
  - ports:
    - port: 6443
      protocol: TCP
  podSelector:
    matchLabels:
      app: oauth-openshift
      hypershift.openshift.io/control-plane-component: oauth-openshift
  policyTypes:
  - Ingress
$ cat ignition-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nodeport-ignition-proxy
spec:
  ingress:
  - ports:
    - port: 8443
      protocol: TCP
  podSelector:
    matchLabels:
      app: ignition-server-proxy
  policyTypes:
  - Ingress
$ cat konn-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: konn
spec:
  ingress:
  - ports:
    - port: 8091
      protocol: TCP
  podSelector:
    matchLabels:
      app: kube-apiserver
      hypershift.openshift.io/control-plane-component: kube-apiserver
  policyTypes:
  - Ingress
The bug for ignition netpol has already been reported.
--> https://issues.redhat.com/browse/OCPBUGS-39158
--> https://issues.redhat.com/browse/OCPBUGS-39317
It would be helpful if these policies were created automatically as well, or if HyperShift offered an option to disable the automatic management of network policies so that we can take care of the network policies manually.
Description of problem:
ose-aws-efs-csi-driver-operator has an invalid `tools` reference that causes the build to fail; this issue is due to https://github.com/openshift/csi-operator/pull/252/files#r1719471717
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
"import * as icon from '[...].svg' " imports cause errors on webpack5/rspack (can't convert value to primitive type). They should be rewritten as "import icon from '[...].svg'"
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This is a follow-up issue to https://issues.redhat.com/browse/OCPBUGS-4496
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-08-130531
How reproducible:
Always
Steps to Reproduce:
1. create a ConfigMap ConsoleYAMLSample without 'snippet: true' apiVersion: console.openshift.io/v1 kind: ConsoleYAMLSample metadata: name: cm-example-without-snippet spec: targetResource: apiVersion: v1 kind: ConfigMap title: Example ConfigMap description: An example ConfigMap YAML sample yaml: | apiVersion: v1 kind: ConfigMap metadata: name: game-demo data: player_initial_lives: "3" ui_properties_file_name: "user-interface.properties" game.properties: | enemy.types=aliens,monsters player.maximum-lives=5 user-interface.properties: | color.good=purple color.bad=yellow allow.textmode=true 2. goes to ConfigMap creation page -> YAML view 3. create a ConfigMap ConsoleYAMLSample WITH 'snippet: true' apiVersion: console.openshift.io/v1 kind: ConsoleYAMLSample metadata: name: cm-example-without-snippet spec: targetResource: apiVersion: v1 kind: ConfigMap title: Example ConfigMap description: An example ConfigMap YAML sample snippet: true yaml: | apiVersion: v1 kind: ConfigMap metadata: name: game-demo data: player_initial_lives: "3" ui_properties_file_name: "user-interface.properties" game.properties: | enemy.types=aliens,monsters player.maximum-lives=5 user-interface.properties: | color.good=purple color.bad=yellow allow.textmode=true 4. goes to ConfigMap creation page -> YAML view
Actual results:
2. Sample tab doesn't show up 4. Snippet tab appears
Expected results:
2. Sample tab should show up when there is no snippet: true
Additional info:
Description of problem:
Test Platform has detected a large increase in the amount of time spent waiting for pull secrets to be initialized. Monitoring the audit log, we can see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster). The controller manager is filled with entries like:
- "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4"
- "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org"
In this "Observed image registry urls" log line, notice the duplicate entries for "registry.build01.ci.openshift.org". We are not sure what is causing this duplicate entry, but when actualized in a pull secret map, the double entry is reduced to one, so the controller-manager finds the cardinality mismatch on the next check.
The duplication is evident in OpenShiftControllerManager/cluster:
dockerPullSecret:
  internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
  registryURLs:
  - registry.build01.ci.openshift.org
  - registry.build01.ci.openshift.org
But there is only one hostname in config.imageregistry.operator.openshift.io/cluster:
routes:
- hostname: registry.build01.ci.openshift.org
  name: public-routes
  secretName: public-route-tls
Version-Release number of selected component (if applicable):
4.17.0-rc.3
How reproducible:
Constant on build01 but not on other build farms
Steps to Reproduce:
1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager. 2. 3.
Actual results:
- Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces. - The openshift-controller-manager is hot looping and experiencing client throttling.
Expected results:
1. Initialization of pull secrets in a namespace should take < 1 seconds. On build01, it can take over 1.5 minutes. 2. openshift-controller-manager should not possess duplicate entries. 3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries. 4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/238
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-machine-approver-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Topology screen crashes and reports "Oh no! something went wrong" when a pod in completed state is selected.
Version-Release number of selected component (if applicable):
RHOCP 4.15.18
How reproducible:
100%
Steps to Reproduce:
1. Switch to developer mode 2. Select Topology 3. Select a project that has completed cron jobs like openshift-image-registry 4. Click the green CronJob Object 5. Observe Crash
Actual results:
The Topology screen crashes with error "Oh no! Something went wrong."
Expected results:
After clicking the completed pod / workload, the screen should display the information related to it.
Additional info:
The error below was solved in this PR https://github.com/openshift/hypershift/pull/4723, but we can do better sanitisation of the IgnitionServer payload. This is the suggestion from Alberto in Slack: https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1726257008913779?thread_ts=1726241321.475839&cid=G01QS0P2F6W
✗ [High] Cross-site Scripting (XSS) Path: ignition-server/cmd/start.go, line 250 Info: Unsanitized input from an HTTP header flows into Write, where it is used to render an HTML page returned to the user. This may result in a Reflected Cross-Site Scripting attack (XSS).
Description of problem:
When viewing binary secret data, we also provide the 'Reveal/Hide values' option, which is redundant
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-22-123921
How reproducible:
Always
Steps to Reproduce:
1. create a Key/Value secret when the data is binary file, Workloads -> Secrets -> Create Key/value secret -> upload binary file as secret data -> Create 2. check data on Secret details page 3.
Actual results:
2. Both options, Save file and Reveal/Hide values, are provided. But the `Reveal/Hide values` button makes no sense since the data is a binary file
Expected results:
2. Only show 'Save file' option for binary data
Additional info:
Please review the following PR: https://github.com/openshift/coredns/pull/130
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
HostedCluster dump not working anymore in OpenStack CI
Version-Release number of selected component (if applicable):
4.18 and 4.17
Description of problem:
In the case of OpenStack, the network operator tries and fails to update the infrastructure resource.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
always
Steps to Reproduce:
1. Install hypershift 2. Create openstack hosted cluster
Actual results:
Network operator fails to report as available due to: - lastTransitionTime: "2024-08-22T15:54:16Z" message: 'Error while updating infrastructures.config.openshift.io/cluster: failed to apply / update (config.openshift.io/v1, Kind=Infrastructure) /cluster: infrastructures.config.openshift.io "cluster" is forbidden: ValidatingAdmissionPolicy ''config'' with binding ''config-binding'' denied request: This resource cannot be created, updated, or deleted. Please ask your administrator to modify the resource in the HostedCluster object.' reason: UpdateInfrastructureSpecOrStatus status: "True" type: network.operator.openshift.io/Degraded
Expected results:
Cluster operator becomes available
Additional info:
This is a bug introduced with https://github.com/openshift/hypershift/pull/4303
Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/32
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: Unnecessary warning notification message on debug pod.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Use an imageSetConfig with an operator catalog.
Do a mirror-to-mirror.
Without removing the working-dir or the cache, do a mirror-to-mirror again.
It fails with the error: filtered declarative config not found
We think that low disk space is likely the cause of https://issues.redhat.com/browse/OCPBUGS-37785
It's not immediately obvious that this happened during the run without digging into the events.
Could we create a new test to enforce that the kubelet never reports disk pressure during a run?
Rebase openshift/etcd to latest 3.5.17 upstream release.
Description of problem:
IHAC who is facing the same problem as OCPBUGS-17356 in an OCP 4.16 cluster. There is no ContainerCreating pod, and the alert firing appears to be a false positive.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.15
Additional info:
This is very similar to OCPBUGS-17356
Description of problem:
When we try to delete the MachineOSConfig while it is still in the building state, the resources related to the MOSC are deleted, but the associated ConfigMap is not. Hence, when we try to apply the MOSC again in the same pool, the status of the MOSB is not properly generated. To resolve the issue we have to manually delete the leftover ConfigMap resources.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-07-10-022831 True False 3h15m Cluster version is 4.16.0-0.nightly-2024-07-10-022831
How reproducible:
Steps to Reproduce:
1. Create CustomMCP 2. Apply any MOSC 3. Delete the MOSC while it is still in building stage 4. Again apply the MOSC 5. Check the MOSB status oc get machineosbuilds. NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED infra-rendered-infra-371dc5d02dbe0bb5712857393db95bf3-builder False
Actual results:
oc get machineosbuilds NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED infra-rendered-infra-371dc5d02dbe0bb5712857393db95bf3-builder False
Expected results:
We should be able to see the status
Additional info:
Check the logs of machine-os-builder $ oc logs machine-os-builder-74d56b55cf-mp6mv | grep -i error I0710 11:05:56.750770 1 build_controller.go:474] Error syncing machineosbuild infra3: could not start build for MachineConfigPool infra: could not load rendered MachineConfig mc-rendered-infra-371dc5d02dbe0bb5712857393db95bf3 into configmap: configmaps "mc-rendered-infra-371dc5d02dbe0bb5712857393db95bf3" already exists
After looking at this test run we need to validate the following scenarios:
Do the monitor tests in openshift/origin accurately test these scenarios?
Please review the following PR: https://github.com/openshift/oc/pull/1866
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/196
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ROSA HCP allows customers to select hostedcluster and nodepool OCP z-stream versions, respecting version skew requirements. E.g.:
Version-Release number of selected component (if applicable):
Reproducible on 4.14-4.16.z, this bug report demonstrates it for a 4.15.28 hostedcluster with a 4.15.25 nodepool
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA HCP cluster, which comes with a 2-replica nodepool with the same z-stream version (4.15.28) 2. Create an additional nodepool at a different version (4.15.25)
Actual results:
Observe that while nodepool objects report the different version (4.15.25), the resulting kernel version of the node is that of the hostedcluster (4.15.28) ❯ k get nodepool -n ocm-staging-2didt6btjtl55vo3k9hckju8eeiffli8 NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE mshen-hyper-np-4-15-25 mshen-hyper 1 1 False True 4.15.25 False False mshen-hyper-workers mshen-hyper 2 2 False True 4.15.28 False False ❯ k get no -owide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME ip-10-0-129-139.us-west-2.compute.internal Ready worker 24m v1.28.12+396c881 10.0.129.139 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9 ip-10-0-129-165.us-west-2.compute.internal Ready worker 98s v1.28.12+396c881 10.0.129.165 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9 ip-10-0-132-50.us-west-2.compute.internal Ready worker 30m v1.28.12+396c881 10.0.132.50 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
Expected results:
Additional info:
Description of problem:
When running the `make fmt` target in the repository the command can fail due to a mismatch of versions between the go language and the goimports dependency.
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
always
Steps to Reproduce:
1.checkout release-4.16 branch 2.run `make fmt`
Actual results:
INFO[2024-10-01T14:41:15Z] make fmt make[1]: Entering directory '/go/src/github.com/openshift/cluster-cloud-controller-manager-operator' hack/goimports.sh go: downloading golang.org/x/tools v0.25.0 go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.25.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local)
Expected results:
successful completion of `make fmt`
Additional info:
Our goimports.sh script references `goimports@latest`, which means this problem will most likely affect older branches as well; we would need to pin a specific version of the goimports package for those branches. Given that the CCCMO includes golangci-lint and uses it for a test, we should include goimports through golangci-lint, which will solve this problem without needing special versions of goimports.
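A minimal sketch of how goimports could be enabled through the existing golangci-lint setup, assuming the v1 configuration format; the local-prefixes value is illustrative:
~~~
# Hypothetical .golangci.yaml fragment; the module path used for
# local-prefixes is an assumption for illustration.
linters:
  enable:
    - goimports
linters-settings:
  goimports:
    local-prefixes: github.com/openshift/cluster-cloud-controller-manager-operator
~~~
Running the existing golangci-lint target would then cover import formatting without a separate goimports download.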
Description of problem:
OLM 4.17 references 4.16 catalogs
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. oc get pods -n openshift-marketplace -o yaml | grep "image: registry.redhat.io"
Actual results:
image: registry.redhat.io/redhat/certified-operator-index:v4.16 image: registry.redhat.io/redhat/certified-operator-index:v4.16 image: registry.redhat.io/redhat/community-operator-index:v4.16 image: registry.redhat.io/redhat/community-operator-index:v4.16 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16 image: registry.redhat.io/redhat/redhat-operator-index:v4.16 image: registry.redhat.io/redhat/redhat-operator-index:v4.16
Expected results:
image: registry.redhat.io/redhat/certified-operator-index:v4.17 image: registry.redhat.io/redhat/certified-operator-index:v4.17 image: registry.redhat.io/redhat/community-operator-index:v4.17 image: registry.redhat.io/redhat/community-operator-index:v4.17 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17 image: registry.redhat.io/redhat/redhat-operator-index:v4.17 image: registry.redhat.io/redhat/redhat-operator-index:v4.17
Additional info:
Description of problem:
With 'Configuring a private storage endpoint on Azure by enabling the Image Registry Operator to discover VNet and subnet names'[1], if a cluster is created with the internal Image Registry, it creates a storage account with a private endpoint. Once a new PVC uses the same skuName as this private storage account, it hits the mount permission issue. [1] https://docs.openshift.com/container-platform/4.16/post_installation_configuration/configuring-private-cluster.html#configuring-private-storage-endpoint-azure-vnet-subnet-iro-discovery_configuring-private-cluster
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster with the flexy job profile aos-4_17/ipi-on-azure/versioned-installer-customer_vpc-disconnected-fully_private_cluster-arm and specify enable_internal_image_registry: "yes" 2. Create a pod and PVC with the azurefile-csi SC
Actual results:
pod failed to up due to mount error: mount //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 on /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount failed with mount failed: exit status 32 Mounting command: mount Mounting arguments: -t cifs -o mfsymlinks,cache=strict,nosharesock,actimeo=30,gid=1018570000,file_mode=0777,dir_mode=0777, //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount Output: mount error(13): Permission denied
Expected results:
Pod should be up
Additional info:
We have simple workarounds like using a StorageClass with networkEndpointType: privateEndpoint or specifying another storage account, but using the pre-defined StorageClass azurefile-csi will fail, and the automation is not easy to work around. I'm not sure whether the CSI driver could check if the reused storage account has a private endpoint before using the existing storage account.
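A sketch of the workaround StorageClass mentioned above, assuming the Azure File CSI driver's networkEndpointType parameter; everything else mirrors a typical azurefile-csi class and is an assumption, not the shipped default:
~~~
# Hypothetical workaround StorageClass for the private-endpoint case.
# Parameters other than networkEndpointType are illustrative assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-private
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
  networkEndpointType: privateEndpoint
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
~~~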
Description of problem:
Running "make fmt" in the repository fails with an error about a version mismatch between goimports and the go language version.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. checkout release-4.16 branch 2. run "make fmt" (with golang version 1.21)
Actual results:
openshift-hack/check-fmt.sh go: downloading golang.org/x/tools v0.26.0 go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.26.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local) make: *** [openshift.mk:18: fmt] Error 1
Expected results:
completion without errors
Additional info:
This is affecting us currently with 4.16 and earlier, but it will become a persistent problem over time. We can correct this by using a holistic approach, such as calling goimports from the binary that is included in our build images.
Please review the following PR: https://github.com/openshift/csi-operator/pull/114
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Machine got stuck in Provisioning phase after the EC2 gets terminated by AWS. The scenario I got this problem was when running an rehearsal cluster in a under development[1] job[2] for AWS Local Zone. The EC2 created through MachineSet template was launched in the Local Zone us-east-1-qro-1a, but the instance was terminated right after it was created with this message[3] (From AWS Console): ~~~ Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 87040 GiB in this location. Please contact AWS Support for more information. ~~~ When I saw this problem in the Console, I removed the Machine object and the MAPI was able to create a new instance in the same Zone: ~~~ $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init) sh-4.4$ oc get machines -A NAMESPACE NAME PHASE TYPE REGION ZONE AGE openshift-machine-api ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx Provisioning 45m sh-4.4$ oc delete machine ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx -n openshift-machine-api machine.machine.openshift.io "ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" deleted (...) $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init) sh-4.4$ oc get machines -n openshift-machine-api -w NAME PHASE TYPE REGION ZONE AGE ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-v675j Provisioned c5.2xlarge us-east-1 us-east-1-qro-1a 2m6s ~~~ The job[2] didn't finish successfully due the timeout checking for node readiness, but the Machine got provisioned correctly (without Console errors) and kept in running state. 
The main problem I can see in the logs of Machine Controller is an endless loop trying to reconcile an terminated machine/instance (i-0fc8f2e7fe7bba939): ~~~ 2023-06-20T19:38:01.016776717Z I0620 19:38:01.016760 1 controller.go:156] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling Machine 2023-06-20T19:38:01.016776717Z I0620 19:38:01.016767 1 actuator.go:108] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator checking if machine exists 2023-06-20T19:38:01.079829331Z W0620 19:38:01.079800 1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.132099118Z E0620 19:38:01.132063 1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.132099118Z I0620 19:38:01.132080 1 reconciler.go:296] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Instance does not exist 2023-06-20T19:38:01.132146892Z I0620 19:38:01.132096 1 controller.go:349] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling machine triggers idempotent create 2023-06-20T19:38:01.132146892Z I0620 19:38:01.132101 1 actuator.go:81] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator creating machine 2023-06-20T19:38:01.132489856Z I0620 19:38:01.132460 1 reconciler.go:41] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: creating machine 2023-06-20T19:38:01.190935211Z W0620 19:38:01.190901 1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.238693678Z E0620 19:38:01.238661 1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.238693678Z I0620 19:38:01.238680 1 machine_scope.go:90] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: patching machine 2023-06-20T19:38:01.249796760Z E0620 19:38:01.249761 1 actuator.go:72] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx error: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue 2023-06-20T19:38:01.249824958Z W0620 19:38:01.249796 1 controller.go:351] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: failed to create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. 
Possible eventual-consistency discrepancy; returning an error to requeue 2023-06-20T19:38:01.249858967Z E0620 19:38:01.249847 1 controller.go:324] "msg"="Reconciler error" "error"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue" "controller"="machine-controller" "name"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" "namespace"="openshift-machine-api" "object"={"name":"ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx","namespace":"openshift-machine-api"} "reconcileID"="8890f9f7-2fbf-441d-a8b7-a52ec5f4ae2f" ~~~ I also reviewed the Account quotas for EBS gp2 and we are under the limits. The second machine was also provisioned, so I would discard any account quotas, and focus on the capacity issues in the Zone - considering Local Zone does not have high capacity as regular zones, it could happen more frequently. I am asking the AWS teams a RCA, asking more clarification how we can programatically get this error (maybe EC2 API, I didn't described the EC2 when the event happened). [1] https://github.com/openshift/release/pull/39902#issuecomment-1599559108 [2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39902/rehearse-39902-pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones/1671215930459295744 [3] https://user-images.githubusercontent.com/3216894/247285243-3cd28306-2972-4576-a9a6-a620e01747a6.png
Version-Release number of selected component (if applicable):
4.14.0-0.ci.test-2023-06-20-191559-ci-op-ljs7pd35-latest
How reproducible:
- Rarely, by AWS (mainly in zone capacity issues - an RCA has been requested from AWS to check if we can find options to reproduce)
Steps to Reproduce:
this is hard to reproduce as the EC2 had been terminated by AWS. I created one script to watch the specific subnet ID and terminate any instances created on it instantaneously, but the Machine is going to the Failed phase and getting stuck on it - and not the "Provisioning" as we got in the CI job. Steps to try to reproduce: 1. Create a cluster with Local Zone support: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-localzone.html 2. Wait for the cluster be created 3. Scale down the MachineSet for the Local Zone 4. Start a new Terminal(#2): watch and terminate EC2 instance created in an Local Zone subnet (example: us-east-1-bue-1a) ~~~ machineset_monitor="byonet1-sc9fb-edge-us-east-1-bue-1a" # discover the subnet ID subnet_id=$(oc get machineset $machineset_monitor -n openshift-machine-api -o json | jq -r .spec.template.spec.providerSpec.value.subnet.id) # discover the zone name zone_name="$(aws ec2 describe-subnets --subnet-ids $subnet_id --query 'Subnets[].AvailabilityZone' --output text)" # Discover instance ids in the subnet and terminate it while true; do echo "$(date): Getting instance in the zone ${zone_name} / subnet ${subnet_id}..." instance_ids=$(aws ec2 describe-instances --filters Name=subnet-id,Values=$subnet_id Name=instance-state-name,Values=pending,running,shutting-down,stopping --query 'Reservations[].Instances[].InstanceId' --output text) echo "$(date): Instances retrieved: $instance_ids" if [[ -n "$instance_ids" ]]; then echo "Terminating instances..." aws ec2 terminate-instances --instance-ids $instance_ids sleep 1 else echo "Awaiting..." sleep 2 fi done ~~~ 4. Scale up the MachineSet 5. Observe the Machines
Actual results:
Expected results:
- The Machine moves to the Failed phase when the EC2 instance is terminated by AWS, or - the Machine self-recovers when the EC2 instance is deleted/terminated (by deleting the Machine object when it is managed by a MachineSet), so we can avoid manual steps
Additional info:
Description of problem:
release-4.18 of openshift/cloud-provider-openstack should be based off upstream release-1.31 branch.
Description of problem:
Multiple monitoring-plugin Pods log the /health response code every 10s, so there will be too many log entries as time goes by
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-150616
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
% oc -n openshift-monitoring logs monitoring-plugin-76b8c847f6-m872m time="2024-09-10T07:55:52Z" level=info msg="enabled features: []\n" module=main time="2024-09-10T07:55:52Z" level=warning msg="cannot read config file, serving plugin with default configuration, tried /etc/plugin/config.yaml" error="open /etc/plugin/config.yaml: no such file or directory" module=server time="2024-09-10T07:55:52Z" level=info msg="listening on https://:9443" module=server 10.128.2.2 - - [10/Sep/2024:07:55:53 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:55:58 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:08 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:18 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:28 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:38 +0000] "GET /health HTTP/2.0" 200 2 ... $ oc -n openshift-monitoring logs monitoring-plugin-76b8c847f6-m872m | grep "GET /health HTTP/2.0" | wc -l 1967
Expected results:
Before we switched to the golang backend, there were usually not many logs
Additional info:
Description of problem:
Running https://github.com/shiftstack/installer/blob/master/docs/user/openstack/README.md#openstack-credentials-update leads to Cinder PVCs stuck in Terminating status:
$ oc get pvc -A NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE cinder-test pvc-0 Terminating pvc-d7d37d04-d8d1-4a61-a3bc-c038e53a13c7 1Gi RWO standard-csi <unset> 12h cinder-test pvc-1 Terminating pvc-32049f0e-b842-4e54-aff8-5f41f51b3c54 1Gi RWO standard-csi <unset> 12h cinder-test pvc-2 Terminating pvc-3eb42d8a-f22f-418b-881e-21c913b89c56 1Gi RWO standard-csi <unset> 12h
The cinder-csi-controller reports the error below:
E1022 07:21:11.772540 1 utils.go:95] [ID:4401] GRPC error: rpc error: code = Internal desc = DeleteVolume failed with error Expected HTTP response code [202 204] when accessing [DELETE https://10.46.44.159:13776/v3/c27fbb9d859e40cc9 6f82e47b5ceebd6/volumes/bd5e6cf9-f27e-4aff-81ac-a83e7bccea86], but got 400 instead: {"badRequest": {"code": 400, "message": "Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer."}}
However, in OpenStack, they appear in-use:
stack@undercloud-0 ~]$ OS_CLOUD=shiftstack openstack volume list
/usr/lib/python3.9/site-packages/osc_lib/utils/__init__.py:515: DeprecationWarning: The usage of formatter functions is now discouraged. Consider using cliff.columns.FormattableColumn instead. See reviews linked with bug 1687955 for more
detail.
warnings.warn(
+--------------------------------------+------------------------------------------+-----------+------+------------------------------------------------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+------------------------------------------+-----------+------+------------------------------------------------------+
| 093b14c1-a79a-46aa-ab6b-6c71d2adcef9 | pvc-3eb42d8a-f22f-418b-881e-21c913b89c56 | in-use | 1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdd |
| 4342c947-732d-4d23-964c-58bd56b79fd4 | pvc-32049f0e-b842-4e54-aff8-5f41f51b3c54 | in-use | 1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdc |
| 6da3147f-4ce8-4e17-a29a-6f311599a969 | pvc-d7d37d04-d8d1-4a61-a3bc-c038e53a13c7 | in-use | 1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdb |
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-21-010606 RHOS-17.1-RHEL-9-20240701.n.1
How reproducible:
Always (twice in a row)
Additional info:
must-gather provided in private comment
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1083
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" is failing
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-10-195326 $ oc version Client Version: 4.18.0-202410080912.p0.g3692450.assembly.stream-3692450 Kustomize Version: v5.4.2 Server Version: 4.18.0-0.nightly-2024-10-10-195326 Kubernetes Version: v1.31.1
How reproducible:
Always
Steps to Reproduce:
1. Execute the "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" with the right oc binary for the tested version 2. 3.
Actual results:
The "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" command fails with this error: $ oc adm ocp-certificates regenerate-machine-config-server-serving-cert W1011 10:13:41.951040 2699876 recorder_logging.go:53] &Event{ObjectMeta:{dummy.17fd5e657c5748ca dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:SecretUpdateFailed,Message:Failed to update Secret/: Secret "machine-config-server-tls" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable,Source:EventSource{Component:,Host:,},FirstTimestamp:2024-10-11 10:13:41.950941386 +0000 UTC m=+0.377199185,LastTimestamp:2024-10-11 10:13:41.950941386 +0000 UTC m=+0.377199185,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,} The Secret "machine-config-server-tls" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable
Expected results:
The command should be executed without errors
Additional info:
This line is repeated many times, about once a second when provisioning a new cluster:
level=debug msg= baremetalhost resource not yet available, will retry
Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/79
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/1113
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/548
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oc/pull/1867
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/118
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
must-gather creates an empty monitoring/prometheus/rules.json file due to the error "Unable to connect to the server: x509: certificate signed by unknown authority"
Version-Release number of selected component (if applicable):
4.9
How reproducible:
not sure what customer did on certs
Steps to Reproduce:
1. 2. 3.
Actual results:
monitoring/prometheus/rules.json is empty, while monitoring/prometheus/rules.sterr contains error message "Unable to connect to the server: x509: certificate signed by unknown authority"
Expected results:
As must-gather runs only inside the cluster, it should be safe to skip certificate verification when data is queried from Prometheus
Additional info:
https://attachments.access.redhat.com/hydra/rest/cases/03329385/attachments/e89af78a-3e35-4f1a-a13c-46f05ff755cc?usePresignedUrl=true should contain an example
High flake rate on new EnsureValidatingAdmissionPolicies e2e tests
EnsureValidatingAdmissionPoliciesDontBlockStatusModifications
EnsureValidatingAdmissionPoliciesCheckDeniedRequests
EnsureValidatingAdmissionPoliciesExists
High concentration on quickly completing test clusters like `TestNoneCreateCluster` and `TestHAEtcdChaos`
Please review the following PR: https://github.com/openshift/origin/pull/29071
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Because SNO replaces the api-server during an upgrade, the storage-operator's csi-snapshot-container exits because it cannot retrieve a CR, causing a crash-loop back-off for the period where the api-server is down; this also affects other tests during this same time frame. We will be resolving each one of these individually and updating the tests for the time being to unblock the problems.
Additional context here:
https://redhat-internal.slack.com/archives/C0763QRRUS2/p1728567187172169
Description of problem:
When deploying nodepools on OpenStack, the Nodepool condition complains about unsupported amd64 while we actually support it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As part of TRT investigations of k8s API disruptions, we have discovered that there are times when haproxy considers the underlying apiserver as Down, yet from the k8s perspective the apiserver is healthy and functional.
From the customer perspective, during this time any call to the cluster API endpoint will fail. It simply looks like an outage.
Thorough investigation leads us to the following difference in how haproxy perceives apiserver being alive versus how k8s perceives it, i.e.
inter 1s fall 2 rise 3
and
readinessProbe:
  httpGet:
    scheme: HTTPS
    port: 6443
    path: readyz
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
We can see the top check is much stricter. And it belongs to haproxy. As a result, haproxy sees the following
2024-10-08T12:37:32.779247039Z [WARNING] (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
much faster than k8s would consider something as wrong.
In order to remediate this issue, it has been agreed the haproxy checks should be softened and adjusted to the k8s readiness probe.
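As an illustration only, a softened server line aligned with the kubelet readiness probe above could look like the following. The backend name, server address, and the exact timing values are assumptions, not the merged fix:
backend masters
  # check every 5s and only mark the server down after 3 consecutive failures,
  # mirroring periodSeconds: 5 / failureThreshold: 3 from the readiness probe
  server master-2 192.0.2.12:6443 check inter 5s fall 3 rise 1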
Description of problem:
periodics are failing due to a change in coreos.
Version-Release number of selected component (if applicable):
4.15,4.16,4.17,4.18
How reproducible:
100%
Steps to Reproduce:
1. Check any periodic conformance jobs 2. 3.
Actual results:
periodic conformance fails with hostedcluster creation
Expected results:
periodic conformance test succeeds
Additional info:
We want to use crun as default in 4.18, but upstream cri-o switched before we're ready.
Description of problem:
Console user settings are saved in a ConfigMap for each user in the namespace openshift-console-user-settings.
The console frontend uses the k8s API to read and write that ConfigMap. The console backend creates a ConfigMap with a Role and RoleBinding for each user, giving that single user read and write access to his/her own ConfigMap.
The number of Roles and RoleBindings might degrade cluster performance. This has happened in the past, especially on the Developer Sandbox, where a long-living cluster creates new users that are then automatically removed after a month. Keeping the Role and RoleBinding around results in performance issues.
The resources had an ownerReference before 4.15 so that the 3 resources (1 ConfigMap, 1 Role, 1 RoleBinding) were automatically removed when the User resource was deleted. This ownerReference was removed in 4.15 to support external OIDC providers.
The ask in this issue is to restore that ownerReference for the OpenShift auth provider.
History:
See also:
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
Always
Steps to Reproduce:
Actual results:
The three resources weren't deleted after the user was deleted.
Expected results:
The three resources should be deleted after the user is deleted.
Additional info:
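A minimal sketch of what the restored ownership could look like on the per-user ConfigMap (the resource name, user name, and uid are placeholders; the same ownerReference would also be set on the Role and RoleBinding):
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-settings-<user-uid>
  namespace: openshift-console-user-settings
  ownerReferences:
    - apiVersion: user.openshift.io/v1
      kind: User
      name: developer
      uid: 00000000-0000-0000-0000-000000000000   # uid of the User; garbage collection removes the ConfigMap with it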
-> While upgrading the cluster from 4.13.38 -> 4.14.18, it is stuck on CCO; clusterversion is complaining about
"Working towards 4.14.18: 690 of 860 done (80% complete), waiting on cloud-credential".
While checking further we see that CCO deployment is yet to rollout.
-> ClusterOperator status.versions[name=operator] isn't a narrow "CCO Deployment is updated", it's "the CCO asserts the whole CC component is updated", which requires (among other things) a functional CCO Deployment. Seems like you don't have a functional CCO Deployment, because logs have it stuck talking about asking for a leader lease. You don't have Kube API audit logs to say if it's stuck generating the Lease request, or waiting for a response from the Kube API server.
Description of problem:
My customer is trying to install OCP 4.15 IPv4/v6 dual stack with IPv6 primary using IPI-OpenStack (platform: openstack) on OSP 17.1. However, it fails with the following error:
~~~
$ ./openshift-install create cluster --dir ./
:
ERROR: Bootstrap failed to complete: Get "https://api.openshift.example.com:6443/version": dial tcp [2001:db8::5]:6443: i/o timeout
~~~
On the bootstrap node, the VIP "2001:db8::5" is not set.
~~~
$ ip addr
:
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether aa:aa:aa:aa:aa:aa brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.3/24 brd 10.0.0.254 scope global dynamic noprefixroute enp3s0
       valid_lft 40000sec preferred_lft 40000sec
    inet6 2001:db8::3/128 scope global noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::aaaa:aaff:feaa:aaaa/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
~~~
As far as I investigated, the reason why the VIP is not set is that "nameserver" is not properly set in /etc/resolv.conf. Because of this, name resolution doesn't work on the bootstrap node.
~~~
$ cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 127.0.0.1
search openshift.example.com
~~~
==> There should be a nameserver entry which is advertised by DHCPv6 or DHCPv4. However, there is only 127.0.0.1.
/var/run/NetworkManager/resolv.conf has a proper "nameserver" entry which is advertised by DHCPv6:
~~~
# cat /var/run/NetworkManager/resolv.conf
# Generated by NetworkManager
search openshift.example.com
nameserver 2001:db8::8888
~~~
In IPI-openstack installation, /etc/resolv.conf is generated from /var/run/NetworkManager/resolv.conf by the following script:
https://github.com/openshift/installer/blob/9938156e81b5c0085774b2ec56a4be075413fd2d/data/data/bootstrap/openstack/files/etc/NetworkManager/dispatcher.d/30-local-dns-prepender
I'm wondering if the above script doesn't work well due to a timing issue, race condition, or something like that.
And according to the customer, this issue depends on the DNS setting:
- When DNS server info is advertised only by IPv4 DHCP: the issue occurs
- When DNS server info is advertised only by IPv6 DHCP: the issue occurs
- When DNS server info is advertised by both IPv4 and IPv6 DHCP: the issue does NOT occur
Version-Release number of selected component (if applicable):
OCP 4.15 IPI-OpenStack
How reproducible:
Steps to Reproduce:
1. Create a provider network on OSP 17.1 2. Create an IPv4 subnet and an IPv6 subnet on the provider network 3. Set dns-nameserver using the "openstack subnet set --dns-nameserver" command on only one of the IPv4 or IPv6 subnets 4. Run IPI-OpenStack installation on the provider network
Actual results:
IPI-openstack installation fails because nameserver of /etc/resolv.conf on bootstrap node is not set properly
Expected results:
IPI-openstack installation succeeds and nameserver of /etc/resolv.conf on bootstrap node is set properly
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/8965
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/331
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Modify the import to strip or change the bootOptions.efiSecureBootEnabled
https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319
archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}
// Read the OVF descriptor from the OVA (uses govmomi's importx plus crypto/sha256, io, os, and fmt).
ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
    // Open the corrupt OVA file
    f, ferr := os.Open(cachedImage)
    if ferr != nil {
        return ferr
    }
    defer f.Close()
    // Get a sha256 on the corrupt OVA file
    // and the size of the file
    h := sha256.New()
    written, cerr := io.Copy(h, f)
    if cerr != nil {
        return cerr
    }
    return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}
ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
    return err
}
Description of problem:
OCP UI enabled ES and FR recently and a new memsource project template was created for the upload operation. So we need to update the memsource-upload.sh script to make use of the new project template ID.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/322
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-control-plane-machine-set-operator-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Users could filter routes by status on OCP 4.16 and earlier, but this filter disappeared on OCP 4.17 and 4.18.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-10-09-114619 4.18.0-0.nightly-2024-10-09-113533
How reproducible:
Always
Steps to Reproduce:
1. Check the routes list page. 2. 3.
Actual results:
1. There is no filter for the status field.
Expected results:
1. There should be a filter for the status field. Refer to the filter on 4.16: https://drive.google.com/file/d/1j0QdO98cMy0ots8rtHdB82MSWilxkOGr/view?usp=drive_link
Additional info:
Description of problem:
When creating a sample application from the OCP Dev Console, the deployments, services, and routes get created, but it does not create any BuildConfigs for the application, and hence the application throws: ImagePullBackOff: Back-off pulling image "nodejs-sample:latest"
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. OCP Web Console -> Developer Mode -> Add -> Samples -> Select Any "Builder Images" type Application -> Create 2. Check BuildConfig for this application. 3.
Actual results:
No BuildConfig gets created.
Expected results:
Application should create a build and the image should be available for the application deployment.
Additional info:
Please review the following PR: https://github.com/openshift/node_exporter/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
update the tested instance type for IBMCloud
Version-Release number of selected component (if applicable):
4.17
How reproducible:
1. Some new instance types need to be added 2. Match the memory and CPU limitations
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://docs.openshift.com/container-platform/4.16/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html#installation-ibm-cloud-tested-machine-types_installing-ibm-cloud-customizations
Description of problem:
When an IDP name contains whitespace, it causes the oauth-server to panic if built with Go 1.22 or higher.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster with OCP 4.17 2. Create IDP with whitespaces in the name. 3. oauth-server panics.
Actual results:
oauth-server panics (if Go is at version 1.22 or higher).
Expected results:
NO REGRESSION, it worked with Go 1.21 and lower.
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/35
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The first time we try to clear the input value on the Expand PVC modal, the value is not set to zero; instead it is cleared and set to 1. We need to clear it again before the input value becomes 0.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-01-053925
How reproducible:
Always
Steps to Reproduce:
1. create a PVC with size 300MiB, and make sure it's in Bound status 2. goes to PVC details -> Actions -> Expand PVC, select the input value and press 'backspace/delete' button
Actual results:
2. the input value is set to 1
Expected results:
2. the input value should be set to 0 on a clear action
Additional info:
screenshot https://drive.google.com/file/d/1Y-FwiCndGpnR6A8ZR1V9weumBi2xzcp0/view?usp=drive_link
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/852
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/165
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The `aws-ebs-csi-driver-node-` pods appear to be failing to deploy far too often in CI recently.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
in a statistically significant pattern
Steps to Reproduce:
1. run OCP test suite many times for it to matter
Actual results:
fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times
Expected results:
Test pass
Additional info:
[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]
Description of problem:
On the OperatorHub page, the operators are not shown and the page displays the following error message: "Oh no! Something went wrong." TypeError: Description: A.reduce is not a function
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to the operator hub page on the web console. 2. 3.
Actual results:
"Oh no! Something went wrong."
Expected results:
Should list all the operators.
Additional info:
Description of problem:
When we view the list page with 'All Projects' selected, it does not show all Ingress resources.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. create Ingress under different project $ oc get ingress -A NAMESPACE NAME CLASS HOSTS ADDRESS PORTS AGE 39-3 example-2 <none> example.com 80 8m56s default example <none> example.com 80 9m43s 2. goes to Networking -> Ingresses -> choose 'All Projects' 3.
Actual results:
2. Only one Ingress resource listed
Expected results:
2. Should list Ingresses from all projects
Additional info:
Description of problem:
In cri-o, the first interface in a CNI result is used as the Pod.IP in Kubernetes. In net-attach-def client lib version 1.7.4, we use the first CNI result as the "default=true" interface as noted in the network-status. This is problematic for CNV along with OVN-K UDN, as it needs to know that the UDN interface is the default=true one.
Version-Release number of selected component (if applicable):
4.18,4.17
How reproducible:
Reproduction is only possible under specific circumstances without an entire OVN-K stack. Therefore, use https://gist.github.com/dougbtv/a97e047c9872b2a40d275bb27af85789 to validate this functionality. This requires installing a custom CNI plugin using the script in the gist named 'z-dummy-cni-script.sh': create it as /var/lib/cni/bin/dummyresult on a host, make it executable, and then make sure it's on the same node you label with multusdebug=true.
Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/178
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Observing a CI test where the metal3 Pod is deleted and allowed to recreate on another host, it took 5 attempts to start the new pod because static-ip-manager was crashlooping with the following log:
+ '[' -z 172.22.0.3/24 ']' + '[' -z enp1s0 ']' + '[' -n enp1s0 ']' ++ ip -o addr show dev enp1s0 scope global + [[ -n 2: enp1s0 inet 172.22.0.134/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0\ valid_lft 3sec preferred_lft 3sec ]] + ip -o addr show dev enp1s0 scope global + grep -q 172.22.0.3/24 ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24" + echo 'ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"' + exit 1
The error message is misleading about what is actually checked (apart from the whole subnet/subset typo). It doesn't appear this should ever work for IPv4, since we don't ever expect the Provisioning VIP to appear on the interface before we've set it. (With IPv6 this should often work thanks to an appalling and unsafe hack. Not to suggest that grepping for an IPv4 address complete with .'s in it is safe either.)
Eventually the pod does start up, with this in the log:
+ '[' -z 172.22.0.3/24 ']' + '[' -z enp1s0 ']' + '[' -n enp1s0 ']' ++ ip -o addr show dev enp1s0 scope global + [[ -n '' ]] + /usr/sbin/ip address flush dev enp1s0 scope global + /usr/sbin/ip addr add 172.22.0.3/24 dev enp1s0 valid_lft 300 preferred_lft 300
So essentially this only worked because there are no IP addresses on the provisioning interface.
In the original (error) log the machine's IP 172.22.0.134/24 has a valid lifetime of 3s, so that likely explains why it later disappears. The provisioning network is managed, so the IP address comes from dnsmasq in the former incarnation of the metal3 pod. We effectively prevent the new pod from starting until the DHCP addresses have timed out, even though we will later flush them to ensure no stale ones are left behind.
The check was originally added by https://github.com/openshift/ironic-static-ip-manager/pull/27 but that only describes what it does and not the reason. There's no linked ticket to indicate what the purpose was.
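For reference, the failing check boils down to roughly the following. This is paraphrased from the trace above; the variable names and the commented-out relaxation are assumptions, not the actual script:
# If the interface already has any global-scope address, require the
# provisioning IP to be among them; otherwise fail.
if [ -n "$(ip -o addr show dev "$PROVISIONING_INTERFACE" scope global)" ]; then
  if ! ip -o addr show dev "$PROVISIONING_INTERFACE" scope global | grep -q "$PROVISIONING_IP"; then
    echo "ERROR: \"$PROVISIONING_INTERFACE\" already has an address outside \"$PROVISIONING_IP\""
    exit 1
  fi
fi
# Possible relaxation (assumption): flush stale DHCP addresses up front instead
# of failing, since the script flushes global addresses before adding the VIP anyway.
# /usr/sbin/ip address flush dev "$PROVISIONING_INTERFACE" scope global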
Description of problem:
pseudolocalizes navigation test is failing due to https://github.com/openshift/networking-console-plugin/issues/46 and CI is blocked. We discussed this as a team and believe the best option is to remove this test so that future plugin changes do not block CI.
Description of problem:
'Remove alternate Service' button doesn't remove alternative service edit section
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. goes to Routes creation form, Networking -> Routes -> Create Route -> Form view 2. click on 'Add alternate Service' 3. click on 'Remove alternate Service'
Actual results:
3. The alternate service edit section cannot be removed, and since these fields are mandatory, the user cannot create the Route successfully unless they choose an alternate service; otherwise the user will see the error "Required value" for field "spec.alternateBackends[0].name".
Expected results:
Clicking the 'Remove alternate Service' button should remove the alternate service edit section
Additional info:
Description of problem:
according to doc https://docs.openshift.com/container-platform/4.16/storage/understanding-persistent-storage.html#pv-access-modes_understanding-persistent-storage
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-08-135628
How reproducible:
Always
Steps to Reproduce:
1. goes to PVC creation page and select a storageclass whose provisioner is `file.csi.azure.com` 2. check access mode dropdown values
Actual results:
ROX is disabled
Expected results:
ROX should be enabled, all access modes should be enabled
Additional info:
Description of problem:
When installing OpenShift 4.16 on vSphere using IPI method with a template it fails with below error: 2024-08-07T09:55:51.4052628Z "level=debug msg= Fetching Image...", 2024-08-07T09:55:51.4054373Z "level=debug msg= Reusing previously-fetched Image", 2024-08-07T09:55:51.4056002Z "level=debug msg= Fetching Common Manifests...", 2024-08-07T09:55:51.4057737Z "level=debug msg= Reusing previously-fetched Common Manifests", 2024-08-07T09:55:51.4059368Z "level=debug msg=Generating Cluster...", 2024-08-07T09:55:51.4060988Z "level=info msg=Creating infrastructure resources...", 2024-08-07T09:55:51.4063254Z "level=debug msg=Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202406251923-0/x86_64/rhcos-416.94.202406251923-0-vmware.x86_64.ova?sha256=893a41653b66170c7d7e9b343ad6e188ccd5f33b377f0bd0f9693288ec6b1b73'", 2024-08-07T09:55:51.4065349Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4066994Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4068612Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4070676Z "level=error msg=failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to use cached vsphere image: bad status: 403"
Version-Release number of selected component (if applicable):
4.16
How reproducible:
All the time in user environment
Steps to Reproduce:
1.Try to install disconnected IPI install on vSphere using a template. 2. 3.
Actual results:
No cluster installation
Expected results:
Cluster installed with indicated template
Additional info:
- 4.14 works as expected in customer environment - 4.15 works as expected in customer environment
Description of problem:
Adjust OVS Dynamic Pinning tests to hypershift. Port 7_performance_kubelet_node/cgroups.go and 7_performance_kubelet_node/kubelet.go to hypershift
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This bug is created to port test cases to 4.17 branch
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/74
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The term "Label" and "Selector" for pod selector is confusing in NetworkPolicies form. Suggestion: 1. change the term accordingly Label -> Key Selector -> Value 2. redunce the length of the input dialog
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On 4.17, ABI jobs fail with error level=debug msg=Failed to register infra env. Error: 1 error occurred: level=debug msg= * mac-interface mapping for interface eno12399np0 is missing
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-05-24-193308
How reproducible:
On Prow CI ABI jobs, always
Steps to Reproduce:
1. Generate ABI ISO starting with an agent-config file defining multiple network interfaces with `enabled: false` 2. Boot the ISO 3. Wait for error
Actual results:
Install fails with error 'mac-interface mapping for interface xxxx is missing'
Expected results:
Install completes
Additional info:
The check fails on the 1st network interface defined with `enabled: false` Prow CI ABI Job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808 agent-config.yaml: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808/artifacts/baremetal-pxe-ha-agent-ipv4-static-connected-f14/baremetal-lab-agent-install/artifacts/agent-config.yaml
Description of problem:
When you update the IngressController's Scope on PowerVS, Alibaba Cloud, or OpenStack, a Progressing status condition is added that only says: "The IngressController scope was changed from "Internal" to "External" It's missing the instructions we see on AWS which begin with "To effectuate this change, you must delete the service..." These platforms do NOT have mutable scope (meaning you must delete the service to effectuate), so the instructions should be included.
Version-Release number of selected component (if applicable):
4.12+
How reproducible:
100%
Steps to Reproduce:
1. On PowerVS, Alibaba Cloud, or OpenStack, create an IngressController 2. Now change the scope of ingresscontroller.spec.endpointPublishingStrategy.loadBalancer.scope
Actual results:
Missing "To effectuate this change, you must delete the service..." instructions
Expected results:
Should contain "To effectuate this change, you must delete the service..." instructions
Additional info:
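For context, the missing instructions point the admin at a manual step along these lines (the IngressController name "default" is just an example):
# effectuate the scope change by deleting the LoadBalancer service so it is recreated
oc -n openshift-ingress delete service/router-default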
Prometheus HTTP API provides POST endpoints to fetch metrics: https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries
Those endpoints are used in the go client: https://github.com/prometheus/client_golang/blob/main/api/prometheus/v1/api.go#L1438
So a viewer-only program/user relying on the Go client, or using these POST endpoints to fetch metrics, currently needs to create an additional Role+Binding for that purpose [1]
It would be much more convenient if that permission was directly included in the existing cluster-monitoring-view role, since it's actually used for reading.
[1]Role+Binding example
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metrics
rules:
  - verbs:
      - create
    apiGroups:
      - metrics.k8s.io
    resources:
      - pods
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metrics
subjects:
  - kind: User
    apiGroup: rbac.authorization.k8s.io
    name: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics
[internal] cf slack discussion here https://redhat-internal.slack.com/archives/C0VMT03S5/p1724684997333529?thread_ts=1715862728.898369&cid=C0VMT03S5
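A sketch of what folding that permission into the existing role could look like, assuming the final rule simply mirrors the workaround above (the real change may instead use aggregation or a different resource):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-monitoring-view
rules:
  # ...existing read-only rules...
  - apiGroups:
      - metrics.k8s.io
    resources:
      - pods
    verbs:
      - create   # lets viewer-only clients use the POST query endpoints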
Description of problem:
When testing unreleased OCP versions, the NodePool fails with:
- lastTransitionTime: "2024-11-18T07:11:20Z"
  message: 'Failed to get release image: the latest version supported is: "4.18.0". Attempting to use: "4.19.0-0.nightly-2024-11-18-064347"'
  observedGeneration: 1
  reason: ValidationFailed
  status: "False"
  type: ValidReleaseImage
We should allow skipping NodePool image validation with the hypershift.openshift.io/skip-release-image-validation annotation.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1.Try to create a NP with 4.19 payload 2. 3.
Actual results:
Expected results:
Additional info:
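A sketch of applying the proposed annotation (the namespace, NodePool name, and the "true" value are assumptions about how the skip would be expressed):
oc -n clusters annotate nodepool example-nodepool \
  hypershift.openshift.io/skip-release-image-validation="true"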
In OCPBUGS-38414, a new featuregate was turned on that didn't work correctly on metal (or at least its tests didn't). Metal should have techpreview jobs to ensure new features are tested properly. I think the right matrix is:
On standard CI jobs, we incorporate this by wiring in the appropriate FEATURE_SET variable, but metal jobs don't currently have a way to do this as far as I can tell.
These should be release informers.
https://github.com/openshift/release/blob/5ce4d77a6317479f909af30d66bc0285ffd38dbd/ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh#L63-L68 is the relevant step
Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Start last run option from the Action menu does not work on the BuildConfig details page
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create workloads with builds 2. Go to the builds page from navigation 3. Select the build config 4. Select the `Start last run` option from the Action menu
Actual results:
The option doesn't work
Expected results:
The option should work
Additional info:
Attaching video
https://drive.google.com/file/d/10shQqcFbIKfE4Jv60AxNYBXKz08EdUAK/view?usp=sharing
The hypershift team has reported a nil pointer dereference causing a crash when attempting to call the validation method on an NTO performance profile.
This was detected as the hypershift team was attempting to complete a revendoring under OSASINFRA-3643
Appears to be fallout from https://github.com/openshift/cluster-node-tuning-operator/pull/1086
Error:
--- FAIL: TestGetTuningConfig (0.02s) --- FAIL: TestGetTuningConfig/gets_a_single_valid_PerformanceProfileConfig (0.00s) panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x2ec0fc3] goroutine 329 [running]: testing.tRunner.func1.2({0x31c21e0, 0x651c8c0}) /home/emilien/sdk/go1.22.0/src/testing/testing.go:1631 +0x3f7 testing.tRunner.func1() /home/emilien/sdk/go1.22.0/src/testing/testing.go:1634 +0x6b6 panic({0x31c21e0?, 0x651c8c0?}) /home/emilien/sdk/go1.22.0/src/runtime/panic.go:770 +0x132 github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2.(*PerformanceProfile).getNodesList(0xc000fa4000) /home/emilien/git/github.com/shiftstack/hypershift/vendor/github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2/performanceprofile_validation.go:594 +0x2a3 github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2.(*PerformanceProfile).ValidateBasicFields(0xc000fa4000) /home/emilien/git/github.com/shiftstack/hypershift/vendor/github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2/performanceprofile_validation.go:132 +0x65 github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.validateTuningConfigManifest({0xc000f34a00, 0x1ee, 0x200}) /home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto.go:237 +0x307 github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.(*NodePoolReconciler).getTuningConfig(0xc000075cd8, {0x50bf5f8, 0x65e4e40}, 0xc000e05408) /home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto.go:187 +0x834 github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.TestGetTuningConfig.func1(0xc000e07a00) /home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto_test.go:459 +0x297 testing.tRunner(0xc000e07a00, 0xc000693650) /home/emilien/sdk/go1.22.0/src/testing/testing.go:1689 +0x21f created by testing.(*T).Run in goroutine 325 /home/emilien/sdk/go1.22.0/src/testing/testing.go:1742 +0x826
Description of problem:
Specify long cluster name in install-config, ============== metadata: name: jima05atest123456789test123 Create cluster, installer exited with below error: 08-05 09:46:12.788 level=info msg=Network infrastructure is ready 08-05 09:46:12.788 level=debug msg=Creating storage account 08-05 09:46:13.042 level=debug msg=Collecting applied cluster api manifests... 08-05 09:46:13.042 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating storage account jima05atest123456789tsh586sa: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima05atest123456789t-sh586-rg/providers/Microsoft.Storage/storageAccounts/jima05atest123456789tsh586sa 08-05 09:46:13.042 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.042 level=error msg=RESPONSE 400: 400 Bad Request 08-05 09:46:13.043 level=error msg=ERROR CODE: AccountNameInvalid 08-05 09:46:13.043 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.043 level=error msg={ 08-05 09:46:13.043 level=error msg= "error": { 08-05 09:46:13.043 level=error msg= "code": "AccountNameInvalid", 08-05 09:46:13.043 level=error msg= "message": "jima05atest123456789tsh586sa is not a valid storage account name. Storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only." 08-05 09:46:13.043 level=error msg= } 08-05 09:46:13.043 level=error msg=} 08-05 09:46:13.043 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.043 level=error 08-05 09:46:13.043 level=info msg=Shutting down local Cluster API controllers... 08-05 09:46:13.298 level=info msg=Stopped controller: Cluster API 08-05 09:46:13.298 level=info msg=Stopped controller: azure infrastructure provider 08-05 09:46:13.298 level=info msg=Stopped controller: azureaso infrastructure provider 08-05 09:46:13.298 level=info msg=Shutting down local Cluster API control plane... 08-05 09:46:15.177 level=info msg=Local Cluster API system has completed operations See azure doc[1], the naming rules on storage account name, it must be between 3 and 24 characters in length and may contain numbers and lowercase letters only. The prefix of storage account created by installer seems changed to use infraID with CAPI-based installation, it's "cluster" when installing with terraform. Is it possible to change back to use "cluster" as sa prefix to keep consistent with terraform? because there are several storage accounts being created once cluster installation is completed. One is created by installer starting with "cluster", others are created by image-registry starting with "imageregistry". And QE has some CI profiles[2] and automated test cases relying on installer sa, need to search prefix with "cluster", and not sure if customer also has similar scenarios. [1] https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview [2] https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L241
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Similar to the work done for AWS STS and Azure WIF support, the console UI (specifically OperatorHub) needs to:
CONSOLE-3776 added filtering for the GCP WIF case to the OperatorHub tile view. Part of the change was also checking for the annotation which indicates that the operator supports GCP's WIF:
features.operators.openshift.io/token-auth-gcp: "true"
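For reference, an operator would typically advertise this via its ClusterServiceVersion metadata, roughly like the following (the operator name is illustrative):
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0
  annotations:
    features.operators.openshift.io/token-auth-gcp: "true"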
AC:
Please review the following PR: https://github.com/openshift/oc/pull/1870
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The AWS EFS CSI Operator primarily passes credentials to the CSI driver using environment variables. However, this practice is discouraged by the OCP Hardening Guide.
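A minimal sketch of the file-based alternative the hardening guidance prefers for the driver pod spec. The Secret name, mount path, and the use of AWS_SHARED_CREDENTIALS_FILE are assumptions, not the operator's actual wiring:
spec:
  volumes:
    - name: aws-credentials
      secret:
        secretName: aws-efs-cloud-credentials
  containers:
    - name: csi-driver
      volumeMounts:
        - name: aws-credentials
          mountPath: /var/run/secrets/aws
          readOnly: true
      env:
        - name: AWS_SHARED_CREDENTIALS_FILE   # AWS SDKs read the shared credentials file from this path
          value: /var/run/secrets/aws/credentials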
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/50
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-vsphere-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Attempting to Migrate from OpenShiftSDN to OVNKubernetes but experiencing the below Error once the Limited Live Migration is started.
+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h I0829 14:06:20.313928 82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf I0829 14:06:20.314202 82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 
ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}} F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
The OpenShift Container Platform 4 cluster has been installed with the below configuration, and therefore the clusterNetwork conflicts with the Join Subnet of OVNKubernetes.
$ oc get cm -n kube-system cluster-config-v1 -o yaml
apiVersion: v1
data:
install-config: |
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: sandbox1730.opentlc.com
compute:
- architecture: amd64
hyperthreading: Enabled
name: worker
platform: {}
replicas: 3
controlPlane:
architecture: amd64
hyperthreading: Enabled
name: master
platform: {}
replicas: 3
metadata:
creationTimestamp: null
name: nonamenetwork
networking:
clusterNetwork:
- cidr: 100.64.0.0/15
hostPrefix: 23
machineNetwork:
- cidr: 10.241.0.0/16
networkType: OpenShiftSDN
serviceNetwork:
- 198.18.0.0/16
platform:
aws:
region: us-east-2
publish: External
pullSecret: ""
So following the procedure, the below steps were executed but still the problem is being reported.
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'
Checking whether the change was applied, one can see it is there/configured.
$ oc get network.operator cluster -o yaml apiVersion: operator.openshift.io/v1 kind: Network metadata: creationTimestamp: "2024-08-29T10:05:36Z" generation: 376 name: cluster resourceVersion: "135345" uid: 37f08c71-98fa-430c-b30f-58f82142788c spec: clusterNetwork: - cidr: 100.64.0.0/15 hostPrefix: 23 defaultNetwork: openshiftSDNConfig: enableUnidling: true mode: NetworkPolicy mtu: 8951 vxlanPort: 4789 ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipv4: {} ipv6: {} routingViaHost: false genevePort: 6081 ipsecConfig: mode: Disabled ipv4: internalJoinSubnet: 100.68.0.0/16 mtu: 8901 policyAuditConfig: destination: "null" maxFileSize: 50 maxLogFiles: 5 rateLimit: 20 syslogFacility: local0 type: OpenShiftSDN deployKubeProxy: false disableMultiNetwork: false disableNetworkDiagnostics: false kubeProxyConfig: bindAddress: 0.0.0.0 logLevel: Normal managementState: Managed migration: mode: Live networkType: OVNKubernetes observedConfig: null operatorLogLevel: Normal serviceNetwork: - 198.18.0.0/16 unsupportedConfigOverrides: null useMultiNetworkPolicy: false
Following the above, the Limited Live Migration is triggered, which then suddenly stops because of the error shown.
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.9
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4 with OpenShiftSDN, the configuration shown above and then update to OpenShift Container Platform 4.16
2. Change internalJoinSubnet to prevent a conflict with the Join Subnet of OVNKubernetes (oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}')
3. Initiate the Limited Live Migration running oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
4. Check the logs of ovnkube-node using oc logs ovnkube-node-XXXXX -c ovnkube-controller
Actual results:
Same hybrid-overlay-node output as shown in the description above, ending with: F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
Expected results:
OVNKubernetes Limited Live Migration to recognize the change applied for internalJoinSubnet and don't report any CIDR/Subnet overlap during the OVNKubernetes Limited Live Migration
Additional info:
N/A
Affected Platforms:
OpenShift Container Platform 4.16 on AWS
Description of problem:
TestIngressControllerNamespaceSelectorUpdateShouldClearRouteStatus failed due to a previously seen issue with using an outdated IngressController object on update: router_status_test.go:248: failed to update ingresscontroller: Operation cannot be fulfilled on ingresscontrollers.operator.openshift.io "ic-namespace-selector-test": the object has been modified; please apply your changes to the latest version and try again
Version-Release number of selected component (if applicable):
4.12-4.17
How reproducible:
<5% (Seen only once)
Steps to Reproduce:
1. Run TestIngressControllerNamespaceSelectorUpdateShouldClearRouteStatus on a busy cluster with other tests in parallel until it fails
Actual results:
Flake
Expected results:
No flake
Additional info:
"7 runs, 57% failed, 25% of failures match = 14% impact"
I think we should address all possible "Operation cannot be fulfilled on ingresscontroller" flakes together.
Description of problem:
webpack dependency in @openshift-console/dynamic-plugin-sdk-webpack package is listed as "5.75.0" i.e. not a semver range but an exact version.
If a plugin project updates its webpack dependency to a newer version, it may cause the package manager to not hoist node_modules/@openshift/dynamic-plugin-sdk-webpack (which is a dependency of the ☝️ package) which then causes problems during the webpack build.
Steps to Reproduce:
1. git clone https://github.com/kubevirt-ui/kubevirt-plugin 2. modify webpack dependency in package.json to a newer version 3. yarn install # missing node_modules/@openshift/dynamic-plugin-sdk-webpack 4. yarn build # results in build errors due to ^^
Actual results:
Build errors due to missing node_modules/@openshift/dynamic-plugin-sdk-webpack
Expected results:
No build errors
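A sketch of the likely fix on the SDK side, declaring webpack as a semver range so consumer projects can hoist a compatible version. Whether a caret range is the exact choice is an assumption; this is a fragment of the SDK package's package.json:
{
  "dependencies": {
    "webpack": "^5.75.0"
  }
}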
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/76
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The quorum checker is becoming very difficult to maintain, and we're having a lot more problems with concurrent controllers, as identified in OCPBUGS-31849.
To avoid plastering the code in all places where a revision rollout could happen, we should invert the control and tell the revision controller when we do not want to have a rollout at all.
Links to some of the discussions:
AC:
Add precondition to the revision controller - this would halt the whole revision process
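As a rough illustration of that inversion of control (a hypothetical shape, not the actual library-go revision controller API), the sync loop could consult precondition hooks before creating any new revision:

package example

import "context"

// PreconditionFunc reports whether a new revision may be rolled out; returning
// false halts the whole revision process for this sync.
type PreconditionFunc func(ctx context.Context) (bool, error)

type revisionController struct {
	preconditions []PreconditionFunc
}

// sync consults every precondition first; if any of them (for example a
// quorum-safety check) says no, the controller skips creating a revision
// instead of every caller having to guard rollouts individually.
func (c *revisionController) sync(ctx context.Context) error {
	for _, pre := range c.preconditions {
		ok, err := pre(ctx)
		if err != nil {
			return err
		}
		if !ok {
			return nil // rollout intentionally halted
		}
	}
	return c.createRevisionIfNeeded(ctx)
}

// createRevisionIfNeeded stands in for the existing revision-creation logic.
func (c *revisionController) createRevisionIfNeeded(ctx context.Context) error {
	return nil
}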
Description of problem:
The following logs are from namespaces/openshift-apiserver/pods/apiserver-6fcd57c747-57rkr/openshift-apiserver/openshift-apiserver/logs/current.log
2024-06-06T15:57:06.628216833Z E0606 15:57:06.628186 1 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 139.823053ms, panicked: true, err: <nil>, panic-reason: runtime error: invalid memory address or nil pointer dereference 2024-06-06T15:57:06.628216833Z goroutine 192790 [running]: 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:105 +0xa5 2024-06-06T15:57:06.628216833Z panic({0x498ac60?, 0x74a51c0?}) 2024-06-06T15:57:06.628216833Z runtime/panic.go:914 +0x21f 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).importImages(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0xc07055f4a0, 0xc0a2487600) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:263 +0x1cf5 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).Import(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0x0?, 0x0?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:110 +0x139 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport.(*REST).Create(0xc0033b2240, {0x5626bb0, 0xc0a50c7dd0}, {0x5600058?, 0xc07055f4a0?}, 0xc08e0b9ec0, 0x56422e8?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport/rest.go:337 +0x1574 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.(*namedCreaterAdapter).Create(0x55f50e0?, {0x5626bb0?, 0xc0a50c7dd0?}, {0xc0b5704000?, 0x562a1a0?}, {0x5600058?, 0xc07055f4a0?}, 0x1?, 0x2331749?) 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:254 +0x3b 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:184 +0xc6 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.2() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:209 +0x39e 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:117 +0x84
Version-Release number of selected component (if applicable):
We applied it to all clusters in CI and checked 3 of them; all 3 share the same errors.
oc --context build09 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.3   True        False         3d9h    Error while reconciling 4.16.0-rc.3: the cluster operator machine-config is degraded
oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.2   True        False         15d     Error while reconciling 4.16.0-rc.2: the cluster operator machine-config is degraded
oc --context build03 get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.16   True        False         34h     Error while reconciling 4.15.16: the cluster operator machine-config is degraded
How reproducible:
We applied this PR https://github.com/openshift/release/pull/52574/files to the clusters.
It breaks at least 3 of them.
"qci-pull-through-cache-us-east-1-ci.apps.ci.l2s4.p1.openshiftapps.com" is a registry cache server https://github.com/openshift/release/blob/master/clusters/app.ci/quayio-pull-through-cache/qci-pull-through-cache-us-east-1.yaml
Additional info:
There are lots of image imports in OpenShift CI jobs.
It feels like the registry cache server returns unexpected results to the openshift-apiserver:
2024-06-06T18:13:13.781520581Z E0606 18:13:13.781459 1 strategy.go:60] unable to parse manifest for "sha256:c5bcd0298deee99caaf3ec88de246f3af84f80225202df46527b6f2b4d0eb3c3": unexpected end of JSON input
Our theory is that the import requests from all CI clusters crashed the cache server, and it sent some unexpected data which caused the apiserver to panic.
The expected behaviour is that if the image cannot be pulled from the first mirror in the ImageDigestMirrorSet, then it fails over to the next one.
Description of problem:
Navigation: Storage -> StorageClasses -> Create StorageClass -> Provisioner -> kubernetes.io/gce-pd
Issue: "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-01-063526
How reproducible:
Always
Steps to Reproduce:
1. Log into web console and set language to non en_US
2. Navigate to Storage -> StorageClasses -> Create StorageClass -> Provisioner
3. Select Provisioner "kubernetes.io/gce-pd"
4. "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English
Actual results:
Content is in English
Expected results:
Content should be in set language.
Additional info:
Screenshot reference attached
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/313
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We added the following to openstack-kubelet-nodename.service: https://github.com/openshift/machine-config-operator/pull/4570
But wait-for-br-ex-up.service is disabled, so it doesn't normally do anything. This is why it doesn't break anything on other platforms, even though it's never going to work the way we are currently configuring workers for HCP.
However, this Wants directive enables it when openstack-kubelet-nodename is added to the systemd transaction, so adding it broke us. "Wants" adds it to the transaction, and it hangs. If it failed it would be fine, but it doesn't. It also adds a RequiredBy on node-valid-hostname.
"br-ex" is up, but that doesn't matter because that's not what it's testing. It's testing that /run/nodeip-configuration/br-ex-up exists, which it won't, because it's written by /etc/NetworkManager/dispatcher.d/30-resolv-prepender, which is empty.
Version-Release number of selected component (if applicable):
4.18
Component Readiness has found a potential regression in the following test:
[sig-node] node-lifecycle detects unexpected not ready node
Extreme regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 84.62%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-29T00:00:00Z
End Time: 2024-11-05T23:59:59Z
Success Rate: 84.62%
Successes: 33
Failures: 6
Flakes: 0
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 100.00%
Successes: 79
Failures: 0
Flakes: 0
The cns-migration tool should check for supported versions of vCenter before starting the migration of CNS volumes.
Description of problem:
The node-joiner tool does not honour additionalNTPSources. As mentioned in https://docs.openshift.com/container-platform/4.16/installing/installing_with_agent_based_installer/installation-config-parameters-agent.html, setting additionalNTPSources is possible when adding nodes at day 1, but the setting is not honoured at day 2.
How reproducible:
always
Steps to Reproduce:
Create an agent config with:
additionalNTPSources:
- 10.10.10.10
- 10.10.10.11
hosts:
- hostname: extra-worker-0
  interfaces:
  - name: eth0
    macAddress: 0xDEADBEEF
- hostname: extra-worker-1
  interfaces:
  - name: eth0
    macAddress: 00:02:46:e3:9e:8c
- hostname: 0xDEADBEEF
  interfaces:
  - name: eth0
    macAddress: 0xDEADBEEF
Actual results:
NTP on the added node cannot synchronize with the NTP server. ntp-synced Status:failure Message:Host couldn't synchronize with any NTP server
Expected results:
NTP on the added node can contact the NTP server.
Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/574
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The version page in our docs is out of date and needs to be updated with the current versioning standards we expect.
The minimum OCP management cluster / Kubernetes version needs to be added.
Please review the following PR: https://github.com/openshift/images/pull/192
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In an effort to ensure all HA components are not degraded by design during normal e2e tests or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run. This card captures the image-registry operator, which blips Degraded=True during upgrade runs.
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-upgrade/1843366977876267008
Reasons associated with the blip: ProgressDeadlineExceeded, NodeCADaemonControllerError
For now, we put an exception in the test, but it is expected that teams take action to fix these and remove the exceptions after the fix goes in. The exception can be found here: https://github.com/openshift/origin/blob/fd6fe36319c39b51ab0f02ecb8e2777c0e1bb210/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L319
See the linked issue for more explanation of the effort.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After click "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines, if now click "Lightspeed" popup button at the right bottom, the highlighted rectangle lines lay above the popup modal.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-150616
How reproducible:
Always
Steps to Reproduce:
1.Clicked "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines. At the same time, click "Lightspeed" popup button at the right bottom. 2. 3.
Actual results:
1. The highlighted rectangular outline lies above the popup modal. Screenshot: https://drive.google.com/drive/folders/15te0dbavJUTGtqRYFt-rM_U8SN7euFK5?usp=sharing
Expected results:
1. The Lightspeed popup modal should be on the top layer.
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The openshift-ingress/router-default never stops reconciling in the ingress operator.
2024-08-22T15:59:22.789Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:22.799Z INFO operator.status_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:22.868Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"6244\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T15:59:22.884Z ERROR operator.ingress_controller controller/controller.go:114 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)"} 2024-08-22T15:59:24.820Z INFO operator.ingress_controller handler/enqueue_mapped.go:103 queueing ingress {"name": "default", "related": ""} 2024-08-22T15:59:24.820Z INFO operator.ingress_controller handler/enqueue_mapped.go:103 queueing ingress {"name": "default", "related": ""} 2024-08-22T15:59:24.820Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.887Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T15:59:24.911Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.911Z INFO operator.status_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.911Z INFO operator.certificate_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.911Z INFO operator.ingressclass_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.913Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.924Z INFO operator.status_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.984Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T15:59:43.457Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:43.539Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T16:01:07.866Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.866Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.866Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T16:01:07.870Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.870Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.870Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T16:01:07.899Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.899Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.899Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T16:01:08.957Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:08.957Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:08.957Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
Version-Release number of selected component (if applicable):
4.17
The diff is:
❯ cat /tmp/msg.json | jq -r '.diff' &v1.Deployment{ TypeMeta: {}, ObjectMeta: {Name: "router-default", Namespace: "openshift-ingress", UID: "6cf98392-8782-4741-b5c9-ce63fb77879a", ResourceVersion: "6244", ...}, Spec: v1.DeploymentSpec{ Replicas: &1, Selector: &{MatchLabels: {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default"}}, Template: {ObjectMeta: {Labels: {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default", "ingresscontroller.operator.openshift.io/hash": "9c69cc8d"}, Annotations: {"target.workload.openshift.io/management": `{"effect": "PreferredDuringScheduling"}`}}, Spec: {Volumes: {{Name: "default-certificate", VolumeSource: {Secret: &{SecretName: "default-ingress-cert", DefaultMode: &420}}}, {Name: "service-ca-bundle", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: "service-ca-bundle"}, Items: {{Key: "service-ca.crt", Path: "service-ca.crt"}}, DefaultMode: &420, Optional: &false}}}, {Name: "stats-auth", VolumeSource: {Secret: &{SecretName: "router-stats-default", DefaultMode: &420}}}, {Name: "metrics-certs", VolumeSource: {Secret: &{SecretName: "router-metrics-certs-default", DefaultMode: &420}}}}, Containers: {{Name: "router", Image: "registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f"..., Ports: {{Name: "http", ContainerPort: 80, Protocol: "TCP"}, {Name: "https", ContainerPort: 443, Protocol: "TCP"}, {Name: "metrics", ContainerPort: 1936, Protocol: "TCP"}}, Env: {{Name: "DEFAULT_CERTIFICATE_DIR", Value: "/etc/pki/tls/private"}, {Name: "DEFAULT_DESTINATION_CA_PATH", Value: "/var/run/configmaps/service-ca/service-ca.crt"}, {Name: "RELOAD_INTERVAL", Value: "5s"}, {Name: "ROUTER_ALLOW_WILDCARD_ROUTES", Value: "false"}, ...}, ...}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &3600, ...}}, Strategy: v1.DeploymentStrategy{ - Type: "RollingUpdate", + Type: "", - RollingUpdate: s"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}", + RollingUpdate: nil, }, MinReadySeconds: 30, RevisionHistoryLimit: &10, ... // 2 identical fields }, Status: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...}, }
Description of problem:
Based on the results in [Sippy|https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Etcd&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-19%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-13%2000%3A00%3A00&testId=Operator%20results%3A45d55df296fbbfa7144600dce70c1182&testName=operator%20conditions%20etcd], it appears that the periodic tests are not waiting for the etcd operator to complete before exiting. The test is supposed to wait for up to 20 mins after the final control plane machine is rolled, to allow operators to settle. But we are seeing the etcd operator triggering 2 further revisions after this happens. We need to understand if the etcd operator is correctly rolling out vs whether these changes should have rolled out prior to the final machine going away, and, understand if there's a way to add more stability to our checks to make sure that all of the operators stabilise, and, that they have been stable for at least some period (1 minute)
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
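For illustration only, a minimal sketch (not taken from the actual test suite) of requiring a stability condition to hold continuously for one minute before declaring the operators settled:

package example

import (
	"context"
	"time"
)

// waitStableFor polls check and only returns nil once it has reported true
// continuously for the whole stableFor window; any false reading resets the
// window. ctx bounds the overall wait.
func waitStableFor(ctx context.Context, check func(context.Context) (bool, error), stableFor, interval time.Duration) error {
	var stableSince time.Time
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		ok, err := check(ctx)
		if err != nil {
			return err
		}
		switch {
		case !ok:
			stableSince = time.Time{} // reset the stability window
		case stableSince.IsZero():
			stableSince = time.Now()
		case time.Since(stableSince) >= stableFor:
			return nil
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}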
Description of problem:
On the NetworkPolicies page, the position of the title and the tabs does not match other pages. It should have the same style as the others: move the title above the tabs.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If an instance type is specified in the install-config.yaml, the installer will try to validate its availability in the given region and that it meets the minimum requirements for OCP nodes. When that happens, the `ec2:DescribeInstanceTypes` permission is used, but it is not validated by the installer as a required permission for installs.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
Always by setting an instanceType in the install-config.yaml
Steps to Reproduce:
1. 2. 3.
Actual results:
If you install with an user with minimal permissions, you'll get the error: level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.aws: Internal error: error listing instance types: fetching instance types: UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-8phprrsm-ccf9a-minimal-perm is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action level=error msg= status code: 403, request id: 559344f4-0fc3-4a6c-a6ee-738d4e1c0099, compute[0].platform.aws: Internal error: error listing instance types: fetching instance types: UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-8phprrsm-ccf9a-minimal-perm is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action level=error msg= status code: 403, request id: 584cc325-9057-4c31-bb7d-2f4458336605]
Expected results:
The installer fails with an explicit message saying that `ec2:DescribeInstanceTypes` is required.
Additional info:
Description of problem:
The cluster policy controller does not get the same feature flags that other components in the control plane are getting.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Create hosted cluster
2. Get cluster-policy-controller-config configmap from control plane namespace
Actual results:
Default feature gates are not included in the config
Expected results:
Feature gates are included in the config
Additional info:
This E2E test checks whether etcd is able to block the rollout of a new revision when quorum is not safe.
Description of problem:
e980 is a valid system type for the Madrid region, but it is not listed as such in the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy to mad02 with SysType set to e980
2. Fail
3.
Actual results:
Installer exits
Expected results:
Installer should continue as it's a valid system type.
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When the user sets the Chinese language and checks the OpenShift Lightspeed nav modal, "Meet OpenShift Lightspeed" is translated as "OpenShift Lightspeed"; "Meet" is not translated.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-10-133523
How reproducible:
Always
Steps to Reproduce:
1. When Chinese language is set, check the "Meet OpenShift Lightspeed" on OpenShift Lightspeed nav modal. 2. 3.
Actual results:
1. The "Meet OpenShift Lightspeed" is translated to "OpenShift Lightspeed", "Meet" is not translated.
Expected results:
1. "Meet" could be translated in Chinese. It has been translated for other languages.
Additional info:
The HyperShift codebase has numerous examples of MustParse*() functions being used on non-constant input. This is not their intended use, as any failure will cause a panic in the controller.
In a few cases they are called on user-provided input, meaning any authenticated user can (intentionally or unintentionally) deny service to all other users by providing invalid input that continuously crashes the HostedCluster controller.
This is probably a security issue, but as I have already described it in https://github.com/openshift/hypershift/pull/4546 there is no reason to embargo it.
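As a minimal illustration of the distinction (a sketch using the standard library's net/netip, not the HyperShift code itself): MustParse-style helpers are meant for constant, compile-time-known input, while user-supplied values should go through the error-returning variants.

package main

import (
	"fmt"
	"net/netip"
)

// Fine: the input is a constant the author controls, so a panic here would
// only ever indicate a programming error.
var serviceCIDR = netip.MustParsePrefix("10.96.0.0/12")

// parseUserCIDR handles user-provided input with the error-returning variant
// so an invalid value surfaces as a normal error instead of crashing the
// controller.
func parseUserCIDR(s string) (netip.Prefix, error) {
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return netip.Prefix{}, fmt.Errorf("invalid CIDR %q: %w", s, err)
	}
	return p, nil
}

func main() {
	fmt.Println(serviceCIDR)
	if _, err := parseUserCIDR("not-a-cidr"); err != nil {
		fmt.Println("rejected:", err) // reported, no panic
	}
}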
Description of problem:
oc-mirror should not panic when it fails to get the release signature.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) Mirror2disk + disk2mirror with the following imagesetconfig, and mirror to the enterprise registry:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.18'
      maxVersion: '4.15.18'
2) Set up squid with a whitelist containing only the enterprise registry and the OSUS service:
cat /etc/squid/squid.conf
http_port 3128
coredump_dir /var/spool/squid
acl whitelist dstdomain "/etc/squid/whitelist"
http_access allow whitelist
http_access deny !whitelist
cat /etc/squid/whitelist
my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com ------------- registry route (oc get route -n your registry app's project)
update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-88.qe.devcluster.openshift.com --- OSUS route (oc get route -n openshift-update-service)
sudo systemctl restart squid
export https_proxy=http://127.0.0.1:3128
export http_proxy=http://127.0.0.1:3128
3) Set the registry redirect with:
cat ~/.config/containers/registries.conf
[[registry]]
location = "quay.io"
insecure = false
blocked = false
mirror-by-digest-only = false
prefix = ""
[[registry.mirror]]
location = "my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com"
insecure = false
4) Use the same imagesetconfig and mirror to a new folder:
`oc-mirror -c config-38037.yaml file://new-folder --v2`
Actual results:
4) The oc-mirror command panics with the following error:
I0812 06:45:26.026441 199941 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-417.qe.devcluster.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=a6097264-8b29-438f-9e71-4aba1e9ec32d
2024/08/12 06:45:26 [ERROR] : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=0f55261077557d1bb909c06b115e0c79b0025677be57ba2f045495c11e2443ee/signature-1": Forbidden
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3d1e3f6]
goroutine 1 [running]:
github.com/openshift/oc-mirror/v2/internal/pkg/release.SignatureSchema.GenerateReleaseSignatures({
, {0x4c7b348, 0x15}, {0xc000058c60, 0x1c, {...}, {...}, {...}, {...}, ...}, ..., ...}, ...)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/signature.go:97 +0x676
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*CincinnatiSchema).GetReleaseReferenceImages(0xc0007203c0, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/cincinnati.go:230 +0x70b
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*LocalStorageCollector).ReleaseImageCollector(0xc000b12388, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/local_stored_collector.go:58 +0x407
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).CollectAll(0xc000ae8908, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:955 +0x122
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).RunMirrorToDisk(0xc000ae8908, 0xc0005f3b08, {0xa?, 0x20?, 0x20?})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:707 +0x1aa
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).Run(0xc000ae8908, 0xc0005f1640?, {0xc0005f1640?, 0x0?, 0x0?})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:459 +0x149
github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc0005f3b08, {0xc0005f1640, 0x1, 0x4})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:207 +0x32a
github.com/spf13/cobra.(*Command).execute(0xc0005f3b08, {0xc000166010, 0x4, 0x4})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005f3b08)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x741ec38?)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13
main.main()
/home/fedora/yinzhou/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
Expected results:
The command may fail, but it should not panic.
Description of problem:
Before the fix for https://issues.redhat.com/browse/OCPBUGS-42253 is merged upstream and propagated, we can apply a temporary fix directly in the samples operator repo, unblocking us from the need to wait for that to happen.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. oc new-app openshift/rails-postgresql-example 2. 3.
Actual results:
app pod in crash loop
Expected results:
app working
Additional info:
Description of the problem:
BE 2.35.1 - OCP 4.17 ARM64 cluster - Selecting CNV in UI throws the following error:
Local Storage Operator is not available when arm64 CPU architecture is selected
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
This PR introduces graceful shutdown functionality to the Multus daemon by adding a /readyz endpoint alongside the existing /healthz. The /readyz endpoint starts returning 500 once a SIGTERM is received, indicating the daemon is in shutdown mode. During this time, CNI requests can still be processed for a short window. The daemonset configs have been updated to increase terminationGracePeriodSeconds from 10 to 30 seconds, ensuring we have a bit more time for these clean shutdowns. This addresses a race condition during pod transitions where the readiness check might return true, but a subsequent CNI request could fail if the daemon shuts down too quickly. By introducing the /readyz endpoint and delaying the shutdown, we can handle ongoing CNI requests more gracefully, reducing the risk of disruptions during critical transitions. A rough sketch of this pattern is shown after this item.
Version-Release number of selected component (if applicable):
How reproducible:
Difficult to reproduce, might require CI signal
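A rough sketch of the readiness-flip pattern described in this item (an assumed shape, not the actual Multus daemon code); the port and drain window below are only examples:

package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	// /healthz keeps reporting liveness; /readyz flips to 500 once SIGTERM
	// arrives so new work stops being sent while in-flight CNI requests are
	// still served.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if shuttingDown.Load() {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		<-sigCh
		shuttingDown.Store(true)
		// Keep serving for a short drain window, well within the 30s
		// terminationGracePeriodSeconds, then exit.
		time.Sleep(20 * time.Second)
		os.Exit(0)
	}()

	_ = http.ListenAndServe(":8080", nil) // example port
}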
I talked with Gerd Oberlechner; the hack/app-sre/saas_template.yaml file is not used anymore in app-interface.
It should be safe to remove this.
Please review the following PR: https://github.com/openshift/cluster-api-provider-metal3/pull/21
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Rebase openshift/etcd to latest 3.5.16 upstream release.
Description of problem:
The bug fix for https://issues.redhat.com/browse/OCPBUGS-41184 introduced a machine type validation error.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-10-14-021053
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", and then insert the machine type settings (see [1]) 2. "create manifests" (or "create cluster")
Actual results:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.gcp.type: Not found: "custom", compute[0].platform.gcp.type: Not found: "custom"]
Expected results:
Success
Additional info:
FYI the 4.17 PROW CI test failure: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-mini-perm-custom-type-f28/1845589157397663744
The Telemetry userPreference added to the General tab in https://github.com/openshift/console/pull/13587 results in empty nodes being output to the DOM. This results in extra spacing any time a new user preference is added to the bottom of the General tab.
Description of problem:
The issue comes from https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25386451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25386451. An error message is shown when gathering the bootstrap log bundle even though the log bundle gzip file is generated: ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected.
Version-Release number of selected component (if applicable):
4.17+
How reproducible:
Always
Steps to Reproduce:
1. Run `openshift-install gather bootstrap --dir <install-dir>` 2. 3.
Actual results:
Error message shown in output of command `openshift-install gather bootstrap --dir <install-dir>`
Expected results:
No error message shown there.
Additional info:
Analysis from Rafael, https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25387767&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25387767
After the multi-VC changes were merged, when we use this tool the following warnings get logged:
E0812 13:04:34.813216 13159 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors: line 1: cannot unmarshal !!seq into config.CommonConfigYAML I0812 13:04:34.813376 13159 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.
This looks a bit scarier than it should.
Description of problem:
storageNotConfiguredMessage contains a link to https://docs.openshift.com/container-platform/%s/monitoring/configuring-the-monitoring-stack.html, which leads to a 404; it needs to be changed to https://docs.openshift.com/container-platform/%s/observability/monitoring/configuring-the-monitoring-stack.html
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
The fields in the Shipwright build form show no hints or default values. They should provide examples and hints to help users provide correct values when creating a build.
For example:
Description of problem:
IHAC running a 4.16.1 OCP cluster. In their cluster the image-registry pod is restarting with the messages below:
message: "/image-registry/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:7629 +0x1d0\ngithub.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000a3c120, {0x28924c0, 0xc0001f5b20}, 0xc00083bab8, {0xc00125b7d1, 0x20}, {0x2866860, 0x1}, 0xc00120a8d0)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1135 +0x348\ngithub.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).Walk(0xc000675ec0?, {0x28924c0, 0xc0001f5b20}, {0xc000675ec0, 0x20}, 0xc00083bc10?)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1095 +0x148\ngithub.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk(0xc000519480, {0x2892778?, 0xc00012cf00?}, {0xc000675ec0, 0x20}, 0x1?)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/base/base.go:237 +0x237\ngithub.com/distribution/distribution/v3/registry/storage.getOutstandingUploads({0x2892778, 0xc00012cf00}, {0x289d728?, 0xc000519480})\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:70 +0x1f9\ngithub.com/distribution/distribution/v3/registry/storage.PurgeUploads({0x2892778, 0xc00012cf00}, {0x289d728?, 0xc000519480?}, {0xc1a937efcf6aec96, 0xfffddc8e973b8a89, 0x3a94520}, 0x1)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:34 +0x12d\ngithub.com/distribution/distribution/v3/registry/handlers.startUploadPurger.func1()\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:1139 +0x33f\ncreated by github.com/distribution/distribution/v3/registry/handlers.startUploadPurger in goroutine 1\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:1127 +0x329\n" reason: Error startedAt: "2024-08-27T09:08:14Z" name: registry ready: true restartCount: 250 started: true
Version-Release number of selected component (if applicable):
4.16.1
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
all the pods are restarting
Expected results:
It should not restart.
Additional info:
https://redhat-internal.slack.com/archives/C013VBYBJQH/p1724761756273879 upstream report: https://github.com/distribution/distribution/issues/4358
Service: sorting of the Labels, Pod selector, and Location columns doesn't work
Routes: sorting of all columns doesn't work
Ingress: sorting of the Host column doesn't work
Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/426
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Add a new monitor test: API unreachable interval from the client perspective.
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/559
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/364
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
4.15 and 4.16
$ oc explain prometheus.spec.remoteWrite.sendExemplars
GROUP: monitoring.coreos.com
KIND: Prometheus
VERSION: v1
FIELD: sendExemplars <boolean>
DESCRIPTION:
Enables sending of exemplars over remote write. Note that exemplar-storage itself must be enabled using the `spec.enableFeature` option for exemplars to be scraped in the first place. It requires Prometheus >= v2.27.0.
no `spec.enableFeature` option
$ oc explain prometheus.spec.enableFeature
GROUP: monitoring.coreos.com
KIND: Prometheus
VERSION: v1
error: field "enableFeature" does not exist
should be `spec.enableFeatures`
$ oc explain prometheus.spec.enableFeatures
GROUP: monitoring.coreos.com
KIND: Prometheus
VERSION: v1
FIELD: enableFeatures <[]string>
DESCRIPTION:
Enable access to Prometheus feature flags. By default, no features are enabled. Enabling features which are disabled by default is entirely outside the scope of what the maintainers will support and by doing so, you accept that this behaviour may break at any time without notice. For more information see https://prometheus.io/docs/prometheus/latest/feature_flags/
Version-Release number of selected component (if applicable):
4.15 and 4.16
How reproducible:
always
Description of problem:
When a user is trying to deploy a Hosted Cluster using HyperShift, if the hostedCluster CR defines, under Spec.Configuration.Proxy.HTTPSProxy, a proxy URL that is missing the port (because it uses the default port), this value is passed by this code into the "kube-apiserver-proxy" yaml manifest under spec.containers.command, like below:
$ oc get pod -n kube-system kube-apiserver-proxy-xxxxx -o yaml | yq '.spec.containers[].command'
[
  "control-plane-operator",
  "kubernetes-default-proxy",
  "listen-addr=172.20.0.1:6443",
  "proxy-addr=example.proxy.com",
  "-apiserver-addr=<apiserver-IP>:<port>"
]
Then this code will parse these values.
This command has these flags, which will be used by the container to do the API calls.
The net.Dial function from the Go net package expects host/IP:port. Check the docs here: https://pkg.go.dev/net#Dial
For TCP and UDP networks, the address has the form "host:port". The host must be a literal IP address, or a host name that can be resolved to IP addresses. The port must be a literal port number or a service name.
So the pod will end up having this issue:
2024-08-19T06:55:44.831593820Z {"level":"error","ts":"2024-08-19T06:55:44Z","logger":"kubernetes-default-proxy","msg":"failed diaing backend","proxyAddr":"example.proxy.com","error":"dial tcp: address example.proxy.com: missing port in address","stacktrace":"github.com/openshift/hypershift/kubernetes-default-proxy.(*server).run.func1\n\t/hypershift/kubernetes-default-proxy/kubernetes_default_proxy.go:89"}
Some ideas on how to solve this are below:
How reproducible:
Try to deploy a Hosted Cluster using the HyperShift operator with a proxy URL without a port (e.g. <example.proxy.com> instead of <example.proxy.com>:<port>) in the hostedCluster CR under "Spec.Configuration.Proxy.HTTPSProxy". This results in the error below in the kube-apiserver-proxy container: "missing port in address"
Actual results:
The kube-apiserver-proxy container returns "missing port in address"
Expected results:
The kube-apiserver-proxy container should not return "missing port in address".
Additional info:
This can be worked around by adding a ":" and a port number after the proxy IP/URL in the hostedCluster "Spec.Configuration.Proxy.HTTPSProxy" field.
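For illustration, a minimal sketch (an assumption, not the actual control-plane-operator code) of normalizing a proxy address so that net.Dial always receives host:port; the default port used here is only an example:

package main

import (
	"fmt"
	"net"
)

// ensurePort returns addr unchanged when it already has a port and appends
// defaultPort otherwise, so net.Dial("tcp", addr) does not fail with
// "missing port in address". The default port value is an assumption.
func ensurePort(addr, defaultPort string) string {
	if _, _, err := net.SplitHostPort(addr); err == nil {
		return addr
	}
	return net.JoinHostPort(addr, defaultPort)
}

func main() {
	fmt.Println(ensurePort("example.proxy.com", "3128"))      // example.proxy.com:3128
	fmt.Println(ensurePort("example.proxy.com:8080", "3128")) // unchanged
}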
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/500
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
AWS EBS, Azure Disk and Azure File operators are now built from cmd/ and pkg/; no code is used from the legacy/ directory, so we should remove it.
There are still test manifests in the legacy/ directory that are in use. They need to be moved somewhere else, and Dockerfile.*.test and the CI steps must be updated.
Technically, this is a copy of STOR-1797, but we need a bug to be able to backport aws-ebs changes to 4.15 and not use legacy/ directory there too.
When https://github.com/openshift/machine-config-operator/pull/4597 landed, bootstrap test startup began to fail because it doesn't install the required CRDs. This is because the CRDs no longer live in the MCO repo and the startup code needs to be reconciled to pick up the MCO-specific CRDs from the o/api repo.
Please review the following PR: https://github.com/openshift/ironic-image/pull/539
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13. Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs. The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28
We have reproduced the issue and we found an ordering cycle error in the journal log:
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.
Version-Release number of selected component (if applicable):
Using IPI on Azure, these are the versions involved in the current issue upgrading from 4.9 to 4.13:
version: 4.13.0-0.nightly-2024-07-23-154444
version: 4.12.0-0.nightly-2024-07-23-230744
version: 4.11.59
version: 4.10.67
version: 4.9.59
How reproducible:
Always
Steps to Reproduce:
1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.
Actual results:
Nodes become not ready
$ oc get nodes
NAME                                                 STATUS                        ROLES    AGE     VERSION
ci-op-g94jvswm-cc71e-998q8-master-0                  Ready                         master   6h14m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1                  Ready                         master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2                  NotReady,SchedulingDisabled   master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb   NotReady,SchedulingDisabled   worker   6h2m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6   Ready                         worker   6h4m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj   Ready                         worker   6h6m    v1.25.16+306a47e
And in the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
Expected results:
No ordering cycle error should happen and the upgrade should be executed without problems.
Additional info:
Description of problem:
When machineconfig fails to generate, we set upgradeable=false and degrade pools. The expectation is that the CO would also degrade after some time (normally 30 minutes) since master pool is degraded, but that doesn't seem to be happening. Based on our initial investigation, the event/degrade is happening but it seems to be being cleared.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Should be always
Steps to Reproduce:
1. Apply a wrong config, such as a bad image.config object:
   spec:
     registrySources:
       allowedRegistries:
       - test.reg
       blockedRegistries:
       - blocked.reg
2. Upgrade the cluster or roll out a new MCO pod
3. Observe that pools are degraded but the CO isn't
Actual results:
Expected results:
Additional info:
Occasional machine-config daemon panics in test-preview. For example this run has:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736
And the referenced logs include a full stack trace, the crux of which appears to be:
E0801 19:23:55.012345 2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 127 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2424b80?, 0x4166150?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0})
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d
github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208)
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65
github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208})
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
Looks like ~15% impact in the CI runs that CI Search turns up.
Run lots of CI. Look for MCD panics.
CI Search results above.
No hits.
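A minimal, illustrative sketch of the kind of defensive guard that avoids this class of panic in an informer event handler, assuming (as the stack trace suggests) that the handler can run before its listers are initialized; the types and names below are simplified stand-ins, not the MCO's actual code:

```go
package main

import "fmt"

// Node stands in for *corev1.Node; the real handler would use the Kubernetes
// API types and generated listers.
type Node struct{ Name string }

type manager struct {
	// listPools is nil until informer caches have synced and listers are wired up.
	listPools func() ([]string, error)
}

// handleNodeEvent guards against both an unexpected object and uninitialized
// state before touching anything that could be nil.
func (m *manager) handleNodeEvent(obj interface{}) {
	node, ok := obj.(*Node)
	if !ok || node == nil {
		fmt.Printf("unexpected object in node event handler: %T\n", obj)
		return
	}
	if m.listPools == nil {
		fmt.Println("pool lister not initialized yet, skipping event")
		return
	}
	pools, err := m.listPools()
	if err != nil {
		fmt.Println("listing pools:", err)
		return
	}
	fmt.Printf("node %s matched against %d pools\n", node.Name, len(pools))
}

func main() {
	m := &manager{}                            // lister not yet wired up
	m.handleNodeEvent(&Node{Name: "master-0"}) // skipped safely instead of panicking
}
```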
Description of problem:
Infrastructure object with platform None is ignored by node-joiner tool
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Run the node-joiner add-nodes command
Actual results:
Currently the node-joiner tool retrieves the platform type from the kube-system/cluster-config-v1 config map
Expected results:
Retrieve the platform type from the infrastructure cluster object
Additional info:
Description of problem:
All openstack-cinder-csi-driver-node pods are in CrashLoopBackOff status during IPI installation with a proxy configured:
2024-10-18 11:27:41.936 | NAMESPACE NAME READY STATUS RESTARTS AGE
2024-10-18 11:27:41.946 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-9dkwz 1/3 CrashLoopBackOff 61 (59s ago) 106m
2024-10-18 11:27:41.956 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-cdf2d 1/3 CrashLoopBackOff 53 (19s ago) 90m
2024-10-18 11:27:41.966 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-chnj6 1/3 CrashLoopBackOff 61 (85s ago) 106m
2024-10-18 11:27:41.972 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-fwgg4 1/3 CrashLoopBackOff 53 (32s ago) 90m
2024-10-18 11:27:41.979 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-h5qg8 1/3 CrashLoopBackOff 61 (88s ago) 106m
2024-10-18 11:27:41.989 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-jbzj9 1/3 CrashLoopBackOff 52 (42s ago) 90m
The pod complains with below:
2024-10-18T11:20:57.226298852Z W1018 11:20:57.226085 1 main.go:87] Failed to GetOpenStackProvider: Get "https://10.46.44.29:13000/": dial tcp 10.46.44.29:13000: i/o timeout
It looks like it is not using the proxy to reach the OSP API.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-16-094159
Must-gather for 4.18 proxy installation (& must-gather for successful 4.17 proxy installation for comparison) in private comment.
After changing internalJoinSubnet and internalTransitSwitchSubnet on day 2 and performing live migration, the ovnkube node pod crashed.
The network configuration is shown below; the service CIDR uses the same subnet as the OVN default internalTransitSwitchSubnet:
clusterNetwork:
- cidr: 100.64.0.0/15
  hostPrefix: 23
serviceNetwork:
- 100.88.0.0/16
and then:
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.82.0.0/16"}}}}}'
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalTransitSwitchSubnet": "100.69.0.0/16"}}}}}'
with error:
start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: EmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:100.254.0.0/17 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:
{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5}DisablePacke
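A minimal sketch of the kind of up-front validation that would catch this configuration, checking whether the chosen (or default) internalTransitSwitchSubnet overlaps the serviceNetwork; cidrsOverlap is an illustrative helper, not the actual OVN-Kubernetes validation code:

```go
package main

import (
	"fmt"
	"net"
)

// cidrsOverlap reports whether two CIDRs share any addresses: they overlap
// when either network contains the other's base address.
func cidrsOverlap(a, b string) (bool, error) {
	_, na, err := net.ParseCIDR(a)
	if err != nil {
		return false, err
	}
	_, nb, err := net.ParseCIDR(b)
	if err != nil {
		return false, err
	}
	return na.Contains(nb.IP) || nb.Contains(na.IP), nil
}

func main() {
	serviceNetwork := "100.88.0.0/16"
	transitSwitchSubnet := "100.88.0.0/16" // the default subnet, per this report
	overlap, err := cidrsOverlap(serviceNetwork, transitSwitchSubnet)
	if err != nil {
		panic(err)
	}
	fmt.Println("serviceNetwork overlaps internalTransitSwitchSubnet:", overlap) // true
}
```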
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/376
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For example, the toggle buttons on the Node and Pod logs pages don't have unique identifiers, so it's hard to locate these buttons during automation.
The `Select a path` toggle button has the definition
<button class="pf-v5-c-menu-toggle" type="button" aria-label="Select a path" aria-expanded="false">
  <span class="pf-v5-c-menu-toggle__text">openshift-apiserver</span>
  <span class="pf-v5-c-menu-toggle__controls">...........
</button>
The `Select a log file` toggle button
<button class="pf-v5-c-menu-toggle" type="button" aria-expanded="false">
  <span class="pf-v5-c-menu-toggle__text">Select a log file </span><span class="pf-v5-c-menu-toggle__controls">.......
</button>
Since we have many toggle buttons on the page, it's quite hard to locate them without distinguishable identifiers.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Placeholder for bumping CAPO in the installer.
Description of problem:
QE Liang Quan requested a review of https://github.com/openshift/origin/pull/28912 and the OWNERS file doesn't reflect current staff available to review.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
N/A
Steps to Reproduce:
1. 2. 3.
Actual results:
OWNERS file contains:
- danehans
- frobware
- knobunc
- Miciah
- miheer
- sgreene570
Expected results:
Add new OWNERS as reviewers/approvers:
- alebedev87
- candita
- gcs278
- rfredette
- Thealisyed
- grzpiotrowski
Move old OWNERS to emeritus_approvers:
- danehans
- sgreene570
Additional info:
Example in https://github.com/openshift/cluster-ingress-operator/blob/master/OWNERS
Component Readiness has found a potential regression in the following test:
operator conditions control-plane-machine-set
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-03T00:00:00Z
End Time: 2024-08-09T23:59:59Z
Success Rate: 92.05%
Successes: 81
Failures: 7
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 429
Failures: 0
Flakes: 0
Description of problem:
Operator is not getting installed. There are multiple install plans getting created/deleted for the same operator. There is no error indicated in the subscription or anywhere else. The bundle unpacking job is completed.
Images: quay.io/nigoyal/odf-operator-bundle:v0.0.1 quay.io/nigoyal/odf-operator-catalog:v0.0.1
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
Create the below manifests:
---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    openshift.io/cluster-monitoring: "true"
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/audit-version: v1.25
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.25
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.25
  name: openshift-storage
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: odf-operatorgroup
  namespace: openshift-storage
spec:
  targetNamespaces:
  - openshift-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: odf-catalogsource
  namespace: openshift-storage
spec:
  grpcPodConfig:
    securityContextConfig: legacy
  displayName: Openshift Data Foundation
  image: quay.io/nigoyal/odf-operator-catalog:v0.0.1
  priority: 100
  publisher: ODF
  sourceType: grpc
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-subscription
  namespace: openshift-storage
spec:
  channel: alpha
  name: odf-operator
  source: odf-catalogsource
  sourceNamespace: openshift-storage
Actual results:
Operator is not getting installed.
Expected results:
Operator should get installed.
Additional info:
The bundle is a unified bundle created from multiple bundles.
Slack Discussion: https://redhat-internal.slack.com/archives/C3VS0LV41/p1726026365936859
Description of the problem:
When attempting to install a spoke cluster, the AgentClusterInstall is not being generated correctly due to release image certificate not being trusted
- lastProbeTime: "2024-08-20T20:10:16Z" lastTransitionTime: "2024-08-20T20:10:16Z" message: "The Spec could not be synced due to backend error: failed to get release image 'quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3'. Please ensure the releaseImage field in ClusterImageSet '4.17.0' is valid, (error: command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=false --icsp-file=/tmp/icsp-file98462205 quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3 --registry-config=/tmp/registry-config740495490' exited with non-zero exit code 1: \nFlag --icsp-file has been deprecated, support for it will be removed in a future release. Use --idms-file instead.\nerror: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3: Get \"https://quay.io/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\n)."
How reproducible:
Intermittent
Steps to reproduce:
1. Attempt to create cluster resources after assisted-service is running
Actual results:
AgentClusterInstall fails due to certificate errors
Expected results:
The registry housing the release image has its certificate verified correctly
Additional Info:
Restarting the assisted-service pod fixes the issue. It seems like there is a race condition between the operator setting up the configmap with the correct contents and the assisted pod starting and mounting the configmap to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
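For illustration, a minimal sketch of how a CA bundle mounted at a path like /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem typically backs TLS verification; if the pod starts before the operator has populated the ConfigMap, the pool lacks the registry CA and requests fail with exactly the "unknown authority" error above. This is not assisted-service's actual implementation:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func clientWithCABundle(bundlePath string) (*http.Client, error) {
	pem, err := os.ReadFile(bundlePath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		// An empty or partially written bundle leaves the pool without the
		// registry CA, producing "certificate signed by unknown authority".
		return nil, fmt.Errorf("no certificates parsed from %s", bundlePath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}

func main() {
	c, err := clientWithCABundle("/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem")
	if err != nil {
		fmt.Println("cannot build client:", err)
		return
	}
	resp, err := c.Get("https://quay.io/v2/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```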
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/111
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-machine-api-provider-aws-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
While upgrading the cluster from the web console, the below warning message was observed:
~~~
Warning alert: Admission Webhook Warning
ClusterVersion version violates policy 299 - "unknown field \"spec.desiredUpdate.channels\"", 299 - "unknown field \"spec.desiredUpdate.url\""
~~~
There are no such fields in the ClusterVersion YAML for which the warning message fired. From the documentation here: https://docs.openshift.com/container-platform/4.16/rest_api/config_apis/clusterversion-config-openshift-io-v1.html it's possible to see that "spec.desiredUpdate" exists, but there is no mention of the values "channels" or "url" under desiredUpdate.
Note: This is not impacting the cluster upgrade. However, it is creating confusion among customers due to the warning message.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Everytime
Steps to Reproduce:
1. Install a cluster of version 4.16.4
2. Upgrade the cluster from the web console to the next minor version
Actual results:
The Admission Webhook Warning described above is shown in the web console during the upgrade.
Expected results:
Upgrade should proceed with no such warnings.
Additional info:
Description of problem:
Upon upgrade of 4.16.15, OLM is failing to upgrade operator cluster service versions due to a TLS validation error. From the OLM controller manager pod, logs show this:
oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
"tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")"
It's also observed in the api-server-operator logs that many webhooks are affected with the following errors:
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-8445495998-s6wgd | grep "failed to connect" | tail
W1018 21:44:07.641047 1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:44:08.647623 1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:53:58.542660 1 degraded_webhook.go:147] failed to connect to webhook "clusterautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
This is causing the OLM controller to hang and is failing to install/upgrade operators based on the OLM controller logs.
How reproducible:
Very reproducible upon upgrade from 4.16.14 to 4.16.15 on any OpenShift Dedicated or ROSA OpenShift cluster.
Steps to Reproduce:
1. Install an OSD or ROSA cluster at 4.16.14 or below
2. Upgrade to 4.16.15
3. Attempt to install or upgrade an operator via a new ClusterServiceVersion
Actual results:
# API SERVER OPERATOR
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-666b796d8b-lqp56 | grep "failed to connect" | tail
W1013 20:59:49.131870 1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
W1013 20:59:50.147945 1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
# OLM
$ oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
2024/10/13 12:00:08 http: TLS handshake error from 10.128.18.80:53006: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
2024/10/14 11:45:05 http: TLS handshake error from 10.130.19.10:36766: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
Expected results:
no tls validation errors upon upgrade or installation of operators via OLM
Additional info:
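For context on the "parent certificate cannot sign this kind of certificate" message: Go's x509 verifier rejects a chain whose parent certificate is not marked as a CA or lacks the certificate-signing key usage. A small, illustrative inspection sketch (not the OLM code); the parent.pem path is a placeholder:

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// inspectParent loads a PEM certificate and reports whether it is even allowed
// to sign other certificates; a serving cert reused as a "CA" fails here, and
// Go reports "parent certificate cannot sign this kind of certificate".
func inspectParent(path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	block, _ := pem.Decode(data)
	if block == nil {
		return fmt.Errorf("no PEM block in %s", path)
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return err
	}
	fmt.Printf("subject=%q IsCA=%v certSign=%v\n",
		cert.Subject.CommonName,
		cert.IsCA,
		cert.KeyUsage&x509.KeyUsageCertSign != 0)
	return nil
}

func main() {
	if err := inspectParent("parent.pem"); err != nil {
		fmt.Println("error:", err)
	}
}
```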
Description of problem:
On the route creation page, when "Secure Route" is checked and "Edge" or "Re-encrypt" TLS termination is selected, the help text under "Certificates" reads "TLS certificates for edge and re-encrypt termination. If not specified, the router's default certificate is used." The apostrophe in "router's" is rendered incorrectly and should display as a plain "router's".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-18-003538 4.18.0-0.nightly-2024-09-17-060032
How reproducible:
Always
Steps to Reproduce:
1.Check on route creation page, when check on "Secure Route", select "Edge" or "Re-encrypt" TLS termination. 2. 3.
Actual results:
1. There is "TLS certificates for edge and re-encrypt termination. If not specified, the router's default certificate is used.
Expected results:
1. "router's" should be "router's"
Additional info:
As an engineer I would like to have a functional test that makes sure the ETCD recovery function works as expected without deploying a full OCP cluster or HostedCluster.
Alternatives:
Description of problem:
On overview page's getting started resources card, there is "OpenShift LightSpeed" link when this operator is available on the cluster, the text should be updated to "OpenShift Lightspeed" to keep consistent with operator name.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133 4.16.0-0.nightly-2024-08-08-111530
How reproducible:
Always
Steps to Reproduce:
1. Check overview page's getting started resources card, 2. 3.
Actual results:
1. There is "OpenShift LightSpeed" link in "Explore new features and capabilities"
Expected results:
1. The text should be "OpenShift Lightspeed" to keep consistent with the operator name.
Additional info:
Description of the problem:
When provisioning a hosted cluster using a ZTP workflow to create BMH and NodePool CRs, corresponding agents are created for the BMHs, but those agents do not get added to the hostedCluster as they are not set to spec.approved=true
This is a recent change in behavior, and appears to be related to this commit, which was meant to allow BMH CRs to be safely restored by OADP in DR scenarios.
Manually approving the agents results in success.
Setting the PAUSE_PROVISIONED_BMHS boolean to false also results in success.
How reproducible:
Always
Steps to reproduce:
1. Create BMH and NodePool for HostedCluster
2. Observe creation of agents on cluster
3. Observe agents do not join cluster
Actual results:
Agents exist, are not added to nodepool
Expected results:
Agents and their machines are added to the nodepool and the hosted cluster sees nodes appear.
Test is:
[sig-arch][Early] Operators low level operators should have at least the conditions we had in 4.17 [Suite:openshift/conformance/parallel]
Description of problem:
If the folder is undefined and the datacenter exists in a datacenter-based folder, the installer will create the entire path of folders from the root of vCenter, which is incorrect.
This does not occur if folder is defined.
An upstream bug was identified when debugging this:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/163
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/router/pull/623
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For light theme, the Lightspeed logo should use the multi-color version. For dark theme, the Lightspeed logo should use the single color version for both the button and the content.
Description of problem:
Get "https://openshift.default.svc/.well-known/oauth-authorization-server": tls: failed to verify certificate: x509: certificate is valid for localhost, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, kube-apiserver, kube-apiserver.ocm-production-2b0eqpjq13aaba19ncgajh1asp39602g-faldana-hcp.svc, kube-apiserver.ocm-production-2b0eqpjq13aaba19ncgajh1asp39602g-faldana-hcp.svc.cluster.local, api.faldana-hcp.rvvd.p3.openshiftapps.com, api.faldana-hcp.hypershift.local, not openshift.default.svc
Version-Release number of selected component (if applicable):
4.15.9
How reproducible:
stable
Steps to Reproduce:
Get "https://openshift.default.svc/.well-known/oauth-authorization-server"
Actual results:
x509: certificate is valid for ... kubernetes.default.svc ..., not openshift.default.svc
Expected results:
OK
Additional info:
Works fine with ROSA Classic. The context: customer is configuring access to the RHACS console via Openshift Auth Provider. Discussion: https://redhat-internal.slack.com/archives/C028JE84N59/p1715048866276889
Description of problem:
When using an internal publishing strategy, the client is not properly initialized, which causes a code path to be hit that dereferences a nil pointer.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy a private cluster
2. Segfault
Actual results:
Expected results:
Additional info:
Description of problem:
When user changes Infrastructure object, e.g. adds a new vCenter, the operator generates a new driver config (Secret named vsphere-csi-config-secret), but the controller pods are not restarted and use the old config.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly *after* 2024-08-09-031511
How reproducible: always
Steps to Reproduce:
Actual results: the controller pods are not restarted
Expected results: the controller pods are restarted
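A common pattern for making the controller pods pick up a regenerated config is to stamp a hash of the Secret's data onto the Deployment's pod template as an annotation, so any content change triggers a rollout. A minimal sketch of that idea; the annotation key is hypothetical and this is not the actual csi-operator code:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// configHash returns a stable hash of the secret's data; storing it as a pod
// template annotation makes the Deployment roll out whenever the data changes.
func configHash(secretData map[string][]byte) (string, error) {
	// json.Marshal sorts map keys, so equal content yields an equal hash.
	raw, err := json.Marshal(secretData)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum), nil
}

func main() {
	data := map[string][]byte{"cloud.conf": []byte("[VirtualCenter \"vc1\"]\n")}
	h, _ := configHash(data)
	// e.g. deployment.Spec.Template.Annotations["vsphere-csi.openshift.io/config-hash"] = h
	fmt.Println("config hash:", h)
}
```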
Description of problem:
The AWS Cluster API Provider (CAPA) runs a required check to resolve the DNS Name for load balancers it creates. If the CAPA controller (in this case, running in the installer) cannot resolve the DNS record, CAPA will not report infrastructure ready. We are seeing in some cases, that installations running on local hosts (we have not seen this problem in CI) will not be able to resolve the LB DNS name record and the install will fail like this:
DEBUG I0625 17:05:45.939796 7645 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw" namespace="openshift-cluster-api-guests" name="umohnani-4-16test-5ndjw" reconcileID="553beb3d-9b53-4d83-b417-9c70e00e277e" cluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw"
DEBUG Collecting applied cluster api manifests...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 15m0s: client rate limiter Wait returned an error: context deadline exceeded
We do not know why some hosts cannot resolve these records, but it could be something like issues with the local DNS resolver cache, DNS records are slow to propagate in AWS, etc.
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
Not reproducible / unknown -- this seems to be dependent on specific hosts and we have not determined why some hosts face this issue while others do not.
Steps to Reproduce:
n/a
Actual results:
Install fails because CAPA cannot resolve LB DNS name
Expected results:
As the DNS record does exist, install should be able to proceed.
Additional info:
Slack thread:
https://redhat-internal.slack.com/archives/C68TNFWA2/p1719351032090749
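For illustration, a minimal sketch of the kind of wait-for-DNS loop the "Waiting on API server ELB DNS name to resolve" check implies, retrying resolution until a timeout instead of failing on the first miss; the host name and timings are placeholders, not CAPA's actual code:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForDNS polls until the name resolves or the context expires.
func waitForDNS(ctx context.Context, name string, interval time.Duration) error {
	r := &net.Resolver{}
	for {
		if addrs, err := r.LookupHost(ctx, name); err == nil && len(addrs) > 0 {
			fmt.Println("resolved", name, "->", addrs)
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %s to resolve: %w", name, ctx.Err())
		case <-time.After(interval):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
	defer cancel()
	if err := waitForDNS(ctx, "example-elb.us-east-1.elb.amazonaws.com", 10*time.Second); err != nil {
		fmt.Println(err)
	}
}
```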
Description of problem:
When verifying OCPBUGS-38869 (or on 4.18), the MOSB stays in the updating state even though the build pod completed successfully, and an error is visible in the machine-os build pod.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Apply any MOSC
2. See that the build pod is successful
3. But the MOSB is still in the updating state
4. And an error can be seen in the machine-os build pod
Actual results:
I have applied below MOSC
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: abc
spec:
  machineConfigPool:
    name: worker
  buildOutputs:
    currentImagePullSecret:
      name: $(oc get -n openshift-machine-config-operator sa default -ojsonpath='{.secrets[0].name}')
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
    containerFile:
    - containerfileArch: noarch
      content: |-
        # Pull the centos base image and enable the EPEL repository.
        FROM quay.io/centos/centos:stream9 AS centos
        RUN dnf install -y epel-release
        # Pull an image containing the yq utility.
        FROM docker.io/mikefarah/yq:latest AS yq
        # Build the final OS image for this MachineConfigPool.
        FROM configs AS final
        # Copy the EPEL configs into the final image.
        COPY --from=yq /usr/bin/yq /usr/bin/yq
        COPY --from=centos /etc/yum.repos.d /etc/yum.repos.d
        COPY --from=centos /etc/pki/rpm-gpg/RPM-GPG-KEY-* /etc/pki/rpm-gpg/
        # Install cowsay and ripgrep from the EPEL repository into the final image,
        # along with a custom cow file.
        RUN sed -i 's/\$stream/9-stream/g' /etc/yum.repos.d/centos*.repo && \
            rpm-ostree install cowsay ripgrep
EOF
$ oc get machineosconfig
NAME   AGE
abc    45m
$ oc logs build-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5 -f
...
Copying blob sha256:a8157ed01dfc7fe15c8f2a86a3a5e30f7fcb7f3e50f8626b32425aaf821ae23d
Copying config sha256:4b15e94c47f72b6c082272cf1547fdd074bd3539b327305285d46926f295a71b
Writing manifest to image destination
+ return 0
$ oc get machineosbuild
NAME                                                              PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
worker-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5-builder   False      True       False       False         False
$ oc logs machine-os-builder-654fc664bb-qvjkn | grep -i error
I1003 16:12:52.463155 1 pod_build_controller.go:296] Error syncing pod openshift-machine-config-operator/build-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5: unable to update with build pod status: could not update MachineOSConfig "abc": MachineOSConfig.machineconfiguration.openshift.io "abc" is invalid: [observedGeneration: Required value, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Expected results:
MOSB should be successful
Additional info:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-upgrade/1836280498960207872 has a test for monitoring
The test needs to reliably produce results
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in the following test:
[sig-network] pods should successfully create sandboxes by adding pod to network
Probability of significant regression: 96.41%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-27T00:00:00Z
End Time: 2024-09-03T23:59:59Z
Success Rate: 88.37%
Successes: 26
Failures: 5
Flakes: 12
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 98.46%
Successes: 43
Failures: 1
Flakes: 21
Here is an example run.
We see the following signature for the failure:
namespace/openshift-etcd node/master-0 pod/revision-pruner-11-master-0 hmsg/b90fda805a - 111.86 seconds after deletion - firstTimestamp/2024-09-02T13:14:37Z interesting/true lastTimestamp/2024-09-02T13:14:37Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-11-master-0_openshift-etcd_08346d8f-7d22-4d70-ab40-538a67e21e3c_0(d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57): error adding pod openshift-etcd_revision-pruner-11-master-0 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57" Netns:"/var/run/netns/97dc5eb9-19da-462f-8b2e-c301cfd7f3cf" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-etcd;K8S_POD_NAME=revision-pruner-11-master-0;K8S_POD_INFRA_CONTAINER_ID=d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57;K8S_POD_UID=08346d8f-7d22-4d70-ab40-538a67e21e3c" Path:"" ERRORED: error configuring pod [openshift-etcd/revision-pruner-11-master-0] networking: Multus: [openshift-etcd/revision-pruner-11-master-0/08346d8f-7d22-4d70-ab40-538a67e21e3c]: error waiting for pod: pod "revision-pruner-11-master-0" not found
The same signature has been reported for both azure and x390x as well.
It is worth mentioning that the sdn-to-ovn transition adds some complication to our analysis. From the component readiness above, you will see most of the failures are for the job periodic-ci-openshift-release-master-nightly-X.X-upgrade-from-stable-X.X-e2e-metal-ipi-ovn-upgrade. This is a new job for 4.17 and therefore misses base stats in 4.16.
So we ask for:
Please review the following PR: https://github.com/openshift/must-gather/pull/441
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Update our CPO and HO dockerfiles to use appropriate base image versions.
Description of problem:
When running a Cypress test locally with auth disabled while logged in as kubeadmin (e.g., running pipeline-ci.feature within test-cypress-pipelines), the before-each hook fails because it expects there to be an empty message when we are actually logged in as kubeadmin.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
always
Steps to Reproduce:
1. Run console the ./contrib/oc-environment.sh way while logged into kubeadmin
2. Run pipeline-ci.feature within the test-cypress-pipelines yarn script in the frontend folder
Actual results:
The after-each hooks of the tests fail
Expected results:
The after-each hooks of the tests are allowed to pass
Additional info:
As OCP user, I want storage operators restarted quickly and newly started operator to start leading immediately without ~3 minute wait.
This means that the old operator should release its leadership after it receives SIGTERM and before it exits. Right now, storage operators fail to release the leadership in ~50% of cases.
Steps to reproduce:
This is hack'n'hustle work, not tied to any Epic; I'm using it just to get proper QE and to track which operators are being updated (see linked GitHub PRs).
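A minimal client-go sketch of leader election that releases the lease on SIGTERM via ReleaseOnCancel; this is illustrative, not the actual library-go wiring the storage operators use, and the lock name, namespace, and identity are placeholders:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cancel the context on SIGTERM so the election loop exits cleanly.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-operator-lock", Namespace: "example-namespace"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		ReleaseOnCancel: true, // release the lease on shutdown instead of letting it expire
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { klog.Info("started leading; run controllers here") },
			OnStoppedLeading: func() { klog.Info("stopped leading, lease released") },
		},
	})
}
```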
Description of problem:
See https://github.com/prometheus/prometheus/issues/14503 for more details
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:
# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF
2. Not all the samples will be ingested
3. If Prometheus continues scraping that target for a moment, the PrometheusDuplicateTimestamps will fire.
Actual results:
Expected results: all the samples should be ingested (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).
Additional info:
Regression introduced in Prometheus 2.52. Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/850
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oc/pull/1871
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The helm charts in the monitoring-plugin can currently deploy the monitoring-plugin either in its CMO state or with the acm-alerting feature flag enabled. Update them so that they also work with the incidents feature flag.
Description of problem:
Should save the release signature in the archive tar file instead of counting on the enterprise cache (or working-dir)
Version-Release number of selected component (if applicable):
oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) Prepare data for the enterprise registry using mirror2disk+disk2mirror mode with the following config and commands:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.15
`oc-mirror -c config-38037.yaml file://out38037 --v2`
`oc-mirror -c config-38037.yaml --from file://out38037 docker://my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com --v2 --dest-tls-verify=false`
2) Prepare the env to simulate the enclave cluster:
cat /etc/squid/squid.conf
http_port 3128
coredump_dir /var/spool/squid
acl whitelist dstdomain "/etc/squid/whitelist"
http_access allow whitelist
http_access deny !whitelist
cat /etc/squid/whitelist
my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com ------------- registry route (oc get route -n your registry app's project)
update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-88.qe.devcluster.openshift.com --- osus route (oc get route -n openshift-update-service)
sudo systemctl restart squid
export https_proxy=http://127.0.0.1:3128
export http_proxy=http://127.0.0.1:3128
Set the registry redirect with:
cat ~/.config/containers/registries.conf
[[registry]]
location = "quay.io"
insecure = false
blocked = false
mirror-by-digest-only = false
prefix = ""
[[registry.mirror]]
location = "my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com"
insecure = false
3) Simulate the enclave mirror with the same imagesetconfig with the command:
`oc-mirror -c config-38037.yaml file://new-folder --v2`
Actual results:
3) The mirror2disk failed with error :
I0812 06:45:26.026441 199941 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-417.qe.devcluster.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=a6097264-8b29-438f-9e71-4aba1e9ec32d
2024/08/12 06:45:26 [ERROR] : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=0f55261077557d1bb909c06b115e0c79b0025677be57ba2f045495c11e2443ee/signature-1": Forbidden
Expected results:
No error; the signature should be included in the archive tar file rather than counting on the enterprise cache (in customer usage, the enclave cluster may be on a different machine, or may not use the same directory).
Description of problem:
From the output of "oc adm upgrade --help": ... --to-latest=false: Use the next available version. ... seems like "Use the latest available version" is more appropriate.
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
100%
Steps to Reproduce:
1. [kni@ocp-edge119 ~]$ oc adm upgrade --help
Actual results:
... --to-latest=false: Use the next available version. ...
Expected results:
... --to-latest=false: Use the latest available version. ...
Additional info:
Description of problem:
NodePool Controller doesn't respect LatestSupportedVersion https://github.com/openshift/hypershift/blob/main/support/supportedversion/version.go#L19
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create HostedCluster / NodePool
2. Upgrade both HostedCluster and NodePool at the same time to a version higher than the LatestSupportedVersion
Actual results:
NodePool tries to upgrade to the new version while the HostedCluster ValidReleaseImage condition fails with: 'the latest version supported is: "x.y.z". Attempting to use: "x.y.z"'
Expected results:
NodePool ValidReleaseImage condition also fails
Additional info:
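A minimal sketch of the kind of version check the NodePool controller could share with the HostedCluster controller, using the github.com/blang/semver/v4 module (an assumed dependency for this sketch); the LatestSupportedVersion value and function name are illustrative:

```go
package main

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// latestSupportedVersion mirrors the idea of supportedversion.LatestSupportedVersion;
// the value here is only an example.
var latestSupportedVersion = semver.MustParse("4.17.0")

// validateReleaseVersion is an illustrative check: running the same comparison
// in the NodePool controller would make ValidReleaseImage fail consistently on
// both the HostedCluster and the NodePool.
func validateReleaseVersion(requested string) error {
	v, err := semver.Parse(requested)
	if err != nil {
		return fmt.Errorf("invalid version %q: %w", requested, err)
	}
	if v.GT(latestSupportedVersion) {
		return fmt.Errorf("the latest version supported is: %q. Attempting to use: %q",
			latestSupportedVersion, v)
	}
	return nil
}

func main() {
	fmt.Println(validateReleaseVersion("4.18.0")) // rejected
	fmt.Println(validateReleaseVersion("4.17.0")) // nil
}
```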
Description of problem:
IBM ROKS uses Calico as their CNI. In previous versions of OpenShift, OpenShiftSDN would create IPTable rules that would force local endpoint for DNS Service.
Starting in OCP 4.17 with the removal of SDN, IBM ROKS is not using OVN-K and therefore the local endpoint for the DNS service is not working as expected.
IBM ROKS is asking that the code block be restored to restore the functionality previously seen in OCP 4.16
Without this functionality IBM ROKS is not able to GA OCP 4.17
As a user of HyperShift, I want to be able to:
so that I can achieve
Description of criteria:
N/A
% oc adm release info quay.io/openshift-release-dev/ocp-release:4.14.33-multi -a ~/all-the-pull-secrets.json --pullspecs | grep apiserver apiserver-network-proxy
This does not require a design proposal.
This does not require a feature gate.
Description of the problem:
The ingress TLS certificate, which is the one presented to HTTP clients e.g. when requesting resources under *.apps.<cluster-name>.<base-domain>, is not signed by a certificate included in the cluster's CA certificates. This results in those ingress HTTP requests failing with the error: `tls: failed to verify certificate: x509: certificate signed by unknown authority`.
How reproducible:
100%
Steps to reproduce:
1. Before running an IBU, verify that the target cluster's ingress works properly:
2. Run an IBU.
3. Perform steps 1. and 2. again. You will see the error `curl: (60) SSL certificate problem: self signed certificate in certificate chain`.
Alternative steps using openssl:
1. Run an IBU
2. Download the cluster's CA bundle `oc config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 --decode > ca.crt`
3. Download the ingress certificate `openssl s_client -connect oauth.apps.target.ibo1.redhat.com:443 -showcerts </dev/null </dev/null 2>/dev/null | awk '/BEGIN CERTIFICAT/,/END CERTIFICATE/ {print}' > ingress.crt`
4. Try to verify the cert with the CA chain: `openssl verify -CAfile ca.crt ingress.crt` - this step fails.
Actual results:
Ingress HTTP requests using the cluster's CA TLS transport fail with unknown certificate authority error.
Expected results:
Ingress HTTP requests using the cluster's CA TLS transport should succeed.
Related to a component regression we found that looked like we had no clear test to catch, sample runs:
All three runs show a pattern. The actual test failures look unpredictable, some tests are passing at the same time, others fail to talk to the apiserver.
The pattern we see is 1 or more tests failing right at the start of e2e testing, disruption, etcd log messages indicating slowness, and etcd leadership state changes.
Because the tests are unpredictable, we'd like a test that catches this symptom. We think the safest way to do this is to look for disruption within x minutes of the first e2e test.
This would be implemented as a monitortest, likely somewhere around here: https://github.com/openshift/origin/blob/master/pkg/monitortests/kubeapiserver/legacykubeapiservermonitortests/monitortest.go
Although it would be reasonable to add a new monitortest in the parent package above this level.
The test would need to do the following:
Description of problem:
This is essentially an incarnation of the bug https://bugzilla.redhat.com/show_bug.cgi?id=1312444 that was fixed in OpenShift 3 but is now present again.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Select a template in the console web UI, try to enter a multiline value.
Actual results:
It's impossible to enter line breaks.
Expected results:
It should be possible to achieve entering a multiline parameter when creating apps from templates.
Additional info:
I also filed an issue here https://github.com/openshift/console/issues/13317. P.S. It's happening on https://openshift-console.osci.io, not sure what version of OpenShift they're running exactly.
Description of problem:
We need to bump the Kubernetes version to the latest API version OCP is using. This is what was done last time: https://github.com/openshift/cluster-samples-operator/pull/409 Find the latest stable version here: https://github.com/kubernetes/api This is described in the wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Version-Release number of selected component (if applicable):
How reproducible:
Not really a bug, but we're using OCPBUGS so that automation can manage the PR lifecycle (SO project is no longer kept up-to-date with release versions, etc.).
Description of problem:
Another panic occurred in https://issues.redhat.com/browse/OCPBUGS-34877?focusedId=25580631&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25580631 and should be fixed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
AdditionalTrustedCA is not wired correctly, so the configmap is not found by its operator. This feature is meant to be exposed by XCMSTRAT-590, but at the moment it seems to be broken.
Version-Release number of selected component (if applicable):
4.16.5
How reproducible:
Always
Steps to Reproduce:
1. Create a configmap containing a registry and PEM cert, like https://github.com/openshift/openshift-docs/blob/ef75d891786604e78dcc3bcb98ac6f1b3a75dad1/modules/images-configuration-cas.adoc#L17
2. Refer to it in .spec.configuration.image.additionalTrustedCA.name
3. image-registry-config-operator is not able to find the cm and the CO is degraded
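For reference, a minimal sketch of the two objects involved; names and namespace are placeholders and not taken from the report:
apiVersion: v1
kind: ConfigMap
metadata:
  name: registry-additional-ca        # placeholder name
  namespace: <hosted-cluster-namespace>
data:
  registry.example.com: |             # one key per registry hostname
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
---
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: <hosted-cluster-namespace>
spec:
  configuration:
    image:
      additionalTrustedCA:
        name: registry-additional-ca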
Actual results:
CO is degraded
Expected results:
certs are used.
Additional info:
I think we may be missing a copy of the configmap from the cluster NS to the target NS. The copy should also be deleted when the original is deleted.
% oc get hc -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd -o jsonpath="{.items[0].spec.configuration.image.additionalTrustedCA}" | jq
{
  "name": "registry-additional-ca-q9f6x5i4"
}
% oc get cm -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd registry-additional-ca-q9f6x5i4
NAME                              DATA   AGE
registry-additional-ca-q9f6x5i4   1      16m
logs of cluster-image-registry operator
E0814 13:22:32.586416 1 imageregistrycertificates.go:141] ImageRegistryCertificatesController: unable to sync: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found, requeuing
CO is degraded
% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
console 4.16.5 True False False 3h58m
csi-snapshot-controller 4.16.5 True False False 4h11m
dns 4.16.5 True False False 3h58m
image-registry 4.16.5 True False True 3h58m ImageRegistryCertificatesControllerDegraded: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found
ingress 4.16.5 True False False 3h59m
insights 4.16.5 True False False 4h
kube-apiserver 4.16.5 True False False 4h11m
kube-controller-manager 4.16.5 True False False 4h11m
kube-scheduler 4.16.5 True False False 4h11m
kube-storage-version-migrator 4.16.5 True False False 166m
monitoring 4.16.5 True False False 3h55m
Description of problem:
When an image is referenced by tag and digest, oc-mirror skips the image
Version-Release number of selected component (if applicable):
How reproducible:
Do mirror to disk and disk to mirror using the registry.redhat.io/redhat/redhat-operator-index:v4.16 and the operator multiarch-tuning-operator
Steps to Reproduce:
1. mirror to disk
2. disk to mirror
Actual results:
docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 (Operator bundles: [multiarch-tuning-operator.v0.9.0] - Operators: [multiarch-tuning-operator]) error: Invalid source name docker://localhost:55000/kubebuilder/kube-rbac-proxy:v0.13.1:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522: invalid reference format
Expected results:
The image should be mirrored
Additional info:
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/128
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a tuned profile with the annotation tuned.openshift.io/deferred: "update" is created before labeling the target node, and the node is then labeled with profile=, the value of kernel.shmmni is applied immediately, but it shows the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile]. After rebooting the node, kernel.shmmni is restored to its default value instead of being set to the expected value.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an OCP cluster with the latest 4.18 nightly version
2. Create the tuned profile before labeling the node; please refer to issue 1 in the doc https://docs.google.com/document/d/1h-7AIyqf7sHa5Et2XF7a-RuuejwVkrjhiFFzqZnNfvg/edit if you want to reproduce the issue (a hedged example profile is sketched below)
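A minimal sketch of what such a Tuned object could look like, assuming illustrative names and sysctl values (the real profile from the linked doc may differ):
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
  annotations:
    # defer applying the profile update until the next node restart
    tuned.openshift.io/deferred: "update"
spec:
  profile:
  - name: openshift-profile
    data: |
      [main]
      summary=Custom profile with a deferred kernel.shmmni update
      include=openshift-node
      [sysctl]
      kernel.shmmni=8192
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile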
Actual results:
The message [The TuneD daemon profile is waiting for the next node restart: openshift-profile] is shown, and after the node reboot the sysctl value reverts to its default instead of keeping the expected value.
Expected results:
It should show the message [TuneD profile applied] when executing oc get profile, and the sysctl value should remain as expected after the node reboot.
Additional info:
Description of problem:
The [`4.15.0` `oc` client](https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.15.0/ppc64le/openshift-client-linux.tar.gz) introduces strange (MISSING) output in the `oc adm must-gather --help` output.
Version-Release number of selected component (if applicable):
4.15.0 and higher
How reproducible:
With 4.15.0 and higher you can run the reproducer steps.
Steps to Reproduce:
1. curl -O -L https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.15.0/ppc64le/openshift-client-linux.tar.gz
2. untar
3. ./oc adm must-gather --help
Actual results:
# ./oc adm must-gather --help --volume-percentage=30: Specify maximum percentage of must-gather pod's allocated volume that can be used. If this limit is exceeded, must-gather will stop gathering, but still copy gathered data. Defaults to 30%!(MISSING)
Expected results:
No (MISSING) content in the output
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/207
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/console/pull/14238
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
Trying to create a cluster from UI , fails.
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
Create VPC and subnets with the following configs [refer to attached CF template]:
Subnets (subnets-pair-default) in CIDR 10.0.0.0/16
Subnets (subnets-pair-134) in CIDR 10.134.0.0/16
Subnets (subnets-pair-190) in CIDR 10.190.0.0/16
Create a cluster into subnets-pair-134; the bootstrap process fails [see attached log-bundle logs]:
level=debug msg=I0605 09:52:49.548166 937 loadbalancer.go:1262] "adding attributes to load balancer" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" attrs=[{"Key":"load_balancing.cross_zone.enabled","Value":"true"}]
level=debug msg=I0605 09:52:49.909861 937 awscluster_controller.go:291] "Looking up IP address for DNS" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" dns="yunjiang29781a-86-rvqd9-int-19a9485653bf29a1.elb.us-east-2.amazonaws.com"
level=debug msg=I0605 09:52:53.483058 937 reflector.go:377] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: forcing resync
level=debug msg=Fetching Bootstrap SSH Key Pair...
Checking security groups:
<infraid>-lb allows 10.0.0.0/16:6443 and 10.0.0.0/16:22623
<infraid>-apiserver-lb allows 10.0.0.0/16:6443 and 10.134.0.0/16:22623 (and 0.0.0.0/0:6443)
Are these settings correct?
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-03-060250
How reproducible:
Always
Steps to Reproduce:
1. Create subnets using the attached CF template
2. Create a cluster into the subnets whose CIDR is 10.134.0.0/16
Actual results:
Bootstrap process fails.
Expected results:
Bootstrap succeeds.
Additional info:
No issues if creating the cluster into subnets-pair-default (10.0.0.0/16).
No issues if there is only one CIDR in the VPC, e.g. set VpcCidr to 10.134.0.0/16 in https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
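As a hedged illustration of installing into the 10.134.0.0/16 subnets, the relevant install-config fields might look like the sketch below; the subnet IDs are placeholders and not taken from the report:
networking:
  machineNetwork:
  - cidr: 10.134.0.0/16
platform:
  aws:
    region: us-east-2
    subnets:
    - subnet-aaaaaaaaaaaaaaaaa   # private subnet from subnets-pair-134 (placeholder ID)
    - subnet-bbbbbbbbbbbbbbbbb   # public subnet from subnets-pair-134 (placeholder ID)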
Description of problem:
When using SecureBoot, TuneD reports the following error because debugfs access is restricted:
tuned.utils.commands: Writing to file '/sys/kernel/debug/sched/migration_cost_ns' error: '[Errno 1] Operation not permitted: '/sys/kernel/debug/sched/migration_cost_ns''
tuned.plugins.plugin_scheduler: Error writing value '5000000' to 'migration_cost_ns'
This issue has been reported with the following tickets:
As this is a confirmed limitation of the NTO due to the TuneD component, we should document this as a limitation in the OpenShift Docs:
https://docs.openshift.com/container-platform/4.16/nodes/nodes/nodes-node-tuning-operator.html
Expected Outcome:
Please review the following PR: https://github.com/openshift/telemeter/pull/543
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/146
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The smoke test for OLM run by the OpenShift e2e suite is specifying an unavailable operator for installation, causing it to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Always (when using 4.17+ catalog versions)
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The customer uses the Azure File CSI driver, and without this they cannot make use of the Azure Workload Identity work, which was one of the banner features of OCP 4.14. This feature is currently available in 4.16; however, it will take the customer 3-6 months to validate 4.16 and start its rollout, putting their plans to complete a large migration to Azure by end of 2024 at risk. Could you please backport either the 1.29.3 feature for Azure Workload Identity or rebase our Azure File CSI driver in 4.14 and 4.15 to at least 1.29.3, which includes the desired feature.
Version-Release number of selected component (if applicable):
azure-file-csi-driver in 4.14 and 4.15
- In 4.14, azure-file-csi-driver is version 1.28.1
- In 4.15, azure-file-csi-driver is version 1.29.2
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.14 with Azure Workload Managed Identity
2. Try to configure Managed Workload Identity with Azure File CSI: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/workload-identity-static-pv-mount.md
Actual results:
It is not usable.
Expected results:
Azure Workload Identity should be manage with Azure File CSi as part of the whole feature
Additional info:
Description of problem:
The sort function on the NetworkPolicies page is incorrect after enabling pagination.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-17-060032
How reproducible:
Always
Steps to Reproduce:
1. Create multiple resources for NetworkPolicies
2. Navigate to Networking -> NetworkPolicies page -> NetworkPolicies tab
3. Make sure the option '15 per page' has been selected
4. Click the 'Name' column button to sort the table
Actual results:
The sort result is not correct. PFA: https://drive.google.com/file/d/12-eURLqMPZM5DNxfAPoWzX1CJr0Wyf_u/view?usp=drive_link
Expected results:
Table data can be sorted by using resource name, even if pagination is enabled
Additional info:
In analytics events, console sends the Organization.id from OpenShift Cluster Manager's Account Service, rather than the Organization.external_id. The external_id is meaningful company-wide at Red Hat, while the plain id is only meaningful within OpenShift Cluster Manager. You can use id to lookup external_id in OCM, but it's an extra step we'd like to avoid if possible.
cc Ali Mobrem
Description of problem:
co/ingress is always good even when the operator pod logs an error: 2024-07-24T06:42:09.580Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
100%
Steps to Reproduce:
1. Install an AWS cluster
2. Update ingresscontroller/default, adding "endpointPublishingStrategy.loadBalancer.allowedSourceRanges", e.g.:
spec:
  endpointPublishingStrategy:
    loadBalancer:
      allowedSourceRanges:
      - 1.1.1.2/32
3. The above setting drops most traffic to the LB, so some operators become degraded
Actual results:
co/authentication and console degraded but co/ingress is still good $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.17.0-0.nightly-2024-07-20-191204 False False True 22m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-aws.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) console 4.17.0-0.nightly-2024-07-20-191204 False False True 22m RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ingress 4.17.0-0.nightly-2024-07-20-191204 True False False 3h58m check the ingress operator log and see: 2024-07-24T06:59:09.588Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Expected results:
co/ingress status should reflect the real condition timely
Additional info:
Even though the co/ingress status can be updated in some scenarios, it is always less sensitive than authentication and console. We always rely on authentication/console to know whether routes are healthy, which makes the ingress canary route meaningless.
Since we are not going to support the addition of a second vCenter as a day-2 operation, we need to block users from doing this.
It looks like the must gather pods are the worst culprits but these are not actually considered to be platform pods.
Step 1: Exclude must gather pods from this test.
Step 2: Research the other failures.
Description of the problem:
the GPU data in our host inventory is wrong
How reproducible:
Always
Steps to reproduce:
1.
2.
3.
Actual results:
"gpus": [ \{ "address": "0000:00:0f.0" } ],
Expected results:
Description of problem:
OCPBUGS-42772 is verified, but testing found that the oauth-server panics with OAuth 2.0 IDP names that contain whitespace.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-31-190119
How reproducible:
Always
Steps to Reproduce:
1. Set up a Google IDP as below:
$ oc create secret generic google-secret-1 --from-literal=clientSecret=xxxxxxxx -n openshift-config
$ oc edit oauth cluster
spec:
  identityProviders:
  - google:
      clientID: 9745..snipped..apps.googleusercontent.com
      clientSecret:
        name: google-secret-1
      hostedDomain: redhat.com
    mappingMethod: claim
    name: 'my Google idp'
    type: Google
...
Actual results:
oauth-server panic:
$ oc get po -n openshift-authentication NAME READY STATUS RESTARTS oauth-openshift-59545c6f5-dwr6s 0/1 CrashLoopBackOff 11 (4m10s ago) ... $ oc logs -p -n openshift-authentication oauth-openshift-59545c6f5-dwr6s Copying system trust bundle I1101 03:40:09.883698 1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="serving-cert::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.crt::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.key" I1101 03:40:09.884046 1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com" I1101 03:40:10.335739 1 audit.go:340] Using audit backend: ignoreErrors<log> I1101 03:40:10.347632 1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController panic: parsing "/oauth2callback/my Google idp": at offset 0: invalid method "/oauth2callback/my"goroutine 1 [running]: net/http.(*ServeMux).register(...) net/http/server.go:2738 net/http.(*ServeMux).Handle(0x29844c0?, {0xc0008886a0?, 0x2984420?}, {0x2987fc0?, 0xc0006ff4a0?}) net/http/server.go:2701 +0x56 github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthenticationHandler(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450}) github.com/openshift/oauth-server/pkg/oauthserver/auth.go:407 +0x11ad github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthorizeAuthenticationHandlers(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450}) github.com/openshift/oauth-server/pkg/oauthserver/auth.go:243 +0x65 github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).WithOAuth(0xc0006c28c0, {0x2982500, 0xc0000aca80}) github.com/openshift/oauth-server/pkg/oauthserver/auth.go:108 +0x21d github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth(0xc0006c28c0, {0x2982500?, 0xc0000aca80?}, 0xc000785888) github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:342 +0x45 k8s.io/apiserver/pkg/server.completedConfig.New.func1({0x2982500?, 0xc0000aca80?}) k8s.io/apiserver@v0.29.2/pkg/server/config.go:825 +0x28 k8s.io/apiserver/pkg/server.NewAPIServerHandler({0x252ca0a, 0xf}, {0x2996020, 0xc000501a00}, 0xc0005d1740, {0x0, 0x0}) k8s.io/apiserver@v0.29.2/pkg/server/handler.go:96 +0x2ad k8s.io/apiserver/pkg/server.completedConfig.New({0xc000785888?, {0x0?, 0x0?}}, {0x252ca0a, 0xf}, {0x29b41a0, 0xc000171370}) k8s.io/apiserver@v0.29.2/pkg/server/config.go:833 +0x2a5 github.com/openshift/oauth-server/pkg/oauthserver.completedOAuthConfig.New({{0xc0005add40?}, 0xc0006c28c8?}, {0x29b41a0?, 0xc000171370?}) github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:322 +0x6a github.com/openshift/oauth-server/pkg/cmd/oauth-server.RunOsinServer(0xc000451cc0?, 0xc000810000?, 0xc00061a5a0) github.com/openshift/oauth-server/pkg/cmd/oauth-server/server.go:45 +0x73 github.com/openshift/oauth-server/pkg/cmd/oauth-server.(*OsinServerOptions).RunOsinServer(0xc00030e168, 0xc00061a5a0) github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:108 +0x259 github.com/openshift/oauth-server/pkg/cmd/oauth-server.NewOsinServerCommand.func1(0xc00061c300?, {0x251a8c8?, 0x4?, 0x251a8cc?}) github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:46 +0xed 
github.com/spf13/cobra.(*Command).execute(0xc000780008, {0xc00058d6c0, 0x7, 0x7}) github.com/spf13/cobra@v1.7.0/command.go:944 +0x867 github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a3b08) github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.7.0/command.go:992 k8s.io/component-base/cli.run(0xc0001a3b08) k8s.io/component-base@v0.29.2/cli/run.go:146 +0x290 k8s.io/component-base/cli.Run(0xc00061a5a0?) k8s.io/component-base@v0.29.2/cli/run.go:46 +0x17 main.main() github.com/openshift/oauth-server/cmd/oauth-server/main.go:46 +0x2de
Expected results:
No panic
Additional info:
Tried in an old env like 4.16.20 with the same steps, no panic:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.20   True        False         95m     Cluster version is 4.16.20
$ oc get po -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE
oauth-openshift-7dfcd8c8fd-77ltf   1/1     Running   0          116s
oauth-openshift-7dfcd8c8fd-sr97w   1/1     Running   0          89s
oauth-openshift-7dfcd8c8fd-tsrff   1/1     Running   0          62s
Description of problem:
New monitor test api-unreachable-from-client-metrics does not pass in MicroShift. Since this is a monitor test, there is no way to skip it and a fix is needed. This test is breaking the conformance job for MicroShift, which is critical for the job to become blocking.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Run conformance over MicroShift.
Steps to Reproduce:
1. 2. 3.
Actual results:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-release-4.18-periodics-e2e-aws-ovn-ocp-conformance/1828583537415032832
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/images/pull/191
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc-mirror should fail when the call to the Cincinnati API fails.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) Set up a squid proxy
2) Use the following imagesetconfig to mirror OCP:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.18'
      maxVersion: '4.15.18'
oc-mirror -c config.yaml file://out38037 --v2
Actual results:
2) oc-mirror failed to get the Cincinnati API, but oc-mirror just logs an error, states that there are 0 images to copy, and continues:
oc-mirror -c config-38037.yaml file://out38037 --v2
2024/08/13 04:27:41 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/08/13 04:27:41 [INFO] : 👋 Hello, welcome to oc-mirror
2024/08/13 04:27:41 [INFO] : ⚙️ setting up the environment for you...
2024/08/13 04:27:41 [INFO] : 🔀 workflow mode: mirrorToDisk
2024/08/13 04:27:41 [INFO] : 🕵️ going to discover the necessary images...
2024/08/13 04:27:41 [INFO] : 🔍 collecting release images...
I0813 04:27:41.388376 203687 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=1454eaf7-7f41-4678-ae88-30d4957e24f9
2024/08/13 04:27:41 [ERROR] : get release images: error list APIRequestError: channel "stable-4.15": RemoteFailed: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=1454eaf7-7f41-4678-ae88-30d4957e24f9": Forbidden
2024/08/13 04:27:41 [WARN] : error during graph image processing - SKIPPING: Get "https://api.openshift.com/api/upgrades_info/graph-data": Forbidden
2024/08/13 04:27:41 [INFO] : 🔍 collecting operator images...
2024/08/13 04:27:41 [INFO] : 🔍 collecting additional images...
2024/08/13 04:27:41 [INFO] : 🚀 Start copying the images...
2024/08/13 04:27:41 [INFO] : images to copy 0
2024/08/13 04:27:41 [INFO] : === Results ===
2024/08/13 04:27:41 [INFO] : 📦 Preparing the tarball archive...
2024/08/13 04:27:41 [INFO] : mirror time : 464.620593ms
2024/08/13 04:27:41 [INFO] : 👋 Goodbye, thank you for using oc-mirror
Expected results:
When the Cincinnati API (api.openshift.com) is not reachable, oc-mirror should fail immediately.
Expected results:
networking-console-plugin deployment has the required-scc annotation
Additional info:
The deployment does not have any annotation about it
CI warning
# [sig-auth] all workloads in ns/openshift-network-console must set the 'openshift.io/required-scc' annotation annotation missing from pod 'networking-console-plugin-7c55b7546c-kc6db' (owners: replicaset/networking-console-plugin-7c55b7546c); suggested required-scc: 'restricted-v2'
Please review the following PR: https://github.com/openshift/installer/pull/8962
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The network section will be delivered using the networking-console-plugin through the cluster-network-operator.
So we have to remove the section from here to avoid duplication.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
Actual results:
Service, Route, Ingress and NetworkPolicy are defined two times in the section
Expected results:
Service, Route, Ingress and NetworkPolicy are defined only one time in the section
Additional info:
From david:
pod/metal3-static-ip-set in namespace/openshift-machine-api should trip some kind of test due to restartCount=5 on its container. Let's say any pod that is created after the install is finished should have restartCount=0, and see how many fail that criterion.
We should have a test that makes sure that pods created after the cluster is up do not have a non-zero restartCount.
Please review the following PR: https://github.com/openshift/openshift-controller-manager/pull/330
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1140
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When installing a GCP cluster with the CAPI based method, the kube-api firewall rule that is created always uses a source range of 0.0.0.0/0. In the prior terraform based method, internal published clusters were limited to the network_cidr. This change opens up the API to additional sources, which could be problematic such as in situations where traffic is being routed from a non-cluster subnet.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster in GCP with publish: internal 2. 3.
Actual results:
Kube-api firewall rule has source of 0.0.0.0/0
Expected results:
Kube-api firewall rule has a more limited source of network_cidr
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/87
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tests:
are being disabled in https://github.com/openshift/kubernetes/blob/master/openshift-hack/e2e/annotate/rules.go
These tests should be enabled after the 1.31 kube bump in oc
Component Readiness has found a potential regression in the following test:
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry
Probability of significant regression: 98.02%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0
Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.
The problem appears to be a permissions error preventing the pods from starting:
2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied
Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489
Both 4.17 and 4.18 nightlies bumped rhcos and in there is an upgrade like this:
container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch
With slightly different versions in each stream, but both were on 3-2.231.
Hits other tests too:
operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]
Description of problem:
Checked in 4.17.0-0.nightly-2024-09-18-003538: the default thanos-ruler retention time is 24h, not the 15d mentioned in https://github.com/openshift/cluster-monitoring-operator/blob/release-4.17/Documentation/api.md#thanosrulerconfig. The issue exists in 4.12+.
$ for i in $(oc -n openshift-user-workload-monitoring get sts --no-headers | awk '{print $1}'); do echo $i; oc -n openshift-user-workload-monitoring get sts $i -oyaml | grep retention; echo -e "\n"; done
prometheus-user-workload
- --storage.tsdb.retention.time=24h
thanos-ruler-user-workload
- --tsdb.retention=24h
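For context, a hedged sketch of how the retention could be set explicitly in the UWM config; the field names follow the CMO api.md referenced above and the values are illustrative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h
    thanosRuler:
      retention: 24h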
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-18-003538
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
The default thanos-ruler retention time is documented as 15d in api.md.
Expected results:
It should be documented as 24h.
Additional info:
Related with https://issues.redhat.com/browse/OCPBUGS-23000
The cluster-autoscaler by default evicts all those pods, including those coming from daemon sets. In the case of EFS CSI drivers, which are mounted as NFS volumes, this is causing stale NFS mounts, and application workloads are not terminated gracefully.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
- While scaling down a node from the cluster-autoscaler-operator, the DS pods are being evicted.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
CSI pods should not be evicted by the cluster autoscaler (at least not prior to workload termination), as eviction might produce data corruption.
Additional info:
It is possible to disable CSI pod eviction by adding the following annotation on the CSI driver pod: cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
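A minimal sketch of where that annotation would go on a CSI node DaemonSet; all names and the image are placeholders, not taken from the report:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node                  # placeholder
  namespace: openshift-cluster-csi-drivers
spec:
  selector:
    matchLabels:
      app: efs-csi-node
  template:
    metadata:
      labels:
        app: efs-csi-node
      annotations:
        # ask the cluster autoscaler not to evict this DaemonSet pod
        cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
    spec:
      containers:
      - name: csi-driver
        image: <csi-driver-image>     # placeholder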
Description of problem:
In discussion of https://issues.redhat.com/browse/OCPBUGS-37862 it was noticed that sometimes the haproxy-monitor is reporting "API is not reachable through HAProxy" which means it is removing the firewall rule to direct traffic to HAProxy. This is not ideal since it means keepalived will likely fail over the VIP and it may be breaking existing connections to HAProxy.
There are a few possible reasons for this. One is that we only require two failures of the healthcheck in the monitor to trigger this removal. For something we don't expect to need to happen often during normal operation of a cluster, this is probably a bit too harsh, especially since we only check every 6 seconds so it's not like we're looking for quick error detection. This is more a bootstrapping thing and a last ditch effort to keep the API functional if something has gone terribly wrong in the cluster. If it takes a few more seconds to detect an outage that's better than detecting outages that aren't actually outages.
The first thing we're going to try to fix this is to increase what amounts to the "fall" value for the monitor check. If that doesn't eliminate the problem we will have to look deeper at the HAProxy behavior during node reboots.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Panic seen in below CI job when run the below command
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-insights-operator-release-4.17-insights-operator-e2e-tests-periodic (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
Panic observed:
E0910 09:00:04.283647 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 268 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x36c8b40, 0x5660c90}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ce8540?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x36c8b40?, 0x5660c90?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode(0xc000d6e360, {0x3abd580?, 0xc00224a608}, {0x3abd580?, 0xc001bd2308}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:585 +0x1f3 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:976 +0xea k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001933f70, {0x3faaba0, 0xc000759710}, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000750f70, 0x3b9aca00, 0x0, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000dc2630) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52 created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 261 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x33204b3]
Version-Release number of selected component (if applicable):
How reproducible:
Seen in this CI run -https://prow.ci.openshift.org/job-history/test-platform-results/logs/periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic
Steps to Reproduce:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
Actual results:
Expected results:
No panic to observe
Additional info:
Failures beginning in 4.18.0-0.ci-2024-10-08-185524
Suite run returned error: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required ) error running options: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required )error: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required )
Undiagnosed panic detected in pod
This test is failing the majority of the time on hypershift jobs.
The failure looks straightforward:
{ pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-18-18.ec2.internal_cluster-policy-controller.log.gz:E1015 12:53:31.246033 1 scctopsamapping.go:336] "Observed a panic" panic="unknown volume type: image" panicGoValue="&errors.errorString{s:\"unknown volume type: image\"}" stacktrace=<
We're close to losing the history needed to see exactly when, but it looks like this may have started Oct 3rd.
For job runs with the test failure see here.
Description of problem:
When hosted zones are created in the cluster creator account, and the ingress role is a role in the cluster creator account, the private link controller fails to create DNS records in the local zone.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Set up shared vpc infrastructure in which the hosted zone and local zone exist in the cluster creator account. 2. Create a hosted cluster
Actual results:
The hosted cluster never gets nodes to join because it is missing records in the local hosted zone.
Expected results:
The hosted cluster completes installation with available nodes.
Additional info:
Creating the hosted zones in the cluster creator account is an alternative way of setting up shared vpc infrastructure. In this mode, the role to assume for creating DNS records is a role in the cluster creator account and not in the vpc account.
Description of problem:
The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test frequently fails on OpenStack platform, which in turn also causes the [sig-network] can collect pod-to-service poller pod logs and [sig-network] can collect host-to-service poller pod logs tests to fail.
These failures happen frequently in vh-mecha, for example in all CSI jobs, such as 4.16-e2e-openstack-csi-cinder.
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/442
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
These two tests have been flaking more often lately. The TestLeaderElection flake is partially (but not solely) connected to OCPBUGS-41903. TestOperandProxyConfiguration seems to fail in the teardown while waiting for other cluster operators to become available. Although these flakes aren't customer facing, they considerably slow development cycles (due to retests) and also consume more resources than they should (every retest runs on a new cluster), so we want to backport the fixes.
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16, 4.15, 4.14
How reproducible:
Sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
installing into GCP shared VPC with BYO hosted zone failed with error "failed to create the private managed zone"
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-multi-2024-08-26-170521
How reproducible:
Always
Steps to Reproduce:
1. Pre-create the dns private zone in the service project, with the zone's dns name like "<cluster name>.<base domain>" and binding to the shared VPC
2. Activate the service account having minimum permissions, i.e. no permission to bind a private zone to the shared VPC in the host project (see [1])
3. "create install-config" and then insert the interested settings (e.g. see [2])
4. "create cluster"
Actual results:
It still tries to create a private zone, which is unexpected.
failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create the private managed zone: failed to create private managed zone: googleapi: Error 403: Forbidden, forbidden
Expected results:
The installer should use the pre-configured dns private zone, rather than try to create a new one.
Additional info:
The 4.16 epic adding the support: https://issues.redhat.com/browse/CORS-2591
One PROW CI test which succeeded using Terraform installation: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-4.17-upgrade-from-stable-4.17-gcp-ipi-xpn-mini-perm-byo-hosted-zone-arm-f28/1821177143447523328
The PROW CI test which failed: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-gcp-ipi-xpn-mini-perm-byo-hosted-zone-amd-f28-destructive/1828255050678407168
Description of problem:
OCP Conformance MonitorTests can fail depending on the order in which the CSI driver pods and ClusterRole are applied. The SA, ClusterRole, and ClusterRoleBinding should likely be applied first, prior to the deployment/pods.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
60%
Steps to Reproduce:
1. Create IPI cluster on IBM Cloud 2. Run OCP Conformance w/ MonitorTests
Actual results:
: [sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel] { fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "ibm-vpc-block-csi-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[2].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/ibm-vpc-block-csi-node -n openshift-cluster-csi-drivers happened 7 times Ginkgo exit error 1: exit with code 1}
Expected results:
No pod creation failures using the wrong SCC, because the ClusterRole/ClusterRoleBinding, etc. had not been applied yet.
Additional info:
Sorry, I did not see IBM Cloud Storage listed in the targeted Component for this bug, so I selected the generic Storage component. Please forward as necessary/possible.
Items to consider:
ClusterRole: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/privileged_role.yaml
ClusterRoleBinding: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/node_privileged_binding.yaml
The ibm-vpc-block-csi-node-* pods eventually reach running using the privileged SCC.
I do not know whether it is possible to stage the resources that get created first, within the CSI Driver Operator https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/9288e5078f2fe3ce2e69a4be3d94622c164c3dbd/pkg/operator/starter.go#L98-L99
Prior to the CSI Driver daemonset (`node.yaml`), perhaps order matters within the list.
Example of failure in CI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8235/pull-ci-openshift-installer-master-e2e-ibmcloud-ovn/1836521032031145984
Description of problem:
On "Search" page, search resource VolumeSnapshots/VolumeSnapshotClasses and filter with label, the filter doesn't work.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-08-024331
How reproducible:
always
Steps to Reproduce:
To reproduce the VolumeSnapshot bug:
1. Go to the VolumeSnapshots page under a namespace that has VolumeSnapshotClaims defined (e.g. openshift-pipelines)
2. Create two new VolumeSnapshots - use one of the defined VolumeSnapshotClaims during creation.
3. Click on one of the created VolumeSnapshots and add a label - e.g. "demoLabel".
4. Go to the "Search" page, choose the "VolumeSnapshots" resource, filter with any label, e.g. "demoLabel", "something"
To reproduce the VolumeSnapshotClass bug:
1. Go to the VolumeSnapshotClasses page
2. Create two new VolumeSnapshotClasses.
3. Click on one of the created VolumeSnapshotClasses and add a label - e.g. "demoLabel".
4. Go to the "Search" page, choose the "VolumeSnapshots" resource, filter with any label, e.g. "demoLabel", "something"
Actual results:
1. Label filters don't work.
2. VolumeSnapshots are listed without being filtered by label.
3. VolumeSnapshotClasses are listed without being filtered by label.
Expected results:
1. VSs and VSCs should be filtered by label.
Additional info:
Screenshots VS: https://drive.google.com/drive/folders/1GEUgOn5FXr-l3LJNF-FWBmn-bQ8uE_rD?usp=sharing
Screenshots VSC: https://drive.google.com/drive/folders/1gI7PNCzcCngfmFT5oI1D6Bask5EPsN7v?usp=sharing
Description of problem:
%s is not populated with the authoritativeAPI value when the cluster is enabled for migration.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-23-182657
How reproducible:
Always
Steps to Reproduce:
Set the featuregate as below:
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - MachineAPIMigration
Update - use `oc edit --subresource status` to add the `.status.authoritativeAPI` field to see the behaviour of the pausing, e.g.:
oc edit --subresource status machineset.machine.openshift.io miyadav-2709a-5v7g7-worker-eastus2
Actual results:
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-09-27T07:22:58Z"
    reason: AuthoritativeAPI is set to MachineAPI
    severity: The AuthoritativeAPI is set to %s
    status: "False"
    type: Paused
  fullyLabeledRepl
Expected results:
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-09-27T07:22:58Z"
    reason: AuthoritativeAPI is set to MachineAPI
    severity: The AuthoritativeAPI is set to MachineAPI
    status: "False"
    type: Paused
  fullyLabeledRepl
Additional info:
related to - https://issues.redhat.com/browse/OCPCLOUD-2565
message: 'The AuthoritativeAPI is set to '
reason: AuthoritativeAPIMachineAPI
severity: Info
status: "False"
type: Paused
Description of problem:
Specifying additionalTrustBundle in the HC doesn't propagate down to the worker nodes.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create a CM with the additionalTrustBundle (see the sketch below)
2. Specify the CM in HC.Spec.AdditionalTrustBundle
3. Debug the worker nodes and check whether the additionalTrustBundle has been updated
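A minimal sketch of the two objects from steps 1 and 2, assuming placeholder names and that the configmap carries the bundle under the ca-bundle.crt key:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-ca-bundle         # placeholder
  namespace: clusters          # namespace where the HostedCluster lives
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
---
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  additionalTrustBundle:
    name: user-ca-bundle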
Actual results:
The additionalTrustBundle hasn't propagated down to the nodes.
Expected results:
The additionalTrustBundle is propagated down to the nodes.
Additional info:
Description of problem:
A Redfish exception occurred while provisioning a worker using HW RAID configuration on an HP server with iLO 5:
step': 'delete_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: The attribute StorageControllers/Name is missing from the resource /redfish/v1/Systems/1/Storage/DE00A000
Spec used:
spec:
  raid:
    hardwareRAIDVolumes:
    - name: test-vol
      level: "1"
      numberOfPhysicalDisks: 2
      sizeGibibytes: 350
      online: true
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Provision an HPE worker with iLO 5 using Redfish 2. 3.
Actual results:
Expected results:
Additional info:
Software production has changed the key they want ART to sign with. ART is currently signing with the original key we were provided and sigstore-3.
Allow CMO tests to be linted as well.
Description of problem:
After configuring remote-write for UWM prometheus named "user-workload" in configmap named user-workload-monitoring-config, the proxyURL (same as cluster proxy resource) is not getting injected at all.
Version-Release number of selected component (if applicable):
4.16.4
How reproducible:
100%
Steps to Reproduce:
1. Configure the proxy custom resource in the RHOCP 4.16.4 cluster
2. Create the user-workload-monitoring-config configmap in the openshift-monitoring project
3. Inject the remote-write config (without specifically configuring a proxy for remote-write)
4. After saving the modification in the user-workload-monitoring-config configmap, check the remoteWrite config in the Prometheus user-workload CR. It does NOT contain the proxyUrl. Example snippet:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  [...]
  name: user-workload
  namespace: openshift-user-workload-monitoring
spec:
  [...]
  remoteWrite:
  - url: http://test-remotewrite.test.svc.cluster.local:9090   <<== No Proxy URL Injected
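For reference, a hedged sketch of the remote-write snippet as it would be added to the UWM configmap (the URL matches the example above; the expectation is that CMO then injects the cluster-wide proxyUrl into the generated Prometheus CR):
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      remoteWrite:
      - url: http://test-remotewrite.test.svc.cluster.local:9090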
Actual results:
UWM prometheus CR named "user-workload" doesn't inherit the proxyURL from cluster proxy resource.
Expected results:
UWM prometheus CR named "user-workload" should inherit proxyURL from cluster proxy resource and it should also respect noProxy which is configured in cluster proxy.
Additional info:
Description of problem:
CNO doesn't report, as a metric, when there is a network overlap when live migration is initiated.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/200
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-capi-operator-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Please review the following PR: https://github.com/openshift/thanos/pull/151
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
The GCP regions `us-east2` and `us-east3` don't have zones, and when these regions are used for creating a cluster, the installer crashes with the stack trace below.
$ openshift-install version
openshift-install 4.16.0-0.ci.test-2024-04-23-055943-ci-ln-z602w5b-latest
built from commit 0bbbb0261b724628c8e68569f31f86fd84669436
release image registry.build03.ci.openshift.org/ci-ln-z602w5b/release@sha256:a0df6e54dfd5d45e8ec6f2fcb07fa99cf682f7a79ea3bc07459d3ba1dbb47581
release architecture amd64
$ yq-3.3.0 r test4/install-config.yaml platform
gcp:
projectID: openshift-qe
region: us-east2
userTags:
- parentID: 54643501348
key: ocp_tag_dev
value: bar
- parentID: openshift-qe
key: Su.Shi-Jiang_Cheng_Zi
value: SHI NIAN
userLabels:
- key: createdby
value: installer-qe
- key: a
value: 8
$ yq-3.3.0 r test4/install-config.yaml credentialsMode
Passthrough
$ yq-3.3.0 r test4/install-config.yaml featureSet
TechPreviewNoUpgrade
$ yq-3.3.0 r test4/install-config.yaml metadata.name
jiwei-0424a
$
$ openshift-install create cluster --dir test4
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/installconfig/gcp.(*Client).GetMachineTypeWithZones(0xc0017a7f90?, {0x1f6dd998, 0x23ab4be0}, {0xc0007e6650, 0xc}, {0xc0007e6660, 0x8}, {0x7c2b604, 0xd})
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/client.go:142 +0x5e8
github.com/openshift/installer/pkg/asset/installconfig/gcp.ValidateInstanceType({0x1f6fe5e8?, 0xc0007e0428?}, 0xc001a7cde0, {0xc0007e6650?, 0x27f?}, {0xc0007e6660?, 0x40ffcf?}, {0xc000efe980, 0x0, 0x0}, ...)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:80 +0x6c
github.com/openshift/installer/pkg/asset/installconfig/gcp.validateInstanceTypes({0x1f6fe5e8, 0xc0007e0428}, 0xc00107f080)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:189 +0x4f7
github.com/openshift/installer/pkg/asset/installconfig/gcp.Validate({0x1f6fe5e8?, 0xc0007e0428}, 0xc00107f080)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:63 +0xf45
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).platformValidation(0xc0011d8f80)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:199 +0x21a
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).finish(0xc0011d8f80, {0x7c518a9, 0x13})
/go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:171 +0x6ce
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).Load(0xc0011d8f80, {0x1f69a550?, 0xc001155c70?})
/go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:112 +0x55
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x1f6c8080, 0xc0011d8ac0}, {0xc001163c6c, 0x4})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:264 +0x33f
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x1f6cc230, 0xc001199360}, {0x7c056f3, 0x2})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:247 +0x23a
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x7f88420a5000, 0x23a57c20}, {0x0, 0x0})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:247 +0x23a
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c5f440, {0x1f6ddab0, 0xc0011b6eb0}, {0x7f88420a5000, 0x23a57c20}, {0x0, 0x0})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:201 +0x1b1
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7fffbc31408f?, {0x1f6ddab0?, 0xc0011b6eb0?}, {0x7f88420a5000, 0x23a57c20}, {0x23a27e60, 0x8, 0x8})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x54
github.com/openshift/installer/pkg/asset/store.(*fetcher).FetchAndPersist(0xc001155c60, {0x1f6ddab0, 0xc0011b6eb0}, {0x23a27e60, 0x8, 0x8})
/go/src/github.com/openshift/installer/pkg/asset/store/assetsfetcher.go:47 +0x165
main.newCreateCmd.runTargetCmd.func3({0x7fffbc31408f?, 0x5?})
/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:301 +0x6a
main.newCreateCmd.runTargetCmd.func4(0xc000fdf600?, {0xc001199260?, 0x4?, 0x7c06e81?})
/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:315 +0x102
github.com/spf13/cobra.(*Command).execute(0x23a324c0, {0xc001199220, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc001005500)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1039
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:62 +0x385
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:36 +0x11d
Additional info: slack thread discussion https://redhat-internal.slack.com/archives/C01V1DP387R/p1713959395808119
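The panic happens in GetMachineTypeWithZones when the zone list for the region is empty and index 0 is accessed. A minimal sketch of the kind of guard that would avoid it (the interface and function names here are assumptions for illustration, not the installer's actual code):

```go
package gcpvalidation

import (
	"context"
	"fmt"
)

// zoneLister stands in for the installer's GCP client.
type zoneLister interface {
	Zones(ctx context.Context, project, region string) ([]string, error)
}

// validateRegionHasZones returns a normal validation error instead of letting
// an empty zone list cause an index-out-of-range panic later on.
func validateRegionHasZones(ctx context.Context, c zoneLister, project, region string) error {
	zones, err := c.Zones(ctx, project, region)
	if err != nil {
		return fmt.Errorf("failed to list zones in region %s: %w", region, err)
	}
	if len(zones) == 0 {
		return fmt.Errorf("region %s reports no zones; it cannot be used for cluster installation", region)
	}
	return nil
}
```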
1. Proposed title of this feature request
Collect number of resources in etcd
2. What is the nature and description of the request?
The number of resources is useful in several scenarios, like kube-apiserver high memory usage.
3. Why does the customer need this? (List the business requirements here)
The information will be useful for OpenShift Support when investigating scenarios like kube-apiserver high memory usage.
4. List any affected packages or components.
must-gather
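A rough sketch of how such a count could be gathered with the discovery and dynamic clients (illustrative only, not the must-gather implementation; listable-verb checks and pagination are omitted):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	disc := discovery.NewDiscoveryClientForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Ignore partial discovery errors; we only need the resource lists we got.
	lists, _ := disc.ServerPreferredResources()
	for _, list := range lists {
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil {
			continue
		}
		for _, res := range list.APIResources {
			objs, err := dyn.Resource(gv.WithResource(res.Name)).List(context.TODO(), metav1.ListOptions{})
			if err != nil {
				continue // skip resources we cannot list
			}
			fmt.Printf("%s/%s: %d objects\n", list.GroupVersion, res.Name, len(objs.Items))
		}
	}
}
```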
This feature conditionally creates a button within the VirtualizedTable component that allows clients to download the data within the table as comma-separated values (.csv).
Both PRs are needed to test the feature.
The PRs are
https://github.com/openshift/console/pull/14050
and
https://github.com/openshift/monitoring-plugin/pull/133
The monitoring-plugin passes a string called 'csvData', which contains metrics data formatted as comma-separated values. The console then consumes 'csvData' in the 'VirtualizedTable' component. 'VirtualizedTable' renders the 'Export as CSV' button only if this 'csvData' property is present; without it, the button will not render.
The console's CI/CD pipeline (tide) requires that issues have a valid Jira reference, presumably in this (OpenShift Console) board. This ticket is a duplicate of
https://issues.redhat.com/browse/OU-431
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
Component Readiness has found a potential regression in the following test:
[sig-storage] [Serial] Volume metrics Ephemeral should create volume metrics with the correct BlockMode PVC ref [Suite:openshift/conformance/serial] [Suite:k8s]
Probability of significant regression: 100.00%
Description of problem:
The AWS api-int load balancer is either misconfigured or buggy: it allows new connections while the apiserver is being shut down. The termination log has messages like "Request to %q (source IP %s, user agent %q) through a connection created very late in the graceful termination process (more than 80%% has passed), possibly a sign for a broken load balancer setup", and the in-cluster monitoring suite shows multiple one-second disruptions.
For troubleshooting OSUS cases, the default must-gather doesn't collect OSUS information, and an inspect of the openshift-update-service namespace is missing several OSUS-related resources like UpdateService, ImageSetConfiguration, and maybe more.
Create a specific must-gather image for OSUS (as there are for other operators/components [1]).
The Cluster API provider Azure has a deployment manifest that deploys Azure service operator from mcr.microsoft.com/k8s/azureserviceoperator:v2.6.0 image.
We need to set up OpenShift builds of the operator and update the manifest generator to use the OpenShift image.
Azure has split the API calls out of their provider so that they now use the service operator. We now need to ship the service operator as part of the CAPI operator to make sure that we can support CAPZ.
https://redhat-internal.slack.com/archives/C04TMSTHUHK/p1725998554253779
> We're using version 0.1.0, while the current version is 0.5.7
As an OpenShift engineer, I want to keep the vSphere provider up to date with the most current version of CAPI so that we don't fall behind and cause potential future problems.
For disconnected clusters, we will need to move to use ImageDigestMirrorSet (IDMS) since ImageContentSourcePolicy (ICSP) is currently deprecated and will eventually be removed.
There are several scenarios:
Currently we use the ICSP flag when using oc CLI commands. We need to use the IDMS flag instead.
To allow for easier injection of values and if/else switches, we should move the existing pod template in pod.yaml to a gotemplate.
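As a rough illustration of the direction (the template fields and values are placeholders, not the project's actual pod.yaml), a Go text/template makes value injection and conditional sections straightforward:

```go
package main

import (
	"os"
	"text/template"
)

// podTemplate is a placeholder manifest showing value injection and an
// if/else-style switch; the real pod.yaml content would replace it.
const podTemplate = `apiVersion: v1
kind: Pod
metadata:
  name: {{ .Name }}
spec:
  containers:
  - name: agent
    image: {{ .Image }}
{{- if .Privileged }}
    securityContext:
      privileged: true
{{- end }}
`

func main() {
	tmpl := template.Must(template.New("pod").Parse(podTemplate))
	err := tmpl.Execute(os.Stdout, struct {
		Name, Image string
		Privileged  bool
	}{Name: "example-pod", Image: "quay.io/example/agent:latest", Privileged: true})
	if err != nil {
		panic(err)
	}
}
```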
AC:
As a developer, I want to run yarn check-cycles in CI so that OCPBUGS-44017 won't occur again
Description of problem:
Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.
Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?
I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:
prev=''; grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log | sed 's/^.*changes: //' | while read -r line; do diff <(echo $prev | jq .) <(echo $line | jq .); prev=$line; echo "####"; done
It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:
####
4,5c4,5
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
< "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
> "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
< "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
> "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
####
The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
~20% failure rate in 4.18 vsphere-ovn-serial jobs
Steps to Reproduce:
Actual results:
operator rolls out unnecessary daemonset / deployment changes
Expected results:
don't roll out changes unless there is a spec change
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The goal is to collect metrics about some features used by the OpenTelemetry operator, because this will be useful for improving the product: knowing which features customers use, we can better focus our efforts on improving those features. A hedged sketch of how such gauges could be registered follows the metric list below.
opentelemetry_collector_receivers gauge that represents the number of OpenTelemetry collector instances that use a certain receiver
Labels
Cardinality: 12
opentelemetry_collector_exporters gauge that represents the number of OpenTelemetry collector instances that use a certain exporter
Labels
Cardinality: 9
opentelemetry_collector_processors gauge that represents the number of OpenTelemetry collector instances that use a certain processor
Labels
Cardinality: 11
opentelemetry_collector_extensions gauge that represents the number of OpenTelemetry collector instances that use a certain extension
Labels
Cardinality: 10
opentelemetry_collector_connectors gauge that represents the number of OpenTelemetry collector instances that use a certain connector
Labels
Cardinality: 2
opentelemetry_collector_info gauge that represents the number of OpenTelemetry collector instances that use a certain deployment type
Labels
Cardinality: 4
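A hedged sketch of how gauges like these could be registered and set with client_golang (the label name, the helper, and the registry choice are assumptions, not the operator's actual code):

```go
package telemetry

import (
	"github.com/prometheus/client_golang/prometheus"

	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// collectorReceivers mirrors the shape of the proposed
// opentelemetry_collector_receivers metric; the "type" label is an assumption.
var collectorReceivers = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "opentelemetry_collector_receivers",
		Help: "Number of OpenTelemetry collector instances using a given receiver.",
	},
	[]string{"type"},
)

func init() {
	crmetrics.Registry.MustRegister(collectorReceivers)
}

// RecordReceiverUsage sets one gauge sample per receiver type.
func RecordReceiverUsage(countsByReceiver map[string]int) {
	for receiver, n := range countsByReceiver {
		collectorReceivers.WithLabelValues(receiver).Set(float64(n))
	}
}
```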
This test failed 3 times in the last week with the following error:
KubeAPIErrorBudgetBurn was at or above info for at least 2m28s on platformidentification.JobType{Release} (maxAllowed=0s): pending for 1h33m52s, firing for 2m28s:

Sep 16 21:20:56.839 - 148s E namespace/openshift-kube-apiserver alert/KubeAPIErrorBudgetBurn alertstate/firing severity/critical ALERTS{alertname="KubeAPIErrorBudgetBurn", alertstate="firing", long="6h", namespace="openshift-kube-apiserver", prometheus="openshift-monitoring/k8s", severity="critical", short="30m"}
It didn't fail a single time in the previous month on 4.17, nor in the month before we shipped 4.16, so I'm proposing this as a blocker to be investigated. Below is the boilerplate Component Readiness text:
Component Readiness has found a potential regression in the following test:
[bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info
Probability of significant regression: 99.04%
Sample (being evaluated) Release: 4.17
Start Time: 2024-09-10T00:00:00Z
End Time: 2024-09-17T23:59:59Z
Success Rate: 85.71%
Successes: 18
Failures: 3
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 74
Failures: 0
Flakes: 0
Some of the E2E tests could be considered read-only, such as looping until a PromQL expression is true.
Additionally, some tests are non-disruptive: all their operations are performed within a temporary namespace without impacting the monitoring components' statuses.
We can run them with t.Parallel() to save some minutes (see the sketch after this list).
Also, we can:
Isolate specific tests to enable parallel execution
Enhance the resilience of some tests and fix those prone to errors.
Fix some tests that were running wrong checks.
Make some of the tests idempotent so they can be easily debugged and run locally.
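A minimal sketch of what marking a read-only test as parallel-safe could look like (the test name and the polling body are illustrative, not from the actual CMO suite):

```go
package e2e

import (
	"testing"
	"time"
)

// TestReadOnlyQueryExample only polls for a condition and touches no shared
// cluster state, so it can opt into the shared parallel pool.
func TestReadOnlyQueryExample(t *testing.T) {
	t.Parallel()

	deadline := time.Now().Add(2 * time.Minute)
	for {
		// Placeholder for "evaluate a PromQL expression and check the result".
		if conditionMet() {
			return
		}
		if time.Now().After(deadline) {
			t.Fatal("condition not met within 2 minutes")
		}
		time.Sleep(5 * time.Second)
	}
}

// conditionMet stands in for the real query helper.
func conditionMet() bool { return true }
```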
Description of problem:
The certrotation controller uses the applySecret/applyConfigmap functions from library-go to update secrets/configmaps. This controller has several replicas running in parallel, so it may overwrite changes applied by a different replica, which leads to unexpected signer updates and corrupted CA bundles. applySecret/applyConfigmap does an initial Get and then calls Update, which overwrites the changes made to the copy received from the informer. Instead, it should issue Update calls directly using the copy received from the informer, so that etcd would reject the change if the resourceVersion was updated in parallel.
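A minimal sketch of the suggested direction, assuming a plain client-go client and an informer-cached Secret (the function and its shape are illustrative, not library-go's actual API): mutate a deep copy of the informer object and call Update directly, so a stale resourceVersion surfaces as a Conflict instead of silently clobbering a parallel write.

```go
package certrotation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateSecretFromCache mutates a copy of the informer/lister object and sends
// it as-is, keeping its resourceVersion so the apiserver rejects the write if
// another replica updated the secret in the meantime.
func updateSecretFromCache(ctx context.Context, client kubernetes.Interface, cached *corev1.Secret, mutate func(*corev1.Secret)) (*corev1.Secret, error) {
	desired := cached.DeepCopy()
	mutate(desired)
	updated, err := client.CoreV1().Secrets(desired.Namespace).Update(ctx, desired, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		// Another replica won the race; requeue and retry with a fresher copy.
		return nil, err
	}
	return updated, err
}
```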
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On "Networking"->"NetworkPolicies" page, when "MultiNetworkPolicies disabled", on "NetworkPolicies" tab, select a project, eg "default" from dropdown list. Then click tab "MultiNetworkPolicies", and click back to "NetworkPolicies" tab, the project dropdown is set to "All Projects" automatically
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-150616 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1.On "Networking"->"NetworkPolicies" page, when "MultiNetworkPolicies disabled", on "NetworkPolicies" tab, select a project, eg "default" from dropdown list. Then click tab "MultiNetworkPolicies", and click back to "NetworkPolicies" tab 2. 3.
Actual results:
1. The project dropdown is set to "All Projects" automatically
Expected results:
1. The project dropdown should be set to "default" as originally selected.
Additional info:
Add Webb Scales and Baiju as helm owners.
The goal is to collect metrics about Cluster Logging Operator 6.y, so that we can track usage of features in the new release version.
"openshift_logging:log_forwarder_pipelines:sum" represents the number of logging pipelines managed by CLO per namespace.
Labels
The cardinality of the metric is "one per namespace", which for most clusters will be one.
"openshift_logging:log_forwarder_pipelines:count" represents the number of deployed ClusterLogForwarders per namespace.
Labels
The cardinality of the metric is "one per namespace", which for most clusters will be one.
"openshift_logging:log_forwarder_input_type:sum" represents the number of inputs managed by CLO per namespace.
Labels
The cardinality of the metric is "one per namespace and input type". I expect this to be two for most customers.
"openshift_logging:log_forwarder_output_type:sum" represents the number of outputs managed by CLO per namespace.
Labels
The cardinality of the metric is "one per namespace and output type". I expect most customers to use one or two output types.
"openshift_logging:vector_component_received_bytes_total:rate5m" represents current total log rate for a cluster for log collectors managed by CLO.
Labels
The cardinality of the metric is "one per namespace", which for most clusters will be one.
Component exposing the metric: https://github.com/openshift/cluster-logging-operator/blob/master/internal/metrics/telemetry/telemetry.go#L25-L47
The recording rules for these metrics are currently reviewed in this PR: https://github.com/openshift/cluster-logging-operator/pull/2823
As a developer, I want to have automated e2e testing on PRs so that I can make sure the changes for cluster-api-provider-ibmcloud are thoroughly tested.
The monitoring-plugin is still using PatternFly v4; it needs to be upgraded to PatternFly v5. This major version release deprecates components used by the monitoring-plugin, which will need to be replaced or removed to accommodate the version update.
We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124
Work to be done:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Delete the openshift-monitoring/monitoring-plugin-cert secret; SCO will re-create a new one with different content.
Actual results:
- monitoring-plugin is still using the old cert content.
- If the cluster doesn't show much activity, the hash may take time to be updated.
Expected results:
CMO should detect that exact change and run a sync to recompute and set the new hash.
Additional info:
- We shouldn't rely on another change to trigger the sync loop.
- CMO should maybe watch that secret? (Its name isn't known in advance.)
Whenever we update dependencies in the main module or the api module, compilation breaks for developers that are using a go workspace locally. We can ensure that the dependencies are kept in sync by running a 'go work sync' in a module where the hypershift repo is a symlinked child.
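As a rough sketch (the directory layout is an assumption, not the repo's actual setup), the workspace that such a `go work sync` run would operate on looks something like this, with the symlinked hypershift checkout and its api module both listed as use directives:

```go
// go.work in the developer's workspace module
go 1.22

use (
	.
	./hypershift
	./hypershift/api
)
```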
Along with disruption monitoring via external endpoint we should add in-cluster monitors which run the same checks over:
These tests should be implemented as deployments with anti-affinity so they land on different nodes. Deployments are used so that the nodes can be drained properly. The deployments write to the host disk, and on restart the pod picks up the existing data. When a special configmap is created, the pod stops collecting disruption data.
The external part of the test will create the deployments (and necessary RBAC objects) when the test is started, create the stop configmap when it ends, and collect data from the nodes. The test will expose the results on the intervals chart, so that the data can be used to find the source of disruption.
Description of problem:
[vmware-vsphere-csi-driver-operator] driver controller/node/webhook update events repeat pathologically
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-03-161006
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift cluster on vSphere with a 4.17 nightly build.
2. Upgrade the cluster to a 4.18 nightly build.
3. Check that the driver controller/node/webhook update events do not repeat pathologically.
CI failure record -> https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-vsphere-ovn-upgrade/1854191939318976512
Actual results:
In step 3: the driver controller/node/webhook update events repeat pathologically
Expected results:
In step 3: the driver controller/node/webhook update events should not repeat pathologically
Additional info:
Description of problem:
Setting userTags in the install-config file for AWS does not support all AWS-valid characters as per [1].

  platform:
    aws:
      region: us-east-1
      propagateUserTags: true
      userTags:
        key1: "Test Space"
        key2: value2

  ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters

The documentation at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/installation-config-parameters-aws.html#installation-configuration-parameters-optional-aws_installation-config-parameters-aws does not mention any restrictions. However, validation is done here: https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L106 which in turn refers to a regex here: https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L17 which allows these characters: `^[0-9A-Za-z_.:/=+-@]*$`

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-restrictions
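A quick self-contained check against the regex quoted above shows why the value with a space is rejected even though AWS allows spaces in tag values:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Regex copied from the installer validation linked above.
	valueRegex := regexp.MustCompile(`^[0-9A-Za-z_.:/=+-@]*$`)

	for _, v := range []string{"value2", "Test Space"} {
		fmt.Printf("%-12q allowed by installer regex: %v\n", v, valueRegex.MatchString(v))
	}
	// "value2" matches, "Test Space" does not because the space character is
	// outside the allowed class.
}
```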
Version-Release number of selected component (if applicable):
How reproducible:
100 %
Steps to Reproduce:
1. Create an install-config with a userTags value as mentioned in the description.
2. Run the installer.
Actual results:
Command failed with below error: ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters
Expected results:
Installer should run successfully.
Additional info:
When a userTags value contains a space, the installer fails to validate the install-config.
The ovnkube-node pods are crash looping with:
1010 23:12:06.421980 6605 ovnkube.go:137] failed to run ovnkube: [failed to initialize libovsdb NB client: failed to connect to unix:/var/run/ovn/ovnnb_db.sock: database OVN_Northbound validation error (8):
  database model contains a model for table Sample that does not exist in schema.
  database model contains a model for table Sampling_App that does not exist in schema.
  Mapper Error. Object type nbdb.ACL contains field SampleEst (*string) ovs tag sample_est: Column does not exist in schema.
  Mapper Error. Object type nbdb.NAT contains field Match (string) ovs tag match: Column does not exist in schema.
  database model contains a model for table Sample_Collector that does not exist in schema.
  Mapper Error. Object type nbdb.LogicalRouterPort contains field DhcpRelay (*string) ovs tag dhcp_relay: Column does not exist in schema.
  database model contains a model for table DHCP_Relay that does not exist in schema.
  database model contains a client index for table ACL that does not exist in schema,
failed to start node network controller: error in syncing cache for *v1.Pod informer]
The OVN builds for CS9 are old and have not been built with the latest changes. The team is working to build the RPMs, and once we have them, we need builds of ovn-kubernetes with the latest OVN RPMs to fix this issue.
Description of problem:
There are lots of customers that deploy clusters that are not directly connected to the Internet, so they use a corporate proxy. Customers have been unable to correctly understand how to configure the cluster-wide proxy for a new HostedCluster, and they are finding issues deploying the HostedCluster. For example, given the following configuration:

  apiVersion: hypershift.openshift.io/v1beta1
  kind: HostedCluster
  metadata:
    creationTimestamp: null
    name: cluster-hcp
    namespace: clusters
  spec:
    configuration:
      proxy:
        httpProxy: http://proxy.testlab.local:80
        httpsProxy: http://proxy.testlab.local:80
        noProxy: testlab.local,192.168.0.0/16

A customer would normally add the MachineNetwork CIDR and local domain to the noProxy variable. However, this causes a problem in OpenShift Virtualization: the Hosted Control Plane KAS won't be able to contact the nodes' kubelets, since pods will try to reach tcp/10250 through the proxy, causing an error. So in this scenario, the hub cluster ClusterNetwork CIDR also needs to be added to the noProxy variable:

  noProxy: testlab.local,192.168.0.0/16,10.128.0.0/14

However, I was unable to find this information in our documentation. Also, there is a known issue that is explained in the following KCS: https://access.redhat.com/solutions/7068827

The problem is that the HostedCluster deploys the control-plane-operator binary instead of the haproxy binary in the kube-apiserver-proxy pods under kube-system in the HostedCluster. The KCS explains that the problem is fixed, but it is not clear for the customer which subnetwork should be added to noProxy to trigger the logic that deploys the haproxy image, so that the proxy is not used to expose the kubernetes internal endpoint (172.20.0.1). The code seems to check whether the HostedCluster ClusterNetwork (10.132.0.0/14), the ServiceNetwork (172.31.0.0/16), or the internal kubernetes address (172.20.0.1) is listed in noProxy, in order to honor the noProxy setting and deploy the haproxy images.

This led us, through trial and error, to find the correct way to honor noProxy and allow the HostedCluster to work correctly: connecting from the kube-apiserver-proxy pods to the hosted KAS, and from the hosted KAS to the kubelets, bypassing the cluster-wide proxy.

The questions are:
1. Is it possible to add information to our documentation about the correct way to configure a HostedCluster using noProxy variables?
2. What is the correct subnet that needs to be added to the noProxy variable so that the haproxy images are deployed instead of the control-plane-operator, allowing the kube-apiserver-proxy pods to bypass the cluster-wide proxy?
Version-Release number of selected component (if applicable):
4.14.z, 4.15.z, 4.16.z
How reproducible:
Deploy a HostedCluster using noProxy variables
Steps to Reproduce:
1. 2. 3.
Actual results:
Components of the HostedCluster are still using the proxy, not honoring the noProxy values set.
Expected results:
Hosted Cluster should be able to deploy correctly.
Additional info:
There have been several instances where assisted would start downloading ClusterImageSet images and it could cause issues like
Possible solution ideas:
As mentioned in the previous review when this was added (https://github.com/openshift/assisted-service/pull/4650/files#r1044872735), the "late binding usecase would be broken for OKD", so to prevent this we should detect whether the infra-env is late-bound and not check for the image if it is.
The only time a requested ClusterImageSet is cached is when a Cluster is created.
This leads to problems such as
Installs of recent nightly/stable 4.16 SCOS releases are branded as OpenShift instead of OKD.
Testing on the following versions shows incorrect branding on oauth URL
4.16.0-0.okd-scos-2024-08-15-225023
4.16.0-0.okd-scos-2024-08-20-110455
4.16.0-0.okd-scos-2024-08-21-155613
Description of problem:
Bootstrap process failed because API_URL and API_INT_URL are not resolvable:

  Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
  Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
  Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap...
  Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane...
  Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API
  Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up

install logs:

  ...
  time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host"
  time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz"
  time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
  time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
  ...
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165
How reproducible:
Always.
Steps to Reproduce:
1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade
2. Create cluster
3.
Actual results:
Failed to complete bootstrap process.
Expected results:
See description.
Additional info:
I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15; currently, it fails at an earlier phase, see https://issues.redhat.com/browse/OCPBUGS-28969
This duplicate issue was created because openshift/console github bots require a valid CONSOLE Jira to be associated with all PRs.
Description
Migrate Developer View > Observe > Silences Tab code from openshift/console to openshift/monitoring-plugin. This is part of the ongoing effort to consolidate code between the Administrative and Developer Views of the Observe section.
Related Jira Issue
https://issues.redhat.com/browse/OU-257
Related PRs
As a HyperShift service provider, I want to be able to:
so that I can achieve
Description of criteria:
The cluster-baremetal-operator sets up a number of watches for resources using Owns(), which have no effect because the Provisioning CR does not (and should not) own any resources of the given type, or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace differ from those of the Provisioning CR.
The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.
The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.
See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.
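A hedged sketch of the suggested wiring (type names, import paths, the singleton name, and the exact builder signature depend on the controller-runtime version vendored by CBO, so treat this as illustrative rather than the operator's actual code):

```go
package controllers

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	metal3iov1alpha1 "github.com/openshift/cluster-baremetal-operator/apis/metal3.io/v1alpha1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// provisioningMapFunc maps every watched object to the Provisioning singleton
// so that a change to ClusterOperator/Proxy/Machine re-triggers its reconcile.
func provisioningMapFunc(ctx context.Context, _ client.Object) []reconcile.Request {
	return []reconcile.Request{{
		NamespacedName: types.NamespacedName{Name: "provisioning-configuration"},
	}}
}

func setupWatches(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&metal3iov1alpha1.Provisioning{}).
		Watches(&configv1.ClusterOperator{}, handler.EnqueueRequestsFromMapFunc(provisioningMapFunc)).
		Watches(&configv1.Proxy{}, handler.EnqueueRequestsFromMapFunc(provisioningMapFunc)).
		Watches(&machinev1beta1.Machine{}, handler.EnqueueRequestsFromMapFunc(provisioningMapFunc)).
		Complete(r)
}
```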
Description of problem:
This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446. Although both node topologies are equivalent, the PPC reported a false negative:

  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1.TBD 2. 3.
Actual results:
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Expected results:
The topologies match, so the PPC should work fine.
Additional info:
Description of problem:
The tests below fail on an ipv6primary dualstack cluster because the router deployed by the tests is not prepared for dualstack:
[sig-network][Feature:Router][apigroup:image.openshift.io] The HAProxy router should serve a route that points to two services and respect weights [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should respond with 503 to unrecognized hosts [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should serve routes that were created from an ingress [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io][apigroup:operator.openshift.io] The HAProxy router should support reencrypt to services backed by a serving certificate automatically [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host for overridden domains with a custom value [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host with a custom value [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should run even if it has no access to update status [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should serve the correct routes when scoped to a single namespace and label set [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
That is confirmed by accessing the router pod and checking the connectivity locally:
sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://127.0.0.1/Letter"
200
sh-4.4$ echo $?
0

sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://fd01:0:0:5::551/Letter"
000
sh-4.4$ echo $?
3

sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://[fd01:0:0:5::551]/Letter"
000
sh-4.4$ echo $?
7
The default router deployed in the cluster supports dualstack. Hence it is possible, and required, to update the router image configuration used in the tests so it can answer on both IPv4 and IPv6.
Version-Release number of selected component (if applicable): https://github.com/openshift/origin/tree/release-4.15/test/extended/router/
How reproducible: Always.
Steps to Reproduce: Run the tests in ipv6primary dualstack cluster.
Actual results: Tests failing as below:
<*errors.errorString | 0xc001eec080>:
last response from server was not 200:
{
s: "last response from server was not 200:\n",
}
occurred
Ginkgo exit error 1: exit with code 1
Expected results: Test passing
Looking at the logs for ironic-python-agent in a preprovisioning image, we get each log message twice - once directly from the agent process and once from podman:
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.834 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://10.9.53.20:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py:186
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.834 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://10.9.53.20:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py:186
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent self.api.heartbeat(
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent raise errors.HeartbeatError(error)
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.867 1 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 5.029721378959369
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent self.api.heartbeat(
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent raise errors.HeartbeatError(error)
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.867 1 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 5.029721378959369
This is confusing and unnecessary, especially as the two sets of logs can be interleaved (note also the non-monotonic timestamps in the third column).
The log above actually comes from a ZTP deployment (the one in OCPBUGS-44026), but the IPA configuration even for that ultimately comes from the image-customization-controller.
Currently there is no log driver flag passed to podman so we get the default, which is journald. We have the use_stderr option set in the IPA config so logs get written to stderr, which podman will send to journald. We are also running the podman pod in the foreground, which I think will cause it to also pass the stderr to systemd, which also sends it to the journal.
I believe another side-effect of this misconfiguration is that one lot of logs show up red in journalctl and the other don't. Ideally we would have colour-coded logs by severity. This can be turned on for the stderr logging by setting log_color in the IPA config, but it might be better to enable use-journal instead of use-stderr so we get native logging to journald.
Duplication of Issue from OU Jira board
Duplicate of https://issues.redhat.com/browse/OU-259
The openshift/console CI needs a valid issue on the OpenShift Console Jira board.
Overview
This PR aims to consolidate code for the Developer perspective > Observe > Dashboard page. We will remove the code that renders this page from openshift/console. The page will now be rendered by openshift/monitoring-plugin through this PR: openshift/monitoring-plugin#167.
Testing
Must be tested with PR: openshift/monitoring-plugin#167
This PR #14192 removes the Developer perspective > Observe > Dashboard page
This PR openshift/monitoring-plugin#167 adds the Developer perspective > Observe > Dashboard page
Expected Results: All behaviors should be the same as before the migration.
The goal is to collect metrics about some features used by the Tempo operator, because this will be useful for improving the product: knowing which features customers use, we can better focus our efforts on improving those features.
tempo_operator_tempostack_multi_tenancy gauge that represents the number of TempoStack instances that use multi-tenancy
Labels
tempo_operator_tempostack_managed gauge that represents the number of TempoStack instances that are managed/unmanaged
Labels
tempo_operator_tempostack_jaeger_ui gauge that represents the number of TempoStack instances with the Jaeger UI enabled/disabled
Labels
tempo_operator_tempostack_storage_backend gauge that represents the number of TempoStack instances that use a certain storage type
Labels
Description of problem:
Customer wants to boot a VM using the Assisted Installer ISO. The customer originally installed the OpenShift Container Platform cluster using version 4.13; however, in the meantime the cluster was updated to 4.16. As a result, the customer updated the field "osImageVersion" to "4.16.11". This led to the new ISO being generated as expected. However, when reviewing the "status" of the InfraEnv, they can still see the following URL:

  isoDownloadURL: 'https://assisted-image-service-multicluster-engine.cluster.example.com/byapikey/<REDACTED>/4.13/x86_64/minimal.iso'

Other artifacts are also still showing "?version=4.13":

  kernel: 'https://assisted-image-service-multicluster-engine.cluster.example.com/boot-artifacts/kernel?arch=x86_64&version=4.13'
  rootfs: 'https://assisted-image-service-multicluster-engine.cluster.example.com/boot-artifacts/rootfs?arch=x86_64&version=4.13'

The workaround of downloading the ISO with the version replaced in the URL works as expected.
Version-Release number of selected component (if applicable):
RHACM 2.10 OpenShift Container Platform 4.16.11
How reproducible:
Always at customer side
Steps to Reproduce:
1. Create a cluster with an InfraEnv with the "osImageVersion" set to 4.14 (or 4.13)
2. Update the cluster to the next OpenShift Container Platform version
3. Update the InfraEnv "osImageVersion" field with the new version (you may need to create the ClusterImageSet)
Actual results:
URLs in the "status" of the InfraEnv are not updated with the new version
Expected results:
URLs in the "status" of the InfraEnv are updated with the new version
Additional info:
* Discussion in Slack: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1726483803662359
Description of problem:
Using the 4.16.0/4.17.0 UI with ODF StorageClasses, it is not possible to:
- create an RWOP PVC
- create an RWOP clone
- restore to an RWOP PVC
Please see the attached screenshot. The RWOP access mode should be added to all the relevant screens in the UI.
Version-Release number of selected component (if applicable):
OCP 4.16.0 & 4.17.0 ODF (OpenShift Data Foundation) 4.16.0 & 4.17.0
How reproducible:
Steps to Reproduce:
1. Open UI, go to OperatorHub
2. Install ODF, once installed refresh for ConsolePlugin to get populated
3. Go to operand "StorageSystem" and create the CR using the custom UI (you can just keep on clicking "Next" with the default selected options, it will work well on AWS cluster)
5. Wait for "ocs-storagecluster-cephfs" and "ocs-storagecluster-ceph-rbd" StorageClasses to get created by ODF operator
6. Go to PVC creation page, try to create new PVC (using StorageClasses mentioned in step 5)
7. Try to create clone
8. Try to restore PVC to RWOP pvc from existing snapshot
Actual results:
It's not possible to create an RWOP PVC, to create an RWOP clone, or to restore to an RWOP PVC from a snapshot using the 4.16.0 & 4.17.0 UI.
Expected result:
It should be possible to create an RWOP PVC, to create an RWOP clone, and to restore from a snapshot to an RWOP PVC.
Additional info:
https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L111-L119 >> these need to be updated
Description of problem:
Spun out of https://issues.redhat.com/browse/OCPBUGS-38121, we noticed that there were logged requests against a non-existent certificatesigningrequests.v1beta1.certificates.k8s.io API in 4.17. These requests should not be logged if the API doesn't exist. See also slack discussion https://redhat-internal.slack.com/archives/C01CQA76KMX/p1724854657518169
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Orange complains about the following log errors, though they do not cause actual problems:
How reproducible:
every time agentinstalladmission starts
Steps to reproduce:
1. in a k8s cluster install infra operator
2. install the AgentServiceConfig CR:
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  ingress:
    className: nginx
    assistedServiceHostname: assisted-service.example.com
    imageServiceHostname: image-service.example.com
  databaseStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
  filesystemStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
  imageStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  osImages: [{"openshiftVersion":"4.17.0","cpuArchitecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/4.17.0-ec.3/rhcos-4.17.0-ec.3-x86_64-live.x86_64.iso","version":"4.17.0"}]
3. check the agentinstalladmission container log
Actual results:
some errors show up in the log
Expected results:
The log should be free of errors