Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Feature Overview
Insights Advisor for OpenShift is integrated within OpenShift Cluster Manager. This imposes limitations on adding new features and on sharing a codebase between RHEL Advisor and the OCM Insights Advisor tab. Insights Advisor for OpenShift lacks certain features from the RHEL UI; the codebase is not a 1:1 clone.
As a customer of Insights, I will have the same or very similar user experience with Insights for OpenShift and Insights for RHEL. The workflows will share the main concepts, the UI elements will be the same, and features introduced to Advisor will automatically be considered for all supported platforms.
As an OpenShift user, I will still see integrations of Insights Advisor within OpenShift Cluster Manager that show aggregated information for the customer account and a single-cluster view of Advisor data. These integrations will point to the new Insights Advisor for OpenShift app, which will be tightly integrated into OpenShift Cluster Manager.
Goals
Requirements
Benefits
Questions to answer...
Out of Scope
Background, and strategic fit
Documentation Considerations
OCP WebConsole, in the main dashboard, has an Insights Advisor widget, which has been redirecting users to OCM. Due to the Insights Advisor tab decommission in OCM, the links should point to Advisor instead.
4.10 code freeze = 28 January (marking the task as urgent)
Today, all individual configuration, for example routing configuration, is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they use to notify teams in case of an issue, then someone needs to file a request with an admin to add the required settings.
That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.
We would like to introduce a more self-service approach whereby individual teams can create their own configuration for their needs without the administrator's involvement.
Last but not least, since Monitoring is deployed as a Core service of OpenShift there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction is the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. They can't give access to the central managed secret due to security concerns so that users can add their own routing information.
Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.
Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.
Team A wants to send all their important notifications to a specific Slack channel.
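For illustration, a hedged sketch of what such a team-owned routing configuration could look like using the AlertmanagerConfig CRD from the Prometheus Operator; the namespace, secret, channel, and matcher values below are assumptions, not part of the original request:
~~~yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-a-notifications        # hypothetical name
  namespace: team-a                 # created by the team in its own namespace, no admin involved
spec:
  route:
    receiver: team-a-slack
    matchers:
      - name: severity              # assumed matcher; teams would pick their own
        value: critical
        matchType: "="
  receivers:
    - name: team-a-slack
      slackConfigs:
        - channel: "#team-a-alerts"           # hypothetical channel
          apiURL:
            name: team-a-slack-webhook        # hypothetical Secret holding the webhook URL
            key: url
~~~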
As described in https://github.com/openshift/enhancements/blob/ba3dc219eecc7799f8216e1d0234fd846522e88f/enhancements/monitoring/multi-tenant-alerting.md#distinction-between-platform-and-user-alerts, cluster admins want to distinguish platform alerts from user alerts. For this purpose, CMO should provision an external label (openshift_io_alert_source="platform") on prometheus-k8s instances.
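As a rough sketch of the intended outcome (assuming CMO applies it through the Prometheus custom resource it manages; only the externalLabels stanza matters here):
~~~yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: openshift-monitoring
spec:
  externalLabels:
    openshift_io_alert_source: platform   # lets cluster admins tell platform alerts from user alerts
~~~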
Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.
Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.
See Operators & STS slide deck.
The CloudCredentialOperator already provides a powerful API for OpenShift's cluster core operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to non-existent depending on the operator in question and is seen as an adoption blocker for OpenShift on AWS.
This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.
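As a hedged illustration of the end state (role ARN, names, and namespace are placeholders), an operator integrating with this flow would consume an AWS shared-credentials file that references a projected web identity token instead of static keys, for example delivered as a Secret:
~~~yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-operator-aws-credentials      # hypothetical
  namespace: my-operator                 # hypothetical
stringData:
  credentials: |
    [default]
    sts_regional_endpoints = regional
    role_arn = arn:aws:iam::123456789012:role/my-operator-role   # IAM role pre-created outside OpenShift
    web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
~~~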
This Section: High-Level description of the Market Problem ie: Executive Summary
This Section: Articulates and defines the value proposition from a users point of view
This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.
As an engineer I want the capability to implement CI test cases that run at different intervals (daily, weekly, and so on) so as to ensure that downstream operators that depend on certain capabilities are not negatively impacted if the systems CCO interacts with change behavior.
Acceptance Criteria:
Create a stubbed-out e2e test path in CCO and matching e2e calling code in the release repo such that there exists a path to tests that verify a working AWS STS workflow.
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces, to prevent the need for cluster admins to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.
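A minimal sketch of how the sharing could look with the Shared Resource CSI driver (all names and the source namespace are assumptions; the driver must be enabled via the tech preview feature set):
~~~yaml
apiVersion: sharedresource.openshift.io/v1alpha1
kind: SharedSecret
metadata:
  name: shared-etc-pki-entitlement          # hypothetical, cluster-scoped
spec:
  secretRef:
    name: etc-pki-entitlement               # the entitlement certificates Secret
    namespace: openshift-config-managed     # assumed location of that Secret
---
apiVersion: v1
kind: Pod
metadata:
  name: entitled-workload                   # hypothetical consumer in another namespace
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi
      volumeMounts:
        - name: entitlement
          mountPath: /etc/pki/entitlement
  volumes:
    - name: entitlement
      csi:
        driver: csi.sharedresource.openshift.io
        readOnly: true
        volumeAttributes:
          sharedSecret: shared-etc-pki-entitlement   # the pod's service account also needs RBAC to "use" it
~~~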
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer using OpenShift
I want to mount a Simple Content Access certificate into my build
So that I can access RHEL content within a Docker strategy build.
As an application developer or administrator
I want to share credentials across namespaces
So that I don't need to copy credentials to every namespace
As a cluster admin
I want the cluster storage operator to install the shared resources CSI driver
So that I can test the shared resources CSI driver on my cluster
Docs will need to identify how to install the shared resources CSI driver (by enabling the tech preview feature set)
Tasks:
Note that to be able to test all of this on any cloud provider, we need STOR-616 to be implemented. We can work around this by making the CSI driver installable on AWS or GCP for testing purposes.
The cluster storage operator has cluster-admin permissions. However, no other CSI driver managed by the operator includes a CRD for its API.
As an OpenShift engineer
I want to know which clusters are using the Shared Resource CSI Driver
So that I can be proactive in supporting customers who are using this tech preview feature
None - metrics exported to telemetry are not formally documented.
QE can verify that the query/recording rule for cluster monitoring operator returns data if the cluster has the Shared Resource CSI driver installed and utilizes a SharedSecret or SharedConfigMap in a pod/workload.
Insights rules can potentially be created off of these exported metrics. This would allow CEE to identify which clusters are using SharedSecrets or SharedConfigMaps, especially if we are exporting mount failure metrics.
To implement, a prometheus query/recording rule needs to be added to the cluster monitoring operator. Once approved by the monitoring team, the metric data will be available on DataHub once 4.10 clusters are installed with the updated version of the monitoring operator.
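A hedged sketch of what such a recording rule could look like; the rule and metric names here are hypothetical placeholders, not the ones the monitoring team approved:
~~~yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: shared-resource-csi-usage               # hypothetical
  namespace: openshift-monitoring
spec:
  groups:
    - name: shared-resource-csi-driver.rules
      rules:
        - record: cluster:openshift_csi_share_mounts:sum        # hypothetical recording rule
          expr: sum(openshift_csi_share_mount_requests_total)   # hypothetical source metric
~~~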
Upstream Kubernetes is following other SIGs by moving its in-tree cloud providers to an out-of-tree plugin format, the Cloud Controller Manager, at some point in a future Kubernetes release. OpenShift needs to be ready to action this change.
Bring together all the cloud controller managers (AWS, GCP, Azure), complete testing and prepare for final GA
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Initial work was started there: https://github.com/lobziik/cluster-cloud-controller-manager-operator/pull/1/files
Need to isolate provider-specific code in respective packages and introduce an interface to leverage it (regular and bootstrap manifest rendering should be there atm)
DoD:
This Feature is a general "catch all" for the time being. There are a number of existing priorities from Q1 that should be aligned with existing priorities below but if not, assign to this feature as needed.
In order to get a better overall portfolio view, we'll leverage this Feature to gather work that doesn't fall into other existing priorities on this board. As this list grows, the portfolio priority grooming team will look to split out or handle appropriately.
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
requirement | Notes | isMvp |
---|---|---|
< How will the user interact with this feature? >
< Which users will use this and when will they use it? >
< Is this feature used as part of current user interface? >
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
<What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>
<If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
Question | Outcome |
Console provides support UI for operators, which is dynamically enabled when the operator is installed by using feature flags against the presence of CRDs. However, operators have their own release cadence, separate from OpenShift, which makes aligning the UI to the API difficult. As new features are released for the operator, the UI becomes out of sync with the APIs and customers must wait until the following OpenShift release to get any new UI.
Console extensions:
https://docs.google.com/document/d/1HW5_cl6cOX5P14PQN-1_8c60o9dMY6HbFDRftH6aTno/edit
Dynamic Plugins:
https://docs.google.com/document/d/19BAFo_8BtMZVvKsU-bE61bZpSydeYONkCMWntMU9NgE/edit
Enhancement proposal:
https://github.com/openshift/enhancements/pull/441
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts, and then follows up by placing the items directly in planning and prioritizing them against PM feature requests, would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.
The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:
Requirement | Notes | isMvp? |
---|---|---|
UI to enable and disable plugins | YES | |
Dynamic Plugin Framework in place | YES | |
Testing Infra up and running | YES | |
Docs and read me for creating and testing Plugins | YES | |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Documentation Considerations
Questions to be addressed:
Currently, webpack tree-shakes PatternFly and only includes the components used by console in its vendor bundle. We need to expose all of the core PatternFly components for use in dynamic plugins, which means we have to disable tree shaking for PatternFly. We should expose this as a separate bundle. This will allow browsers to cache more efficiently and only need to load the PF bundle again when we upgrade PatternFly.
Open Questions
What parts of PatternFly do we consider core?
Acceptance Criteria
Requirement | Notes | isMvp? |
---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
As a user, I want the ability to run a pod in debug mode.
This should be the equivalent of running: oc debug pod
Acceptance Criteria for MVP
Assets
Designs (WIP): https://docs.google.com/document/d/1b2n9Ox4xDNJ6AkVsQkXc5HyG8DXJIzU8tF6IsJCiowo/edit#
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Feature Overview
Enable customers to access Google services from workloads on OpenShift clusters using Google Workload Identity (aka WIF)
https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Dependencies (internal and external)
We need to ensure the following things in the OpenShift operators
1) Make sure the operator uses v0.0.0-20210218202405-ba52d332ba99 or a later version of the golang.org/x/oauth2 module
2) Mount the OIDC token in the operator pod; this needs to go in the deployment. We have done it for cluster-image-registry-operator here
3) For workload identity to work, the GCP credentials that the operator pod uses should be of external_account type (not service_account). The external_account credentials type carries the path to the OIDC token and the URL of the service account to impersonate, along with other details. These credentials can be generated from the GCP console or programmatically (supported by ccoctl). The operator pod can then consume them from a kube secret. Make the appropriate code changes to the operators so that they can consume these new credentials (see the hedged sketch below)
Following repos need one or more of above changes
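For reference, a hedged sketch of the external_account credentials an operator pod could consume from a kube secret (project number, pool/provider IDs, service account, and key name are placeholders; the token path matches the mount described in point 2 above):
~~~yaml
apiVersion: v1
kind: Secret
metadata:
  name: gcp-wif-credentials                 # hypothetical
  namespace: openshift-image-registry       # example consumer
stringData:
  service_account.json: |                   # key name is an assumption
    {
      "type": "external_account",
      "audience": "//iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/my-pool/providers/my-provider",
      "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
      "token_url": "https://sts.googleapis.com/v1/token",
      "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/my-operator@my-project.iam.gserviceaccount.com:generateAccessToken",
      "credential_source": {
        "file": "/var/run/secrets/openshift/serviceaccount/token",
        "format": { "type": "text" }
      }
    }
~~~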
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Update console from Cypress 6.0.0 to 8.5.0. Changes that impact us:
https://docs.cypress.io/guides/references/migration-guide#Migrating-to-Cypress-8-0
Update webpack to the latest 4.x and update webpack loaders. This will help prepare us to move to webpack 5.
As an adopter of the @openshift-console/dynamic-plugin-sdk I want to easily integrate into my development pipeline so that I can extend the OCP console.
Trying to pull in the dynamic-plugin-sdk into ACM is proving to be problematic. We would have to move to older dependencies. Integrating with webpack and typescript requires a very specific setup.
The dynamic-plugin-sdk has only really been used internally by OCP and is strongly tied to the setup and dependencies of OCP. For the dynamic-plugin-sdk to be externally consumable by adopters, it should be as easy to use as other webpack plugins such as HtmlWebpackPlugin or CompressionPlugin.
The console has many instances of old variables, $grid-float-breakpoint and $grid-gutter-width, controlling margins/padding and responsive breakpoints throughout the Admin and Dev Console. These do not provide spacing and behaviors consistent with PatternFly components, which use their own variables: $pf-global-gutter-md, $pf-global-gutter, and $pf-global-breakpoint-{size}. By replacing these, the intent is to bring the console closer to a pure PatternFly structure and behavior, requiring fewer overrides and customizations.
In the image-registry, we have the packages origin-common and kubernetes-common. The problem is that this code doesn't get updates. We can replace them with the better-supported library-go.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer using Jenkins to build my application
I want to use the base Jenkins agent image as a sidecar in my PodTemplate
So that I can use any s2i builder image in my Jenkins pipelines
QE will need to verify that the new pod templates can successfully execute a JenkinsPipeline build.
Documentation needs to be updated to explain how to use the new template.
Unclear if we need new CEE/PX materials beyond doc updates.
We currently have built-in pod templates for NodeJS and Maven, which use specialized agent images that include NodeJS/Maven.
Blog post here outlines the process: https://developers.redhat.com/blog/2020/06/04/an-easier-way-to-create-custom-jenkins-containers/
The Groovy style of declaring in-line pod templates is deprecated in favor of a YAML-style format.
Existing documentation for the Jenkins pod templates: https://docs.openshift.com/container-platform/4.9/openshift_images/using_images/images-other-jenkins.html#images-other-jenkins-config-kubernetes_images-other-jenkins
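For orientation, a hedged sketch of the YAML-style sidecar pod template this epic aims at (images and labels are placeholders): the base Jenkins agent runs as the jnlp container while an arbitrary s2i builder image runs alongside it.
~~~yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    jenkins-agent: nodejs-sidecar       # hypothetical label used to select the template
spec:
  containers:
    - name: jnlp                        # base Jenkins agent image handles the controller connection
      image: image-registry.openshift-image-registry.svc:5000/openshift/jenkins-agent-base:latest
      args: ["$(JENKINS_SECRET)", "$(JENKINS_NAME)"]
    - name: nodejs                      # any s2i builder image can be used here
      image: image-registry.openshift-image-registry.svc:5000/openshift/nodejs:latest
      command: ["cat"]                  # keep the sidecar alive so pipeline steps can exec into it
      tty: true
~~~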
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
After investigating a complex Bugzilla involving many applications making queries to prometheus-adapter, we've noticed that we were lacking insights on the requests made to prometheus-adapter. To have such information for an aggregated API, the best would be to have audit logs for prometheus-adapter. This wasn't configurable before, but with https://github.com/kubernetes-sigs/custom-metrics-apiserver/pull/92, upstream users should now be able to configure it.
Since this would greatly help in investigating prometheus-adapter Bugzilla in the future, it would be great if we allowed OpenShift users to configure the audit logs so that they could provide them to us.
Note for the assignee: as of the time of the creation of this ticket, the upstream PR hasn't been merged in custom-metrics-apiserver and thus wasn't synced into prometheus-adapter, so we will have to wait a bit before starting to look into this ticket.
DoD:
The console needs to know the network type capabilities to show/hide some Network Policy form fields.
As a result of https://issues.redhat.com/browse/NETOBSERV-27, this logic is implemented as a features document inside the console code. The console fetches the network type from the network operator and checks the supported features against this document.
However, this limits the feature to admin users, as other logged-in users do not have permissions to fetch the network type.
This task aims to modify the current Cluster Network Operator to expose the network capabilities as an `sdn-public` Config Map, writeable only by the SDN, readable by any `system:authenticated` user.
Enhancement Proposal PR: https://github.com/openshift/enhancements/pull/875
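A purely illustrative sketch of such a ConfigMap (namespace, name, and keys are hypothetical; the actual schema is defined in the enhancement proposal above):
~~~yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sdn-public                          # readable by any system:authenticated user, writeable only by the SDN
  namespace: openshift-network-operator     # assumed namespace
data:
  networkType: OVNKubernetes
  policyEgress: "true"                      # hypothetical capability flags consumed by the console
  policyPeerIPBlockExceptions: "true"
~~~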
We want to configure 'default' and 'allowed' values in the validation webhook for the Guest Accelerators field in GCPProviderSpec. Also revendor it to include the newly added Guest Accelerators field.
This can be done after https://github.com/openshift/cluster-api-provider-gcp/pull/172 is merged.
DoD:
Description:
OpenShift on RHV is composed of the following subprojects that the team maintains:
Each of those projects currently uses the generated oVirt API project go-ovirt.
This leads to a number of issues:
Then came go-ovirt-client, go-ovirt-client-log, go-ovirt-client-log-klog and k8sOVirtCredentialsMonitor to the rescue!
The go-ovirt-client is a wrapper around go-ovirt which contains all the error handling/retry logic/logs/tests needed to provide a decent user experience and an easy-to-use API to the oVirt engine.
go-ovirt-client-log is a library to unify the logging logic between the projects; it is used by go-ovirt-client and should be used by all the sub-projects.
go-ovirt-client-log-klog is a companion library to go-ovirt-client-log enabling logging via the Kubernetes "klog" facility.
k8sOVirtCredentialsMonitor is a utility for monitoring the oVirt credentials secret, which will automatically update the oVirt credentials if they are changed.
We aim to move all projects which are using the go-ovirt to use go-ovirt-client, go-ovirt-client-log and k8sOVirtCredentialsMonitor instead.
Benefits for the eng:
Benefits for the customers:
Acceptance criteria:
How to test:
Description:
Acceptance:
ovirt-csi-driver uses go-ovirt-client for 95% of all oVirt-related logic.
T-shirt size: M
Provide an easy and successful experience for front end developers to build and deploy their applications
Currently, the front-end dev experience is not positive. It's much easier for them to use other platforms. Improving the front-end dev experience will enable us to gain more market share.
Although we provide the ability for 2 & 3 today, the current journey does not match the mental model of the front-end developer
Desired UX experience
As a user, I want to have the option to add additional labels to a Route, as I could do in OCP 3. See RFE-622
The additional labels should only be added to the Route, not the Service or other components. The advanced option "Labels" should not be touched; those labels are added to all components.
As a small addition, we should also always show the "Target port", since it also defines the Service port. To make this clearer, the "Target port" should be shown before the "Create a route to the Application" checkbox.
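A hedged sketch of the expected outcome, assuming a user adds one extra label via the new field: only the Route carries it, while labels from the existing advanced "Labels" option keep going to all components.
~~~yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: my-app                          # hypothetical
  labels:
    app.kubernetes.io/part-of: my-app   # shared label from the existing "Labels" option
    router: external                    # additional label applied only to the Route (RFE-622)
spec:
  to:
    kind: Service
    name: my-app
  port:
    targetPort: 8080-tcp                # the "Target port" shown before the route checkbox
~~~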
The following changes should be applied to the Import flow (from Git, from Container, ...) and to the Edit page as well:
This epic is mainly focused on the 4.10 Release QE activities
1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
4. Work closely with dev team for epic automation
5. Create the automation scripts using cypress
6. Implement CI for nightly builds
7. Execute scripts on sprint basis
To track the QE progress in one place on the 4.10 Release Confluence page
There are different code spots which map the old action items "From Git", "From Dockerfile" and "From Devfile" to the new action "Import from Git".
We should avoid mapping different strings to the new version and instead update our tests so that the feature and page object files match the latest frontend code.
Code areas I found are marked with
// TODO (ODC-6455): Tests should use latest UI labels like "Import from Git" instead of mapping strings
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Please read: migrating-protractor-tests-to-cypress
Protractor test to migrate: `frontend/integration-tests/tests/oauth.scenario.ts`
Large but straightforward
47) OAuth
48) BasicAuth IDP
  ✔ creates a Basic Authentication IDP
  ✔ shows the BasicAuth IDP on the OAuth settings page
49) GitHub IDP
  ✔ creates a GitHub IDP
  ✔ shows the GitHub IDP on the OAuth settings page
50) GitLab IDP
  ✔ creates a GitLab IDP
  ✔ shows the GitLab IDP on the OAuth settings page
51) Google IDP
  ✔ creates a Google IDP
  ✔ shows the Google IDP on the OAuth settings page
52) Keystone IDP
  ✔ creates a Keystone IDP
  ✔ shows the Keystone IDP on the OAuth settings page
53) LDAP IDP
  ✔ creates a LDAP IDP
  ✔ shows the LDAP IDP on the OAuth settings page
54) OpenID IDP
  ✔ creates a OpenID IDP
  ✔ shows the OpenID IDP on the OAuth settings page
Acceptance Criteria
As a follow up to OCPCLOUD-693, we need to, once all of the API definitions are present in openshift/api, migrate the existing code bases to use the new API locations.
This will include:
Complete all the 4.9 epic features automation user stories and merge them to the master branch.
4.9 epics automation completion
Tech debt should be completed
Create the PRs for 4.9 epic user stories automation
Review them
Merge them to the 4.10 master branch and 4.9 master branch
As a user, I want to store my delivery pipelines in a Git repository as the source of truth and execute the pipeline on OpenShift on Git events, so that I can version and trace changes to the delivery pipelines in Git.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
This is a clone of issue OCPBUGS-10943. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10661. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10591. The following is the description of the original issue:
—
Description of problem:
Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag due to a missing or invalid VERSION_OVERRIDE (https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed to the build. This is resulting in all jobs invoked by the 4.12 nightlies failing to install.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-03-13-172313 and later
How reproducible:
consistently in 4.12 nightlies only (CI builds do not seem to be impacted).
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log
Description of problem:
After upgrading the cluster to version 4.10.32, the web console and metrics tab do not display any data. We have to restart the Thanos pods to get the functionality back to a running state.
Version-Release number of selected component (if applicable):
4.10.32
How reproducible:
NA
Steps to Reproduce:
NA
Actual results:
The data is not getting displayed and monitoring of the cluster is hampered
Expected results:
The data should be displayed properly without having to restart the pods
Additional info:
This is a clone of issue OCPBUGS-3300. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3265. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3172. The following is the description of the original issue:
—
Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".
We shouldn't block operator installation if permission issues prevent dynamic plugin installation.
This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a pared-down permission set called "dedicated-admin".
See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD
This is a clone of issue OCPBUGS-7510. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7373. The following is the description of the original issue:
—
Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000
Under some circumstances the static pod machinery fails to populate the node status in time to generate the correct env variables for ETCD_URL_HOST, ETCD_NAME etc. The pods that come up will fail to accept those variables.
This is particularly pronounced in SNO topologies, leading to installation failures.
The fix is to fail fast in the targetconfig/envvar controller to ensure the CEO goes degraded instead of silently failing on the rollout of an invalid static pod.
The static authorizer feature has landed in upstream kube-rbac-proxy. Let's use it by configuring a static authorizer for all requests that hit a /metrics endpoint.
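A hedged sketch of the kind of configuration this implies, passed to kube-rbac-proxy via its config file; the requesting user below is an assumption and the exact field names should be checked against the kube-rbac-proxy release in use:
~~~yaml
authorization:
  static:
    - user:
        name: system:serviceaccount:openshift-monitoring:prometheus-k8s   # assumed scraper identity
      verb: get
      path: /metrics
      resourceRequest: false    # authorize the non-resource /metrics request without a SubjectAccessReview
~~~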
DoD:
+++ This bug was initially created as a clone of Bug #2117423 +++
Description of problem:
Backport https://github.com/openshift/kubernetes/pull/1295 to 4.10
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Description of problem:
When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update. However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.
Version-Release number of selected component (if applicable):
4.10.24
How reproducible:
Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.
Steps to Reproduce:
1. 2. 3.
Actual results:
Port still exists in OVN, IP remains allocated for a deleted pod.
Expected results:
IP should be freed, port should be removed from OVN.
Additional info:
Similar bug for completed pods with "err: the range is full" fixed in 4.10.21 https://bugzilla.redhat.com/show_bug.cgi?id=2091157
This is a clone of issue OCPBUGS-268. The following is the description of the original issue:
—
The Linux kernel was updated (https://lkml.org/lkml/2020/3/20/1030) to include steal accounting.
This would greatly assist in troubleshooting vSphere performance issues caused by over-provisioned ESXi hosts.
This is a clone of issue OCPBUGS-723. The following is the description of the original issue:
—
Description of problem:
I have a customer who created a clusterquota for one of the namespaces; it got created, but the values were not reflected under limits and the namespace details were not displayed.
~~~
$ oc describe AppliedClusterResourceQuota
Name: test-clusterquota
Created: 19 minutes ago
Labels: size=custom
Annotations: <none>
Namespace Selector: []
Label Selector:
AnnotationSelector: map[openshift.io/requester:system:serviceaccount:application-service-accounts:test-sa]
Scopes: NotTerminating
Resource Used Hard
-------- ---- ----
~~~
WORKAROUND: They recreated the clusterquota object (cache it off, delete it, create new) after which it displayed values as expected.
In the past, they saw similar behavior on their test cluster; there it was heavily utilized, the etcd DB was much larger in size (>2.5Gi), and it had many more objects (at that time, helm secrets were being cached for all deployments, keeping a history of 10, so etcd was being bombarded).
On this cluster the same "symptom" was noticed; however, etcd was nowhere near that size, nor were there nearly as many etcd objects and/or cached helm secrets.
Version-Release number of selected component (if applicable): OCP 4.9
How reproducible: Occurred only twice(once in test and in current cluster)
Steps to Reproduce:
1. Create ClusterQuota
2. Check AppliedClusterResourceQuota
3. The values and namespace is empty
Actual results: ClusterQuota is not displaying the values
Expected results: ClusterQuota should display the values
This is a clone of issue OCPBUGS-4945. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4805. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4101. The following is the description of the original issue:
—
Description of problem:
We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.
One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby /etc/node-sizing.env on its master nodes contained an empty SYSTEM_RESERVED_ES value:
---
cat /etc/node-sizing.env
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---
causing the kubelet to not start up. To restore service, this file was manually updated to set a value (1Gi), and kubelet was restarted. We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.
A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. It experienced an issue whereby worker nodes were impacted by a similar problem, however this was because of a custom node-sizing-enabled.env MachineConfig which did not set SYSTEM_RESERVED_ES. This caused existing worker nodes to go into a NotReady state after the upgrade, and additionally new nodes did not join the cluster as their kubelet would become impacted.
For clusterB, the conditions that led to the empty value are better known. However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start. We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?
Version-Release number of selected component (if applicable):
4.11.17
How reproducible:
Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17
Expected results:
If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env then a default should be applied and/or the kubelet should be able to continue running.
Additional info:
Description of problem:
When creating an incomplete ClusterServiceVersion resource, the OLM details page crashes (on 4.11).
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: minimal-csv
  namespace: christoph
spec:
  apiservicedefinitions:
    owned:
      - group: A
        kind: A
        name: A
        version: v1
  customresourcedefinitions:
    owned:
      - kind: B
        name: B
        version: v1
  displayName: My minimal CSV
  install:
    strategy: ''
Version-Release number of selected component (if applicable):
Crashes on 4.8-4.11, works fine from 4.12 onwards.
How reproducible:
Always
Steps to Reproduce:
1. Apply the ClusterServiceVersion YAML from above
2. Open the Admin perspective > Installed Operator > Operator detail page
Actual results:
Details page crashes on tab A and B.
Expected results:
Page should not crash
Additional info:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2084287
This is a clone of issue OCPBUGS-515. The following is the description of the original issue:
—
When a thin provisioned COW format disk is created on OCP on RHV via CSI driver (a PVC -
https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml
But this is a thin provisioned disk, so the initial size of the disk should be the engine default and then grow as needed; it shouldn't be this big.
This causes all the disks created this way to be functionally preallocated (since it eats all that space), which is a real waste of space.
How reproducible: 100%
Steps to Reproduce:
1. Create a storage claim (PVC) in Openshift (https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml) using the default storage class (or any other storage class with thinProvisioning: "true") and with requested storage i.e. 100Gi
$ oc create -f storage-claim.yaml
2. In the RHV web console navigate to Storage -> Disks and check Virtual size and Actual size of the created disk (PVC)
Actual results:
Disk from our example with requested storage 100GB reports virtual size 100GB and actual size 110 GB.
Expected results:
Thin provisioned disks should start with small initial size and then grow as needed, so its actual size should be considerably smaller (the default initial size set by the engine should be 2.5 GB if I'm not mistaken).
Note: The extra 10GB in the actual size are caused by overhead for the qcow2 disk format, which is 10%, and this was tracked here as a separate issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2097139
This is a clone of issue OCPBUGS-7127. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5016. The following is the description of the original issue:
—
Description of problem:
When editing any pipeline in the openshift console, the correct content cannot be obtained (the obtained information is the initial information).
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel -> Actions -> Edit Pipeline -> YAML view
Actual results:
displayed content is incorrect.
Expected results:
Get the content of the current pipeline, not the "pipeline create" content.
Additional info:
If you cancel or save in the "Pipeline Builder" interface after "Edit Pipeline", you can get the expected content: Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel -> Actions -> Edit Pipeline -> YAML view displays the resource content normally.
Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginx bug report). In such cases, the Installer errors with:
level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []
Listing container objects suffers from the same issue as listing the containers, and this one isn't fixed in the latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and am fixing it with https://github.com/gophercloud/gophercloud/issues/2510; however, we likely won't be able to backport the bump to gophercloud master back to release-4.8, so we'll have to look for alternatives.
I'm setting the priority to critical as it's causing all our jobs to fail in master.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-5947. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4833. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4819. The following is the description of the original issue:
—
Description of problem:
s2i/run script has a bug - /usr/libexec/s2i/run: line 578: [: too many arguments
Version-Release number of selected component (if applicable):
v4.10
How reproducible:
Start jenkins from the container image using /usr/libexec/s2i/run while having a route that contains a certificate or key that includes special characters.
Steps to Reproduce:
1. Create a route that contains a TLS certificate
2. Start a pod using openshift4/ose-jenkins:v4.10.0
3. View the log
Actual results:
2022/12/12 17:30:33 [go-init] No pre-start command defined, skip
2022/12/12 17:30:33 [go-init] Main command launched : /usr/libexec/s2i/run
CONTAINER_MEMORY_IN_MB='12288', using /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/java and /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/javac
Administrative monitors that contact the update center will remain active
Migrating slave image configuration to current version tag ...
/usr/libexec/s2i/run: line 578: [: too many arguments
Using JENKINS_SERVICE_NAME=jenkins
Generating jenkins.model.JenkinsLocationConfiguration.xml using (/var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml.tpl) ...
Jenkins URL set to: https://bojenkinsdev.micron.com/ in file: /var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml
+ exec java -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+ParallelRefProcEnabled -XX:+UseG1GC -XX:+UseStringDeduplication -XX:HeapDumpPath=/var/log/jenkins/ '-Xlog:gc*=debug:file=/var/log/jenkins-engserv/gc-%t.log:utctime:filecount=2,filesize=100m' -Xms2g -Xmx8g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Djavax.net.ssl.trustStore=/var/lib/jenkins/ca-anchors-keystore -Dcom.redhat.fips=false -Djdk.http.auth.tunneling.disabledSchemes= -Djdk.http.auth.proxying.disabledSchemes= -Duser.home=/var/lib/jenkins -Djavamelody.application-name=jenkins -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=false -XX:+AlwaysPreTouch -XX:ErrorFile=/var/log/jenkins-engserv -Dhudson.model.ParametersAction.keepUndefinedParameters=false -jar /usr/lib/jenkins/jenkins.war --prefix=/je...
Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true
Expected results:
not have the following error: /usr/libexec/s2i/run: line 578: [: too many arguments
Additional info:
This is a clone of issue OCPBUGS-10225. The following is the description of the original issue:
—
Description of problem:
Pipeline Repository (Pipeline-as-code) list never shows an Event type.
Version-Release number of selected component (if applicable):
4.9+
How reproducible:
Always
Steps to Reproduce:
Actual results:
Pipeline Repository list shows a column Event type but no value.
Expected results:
Pipeline Repository list should show the Event type from the matching Pipeline Run.
Similar to the Pipeline Run Details page based on the label.
Additional info:
The list page packages/pipelines-plugin/src/components/repository/list-page/RepositoryRow.tsx renders obj.metadata.namespace as event type.
I believe we should show the Pipeline Run event type instead. packages/pipelines-plugin/src/components/repository/RepositoryLinkList.tsx uses
{plrLabels[RepositoryLabels[RepositoryFields.EVENT_TYPE]]} to render it.
Also, the Pipeline Repository details page tries to render the Branch and Event type from the Repository resource. My research says these properties don't exist on the Repository resource. The code should be removed from the Repository details page.
Description of problem:
Customer is facing issue similar to https://github.com/devfile/api/issues/897
Version-Release number of selected component (if applicable):
OCP 4.10.17
How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Tried working around it with ALL_PROXY but it did not help. Note: because the console operator reverts changes pretty quickly, testing this was a bit of a PITA.
The OWNERS file for multiple branches in the openshift/jenkins repository need to be updated to reflect current team members for approvals.
Description of problem:
Unit-tests flaking on 4.10 PRs
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Unit-test job fails with the following error:
[Fail] OVN Pod Operations during execution [It] should not deallocate in-use and previously freed completed pods IP
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/pods_test.go:560
Ran 153 of 153 Specs in 215.549 seconds
FAIL! -- 152 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestClusterNode (216.05s)
FAIL github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn 228.843s
{"component":"entrypoint","error":"wrapped process failed: exit status 2","file":"k8s.io/test-infra/prow/entrypoint/run.go:79","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-12-17T0
Expected results:
All tests pass
Additional info:
This is a clone of issue OCPBUGS-4410. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4311. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4305. The following is the description of the original issue:
—
Description of problem:
Please add an option to DISABLE debug in ironic-api. Presently it is enabled by default and there is no way to disable it or reduce the log level
https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
Version-Release number of selected component (if applicable): none
How reproducible: Every time
Steps to Reproduce:
Please check source code here: https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
It is enabled by default and there is no way to disable it or reduce log level
Actual results:
Please check Case: 03371411, the log file grew to 409 GB
Expected results: Need a way to disable debug
Additional info: Case 03371411. A cluster must gather and log file can be found in the case.
This is a clone of issue OCPBUGS-5078. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5019. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4941. The following is the description of the original issue:
—
Description of problem: This is a follow-up to OCPBUGS-3933.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-613. The following is the description of the original issue:
—
The path used by --rotated-pod-logs to gather the rotated pod logs from /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods but not for static pods.
The main problem is that, while normal pods have their rotated logs at this /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME} because the UID cannot be known at the time that the static pod is born (because static pods are created by kubelet before registering them in the kube-apiserver, and UID is assigned by the kube-apiserver).
The visible results of that are:
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always if there are static pods.
Steps to Reproduce:
1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).
Actual results:
error: errors occurred while gathering data: one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd [one or more errors occurred while gathering container data for pod etcd-master-0.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net: the server could not find the requested resource]
Expected results:
No errors like the ones above and rotated pod logs to be gathered, if present.
Despite being marked as experimental, this --rotated-pod-logs is used in must-gather, so this issue can be easily reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.
This is a manual clone of https://bugzilla.redhat.com/show_bug.cgi?id=2093597 to backport this to 4.10.
Description of problem:
When importing a component from git or from a container image, and opening one or more advanced options, the sentence "Click on the names to access advanced options for ..." is split into two parts. And the headlines have no padding and everything looks squashed.
Version-Release number of selected component (if applicable):
4.10+
How reproducible:
Always
Steps to Reproduce:
1. Switch to dev perspective
2. Navigate to the add page > Import from container
3. Scroll down and open one or more of the advanced options
Actual results:
1. The sentence "Click on the names to access advanced options for ..." is shown before the opened option. The other available options are shown below the selected option.
2. The headline is displayed directly below "Click on the names to access advanced options for"
3. Another section is also shown directly under the first one.
Expected results:
1. The sentence "Click on the names to access advanced options for ..." and the options should be "one sentence" again.
2+3. Some padding for the header and/or between the sections, similar to 4.9. It must not look exactly as in 4.9, but there should be some padding between independent sections.
Additional info:
none
Description of problem:
Network policy code has some problems, most of them are races, therefore it can be difficult to reproduce and verify. Here is the list:
1. all kinds of add/delete port to/from default deny port group failures, possible symptoms:
   - port should've been added to default deny port group, but wasn't: connections that should've been dropped are allowed
   - port should've been deleted from default deny port group, but wasn't: connections that should be allowed are dropped
   - db ops failures when an attempt to add/delete port to/from default deny port group fails, e.g. because this operation already was done
2. default deny port group was overwritten when 2 network policies are created in a namespace at the same time. Can lead to ports not being added to the default deny port group => denied connections will be allowed
3. handle error when getting local pod from the cache fails, possible symptoms:
   - "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy" log message
   - pod is not added to netpol port groups, network policy is not applied
4. creating deleted namespace via ensureNamespaceLocked, symptoms:
   - namespace was deleted, but address set is present in the db
5. policy acl loglevel update wasn't applied, possible symptoms:
   - netpol acl log level isn't set/updated to namespace loglevel
6. netpol cleanup failures, symptoms:
   - network policy failed to be deleted, something is still left in the db, error messages like "failed to destroy network policy" or "Rollback of default port groups and acls for policy: %s/%s failed, Unable to ensure namespace for network policy"
7. concurrent write to sets.String - this will panic, you won't miss
8. retry for network policy handler after network policy was deleted, you should see failures saying that some network policy related object is nil or doesn't exist, e.g. "peer AddressSet is nil, cannot add <object>"
9. host network and completed pods selected by network policy can produce error logs, no real harm: "Failed to get LSP for pod <namespace>/<name> for networkPolicy %s refetching err"
10. namespace pod handlers are never stopped, can affect memory usage and look like a memory leak
11. add local pod failure, since netpol port group is not committed to db yet, error looks like "Failed to create *factory.localPodSelector <name>, error: object not found"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Example 1
1. Create a network policy with an [in/e]gress selector that applies to a namespace labeled project: myproject
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          project: myproject
2. Use oc apply to delete the network policy and create a pod in a project: myproject namespace at the same time
3. Check ovnkube-master logs for "peer AddressSet is nil, cannot add peer pod(s)"; this should retry with the same error 15 times
4. This may not work on the first try, since we need to hit a specific order of network policy delete and pod add handling
5. With the new version no error messages should be present
Example 2
1. Create a network policy that applies to a namespace test
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
2. Create a host network pod in namespace test
3. Check 15 logs saying "Failed to get LSP for pod %s/%s for networkPolicy %s refetching err: "
4. Check the final log "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy"
5. With the new version no error message should be present
All the other cases are difficult to reproduce; running some standard network policy tests and making sure everything works will be a good verification.
Actual results:
Expected results:
Additional info:
Only some parts were backported to 4.10 due to significant differences between releases. The problems that are fixed, plus performance improvements:
1. Don't retry unscheduled pods; wait for the update event instead.
2. Clean up the pod handler for the namespaceAndPod handler on namespace delete events.
3. Only update localPods after a successful db transaction; return fast from localPod handlers based on `np.localPods`.
4. Don't retry fetching the lsp from the lspCache, which "stops" the handler for 1 second.
5. Use the stored portUUID to delete local pods instead of getting that info from the lspCache.
As per [1], the jsonnet code for managing thanos-ruler resources should reuse the upstream kube-thanos project.
This is a clone of issue OCPBUGS-5258. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5191. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:
—
Description of problem:
It looks like the ODC doesn't register the KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on the KnativeServing and KnativeEventing CRs, but the console is looking for the v1alpha1 version of them: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks the ODC discovery.
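For context, after that serverless-operator change the installed CRs are served at v1beta1, so a CR like the following minimal sketch no longer matches a console flag gate that still looks for v1alpha1 (the operator.knative.dev group is assumed from the Knative operator API; metadata values are illustrative):
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec: {}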
Version-Release number of selected component (if applicable):
Openshift 4.8, Serverless Operator 1.27
Additional info:
https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019
Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988
This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.
Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D
Version-Release number of selected component (if applicable):
4.12,4.13
How reproducible:
Every time
Steps to Reproduce:
1. Set up a dualstack KinD cluster
2. Create an egress firewall policy with the following spec (a full manifest sketch follows these steps):
   Spec:
     Egress:
       To:
         Cidr Selector: 0.0.0.0/0
       Type: Deny
3. Create a pod and ping 1.1.1.1
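For reference, a minimal sketch of the policy from step 2 as a full manifest, assuming the OVN-Kubernetes k8s.ovn.org/v1 EgressFirewall API (the namespace name is illustrative; the object must be named "default"):
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: test
spec:
  egress:
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0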
Actual results:
Egress policy does not block flows to external IP
Expected results:
Egress policy blocks flows to external IP
Additional info:
It seems that mixing IPv4 and IPv6 operands in ACL matches doesn't work.
Description of problem:
The IPI installation in some regions fails during bootstrap, with no nodes available/ready.
Version-Release number of selected component (if applicable):
12-22 16:22:27.970 ./openshift-install 4.12.0-0.nightly-2022-12-21-202045
12-22 16:22:27.970 built from commit 3f9c38a5717c638f952df82349c45c7d6964fcd9
12-22 16:22:27.970 release image registry.ci.openshift.org/ocp/release@sha256:2d910488f25e2638b6d61cda2fb2ca5de06eee5882c0b77e6ed08aa7fe680270
12-22 16:22:27.971 release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. try the IPI installation in the problem regions (so far tried and failed with ap-southeast-2, ap-south-1, eu-west-1, ap-southeast-6, ap-southeast-3, ap-southeast-5, eu-central-1, cn-shanghai, cn-hangzhou and cn-beijing)
Actual results:
Bootstrap failed to complete
Expected results:
Installation in those regions should succeed.
Additional info:
FYI the QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/166672/
No node is available/ready, and no operator is available.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          30m     Unable to apply 4.12.0-0.nightly-2022-12-21-202045: an unknown error has occurred: MultipleErrors
$ oc get nodes
No resources found
$ oc get machines -n openshift-machine-api -o wide
NAME                         PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
jiwei-1222f-v729x-master-0                                  30m
jiwei-1222f-v729x-master-1                                  30m
jiwei-1222f-v729x-master-2                                  30m
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication
baremetal
cloud-controller-manager
cloud-credential
cluster-autoscaler
config-operator
console
control-plane-machine-set
csi-snapshot-controller
dns
etcd
image-registry
ingress
insights
kube-apiserver
kube-controller-manager
kube-scheduler
kube-storage-version-migrator
machine-api
machine-approver
machine-config
marketplace
monitoring
network
node-tuning
openshift-apiserver
openshift-controller-manager
openshift-samples
operator-lifecycle-manager
operator-lifecycle-manager-catalog
operator-lifecycle-manager-packageserver
service-ca
storage
$
Master nodes don't run, for example, the kubelet and crio services.
[core@jiwei-1222f-v729x-master-0 ~]$ sudo crictl ps
FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory"
[core@jiwei-1222f-v729x-master-0 ~]$
The machine-config-daemon firstboot tells "failed to update OS".
[jiwei@jiwei log-bundle-20221222085846]$ grep -Ei 'error|failed' control-plane/10.0.187.123/journals/journal.log
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:18 localhost ignition[867]: failed to fetch config: resource requires networking
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <info> [1671726259.0329] hostname: hostname: hostnamed not used as proxy creation failed with: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <warn> [1671726259.0464] sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain ignition[891]: GET error: Get "https://api-int.jiwei-1222f.alicloud-qe.devcluster.openshift.com:22623/config/master": dial tcp 10.0.187.120:22623: connect: connection refused
...repeated logs omitted...
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-ctl[1888]: 2022-12-22T16:27:46Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-vswitchd[1888]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 dbus-daemon[1669]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1924]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1937]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[2037]: Error: Device '' not found.
Dec 22 08:35:32 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:35:32.477770 2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-910221290 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 rpm-ostree[2288]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: W1222 08:56:06.785425 2181 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511: Warning: The unit file, source configuration file or drop-ins of rpm-ostreed.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: error: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:57:31 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:57:31.244684 2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-4021566291 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:59:20 jiwei-1222f-v729x-master-0 systemd[2353]: /usr/lib/systemd/user/podman-kube@.service:10: Failed to parse service restart specifier, ignoring: never
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2437]: Error: open default: no such file or directory
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2450]: Error: failed to start API service: accept unixgram @00026: accept4: operation not supported
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman-kube@default.service: Failed with result 'exit-code'.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: Failed to start A template for running K8s workloads via podman-play-kube.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman.service: Failed with result 'exit-code'.
[jiwei@jiwei log-bundle-20221222085846]$
Description of problem:
Whereabouts doesn't allow the use of network interface names that are not preceded by the prefix "net", see https://github.com/k8snetworkplumbingwg/whereabouts/issues/130.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Define two Pods, one with the interface name 'port1' and the other with 'net-port1':
test-ip-removal-port1:
  k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "port1", "namespace": "default" } ]
test-ip-removal-net-port1:
  k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "net-port1", "namespace": "default" } ]
(A full Pod manifest sketch is included after these steps.)
2. IP allocated in the IPPool:
kind: IPPool
...
spec:
  allocations:
    "16":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-port1
    "17":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-net-port1
3. When the ip-reconciler job is run, the allocation for the port with the interface name 'port1' is removed:
[13:29][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 14m 11d
[13:29][]$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus 2001-1b70-820d-2610---64 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
...
spec:
allocations:
"17":
id: ...
podref: test-ecoloma-1/test-ip-removal-net-port1
range: 2001:1b70:820d:2610::/64
[13:30][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 9s 11d
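For reference, a minimal sketch of what one of the Pods from step 1 could look like with the secondary network attached. The annotation value comes from the steps above; the namespace is taken from the podref, while the container name, image, and command are illustrative assumptions:
apiVersion: v1
kind: Pod
metadata:
  name: test-ip-removal-port1
  namespace: test-ecoloma-1
  annotations:
    k8s.v1.cni.cncf.io/networks: '[ { "name": "test-sriovnd", "interface": "port1", "namespace": "default" } ]'
spec:
  containers:
  - name: sleeper
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    command: ["sleep", "infinity"]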
Actual results:
The IP allocation for the network interface whose name doesn't have a 'net' prefix is removed by the ip-reconciler cronjob.
Expected results:
The IP allocation must not be removed, regardless of the interface name.
Additional info:
Upstream PR @ https://github.com/k8snetworkplumbingwg/whereabouts/pull/147
master PR @ https://github.com/openshift/whereabouts-cni/pull/94
This is a clone of issue OCPBUGS-6622. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6018. The following is the description of the original issue:
—
This is a public clone of OCPBUGS-3821
The MCO can sometimes render a rendered-config in the middle of an upgrade with old MCs, e.g.:
This will cause the render controller to create a new rendered MC that uses the OLD kubeletconfig MC, which at best causes a double reboot for 1 node, and at worst blocks the update and breaks maxUnavailable nodes per pool.
This is a clone of issue OCPBUGS-11465. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11404. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11333. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10690. The following is the description of the original issue:
—
Description of problem:
According to PR https://github.com/openshift/cluster-monitoring-operator/pull/1824, the startupProbe for both UWM Prometheus and platform Prometheus should allow up to 1 hour, but the startupProbe for UWM Prometheus still only allows 15 minutes after enabling UWM (failureThreshold 60 x periodSeconds 15s = 15 minutes). Platform Prometheus does not have the issue; its startupProbe is increased to 1 hour (failureThreshold 240 x periodSeconds 15s = 1 hour).
$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi
      failureThreshold: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
...
$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi
      failureThreshold: 240
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-19-052243
How reproducible:
always
Steps to Reproduce:
1. Enable UWM (see the ConfigMap sketch below), then check the startupProbe for the UWM Prometheus and the platform Prometheus pods
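For step 1, a minimal sketch of the ConfigMap usually used to enable user workload monitoring, assuming the standard cluster-monitoring-config mechanism (not taken from this report):
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true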
Actual results:
startupProbe for UWM prometheus is still 15m
Expected results:
startupProbe for UWM prometheus should be 1 hour
Additional info:
Since the startupProbe for platform Prometheus is already increased to 1 hour and there is no similar bug for UWM Prometheus, closing this issue as "won't fix" would be OK.
Description of problem:
An intra-namespace allow network policy doesn't work after applying an ingress & egress deny-all network policy.
Version-Release number of selected component (if applicable):
OpenShift 4.10.12
How reproducible:
Always
Steps to Reproduce:
1. Define a deny-all network policy for egress and ingress in a namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
2. Define the following network policy to allow the traffic between the pods in the namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
3. Test the connectivity between two pods from the namespace.
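For step 3, a sketch of a simple pod pair that could be created in the same namespace as the policies to check connectivity (for example by curling the server pod IP on port 8080 from the client pod). The names, images, and port are illustrative assumptions:
apiVersion: v1
kind: Pod
metadata:
  name: np-server
  labels:
    app: np-server
spec:
  containers:
  - name: web
    image: registry.access.redhat.com/ubi8/python-39
    command: ["python3", "-m", "http.server", "8080"]
---
apiVersion: v1
kind: Pod
metadata:
  name: np-client
spec:
  containers:
  - name: client
    image: registry.access.redhat.com/ubi8/ubi-minimal:latest
    command: ["sleep", "infinity"]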
Actual results:
The connectivity is not allowed
Expected results:
The connectivity should be allowed between pods from the same namespace.
Additional info:
After performing a test and analyzing SDN flows for the namespace:
sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376
cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]
We see that the following rule doesn't match because `reg1` hasn't been defined:
cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
This bug is a backport clone of [Bugzilla Bug 2072040](https://bugzilla.redhat.com/show_bug.cgi?id=2072040). The following is the description of the original bug:
—
Description of problem:
configure-ovs resets network configuration on boot, and while doing so waits for all devices to become unmanaged in NetworkManager. Currently the patch port between br-int and br-ex created by ovn-kubernetes is managed by NetworkManager, never becomes unmanaged and causes an accumulated delay of 2 minutes on boot.
This patch port should never be managed by NetworkManager.
Origin tests for the bond-cni
Backport of https://github.com/openshift/origin/pull/27405
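For context, the origin tests exercise bond interfaces attached through Multus. A sketch of the kind of NetworkAttachmentDefinition involved, assuming the upstream bond-cni plugin's documented configuration keys; all names, member links, and addresses here are illustrative:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bond-net1
spec:
  config: '{
    "type": "bond",
    "cniVersion": "0.3.1",
    "name": "bond-net1",
    "mode": "active-backup",
    "failOverMac": 1,
    "linksInContainer": true,
    "miimon": "100",
    "links": [ { "name": "net1" }, { "name": "net2" } ],
    "ipam": { "type": "static", "addresses": [ { "address": "10.56.217.100/24" } ] }
  }'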
This is a clone of issue OCPBUGS-13792. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-13739. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-13692. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-13549. The following is the description of the original issue:
—
Description of problem:
Incorrect AWS ARN [1] is used for GovCloud and AWS China regions, which will cause the command `ccoctl aws create-all` to fail:
Failed to create Identity provider: failed to apply public access policy to the bucket ci-op-bb5dgq54-77753-oidc: MalformedPolicy: Policy has invalid resource status code: 400, request id: VNBZ3NYDH6YXWFZ3, host id: pHF8v7C3vr9YJdD9HWamFmRbMaOPRbHSNIDaXUuUyrgy0gKCO9DDFU/Xy8ZPmY2LCjfLQnUDmtQ=
Correct AWS ARN prefix:
GovCloud (us-gov-east-1 and us-gov-west-1): arn:aws-us-gov
AWS China (cn-north-1 and cn-northwest-1): arn:aws-cn
[1] https://github.com/openshift/cloud-credential-operator/pull/526/files#diff-1909afc64595b92551779d9be99de733f8b694cfb6e599e49454b380afc58876R211
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-05-11-024616
How reproducible:
Always
Steps to Reproduce:
1. Run the command `ccoctl aws create-all --name="${infra_name}" --region="${REGION}" --credentials-requests-dir="/tmp/credrequests" --output-dir="/tmp"` in a GovCloud region
Actual results:
Failed to create Identity provider
Expected results:
Create resources successfully.
Additional info:
Related PRs:
4.10: https://github.com/openshift/cloud-credential-operator/pull/531
4.11: https://github.com/openshift/cloud-credential-operator/pull/530
4.12: https://github.com/openshift/cloud-credential-operator/pull/529
4.13: https://github.com/openshift/cloud-credential-operator/pull/528
4.14: https://github.com/openshift/cloud-credential-operator/pull/526
When using cluster scaling while querying a URL from a pod on the side, all the while running some custom watches on endpoints and nodes, we observed the following: when the nodes scale down, for a few seconds before an event marks the node as NotReady and before the dns-default endpoint is removed from the endpoints list, a DNS query can fail.
We wrote a simple watcher (see below for details) to log this and got the following events:
DNS lookup failure:
Tue Oct 18 12:33:23 UTC 2022 - Lookup success
Tue Oct 18 12:33:28 UTC 2022 - DNS failure
Tue Oct 18 12:33:41 UTC 2022 - Lookup success
The node was not yet Not Ready and the endpoint was still in the list of endpoints at that time (ntrdy indicates a NotReadyEndpoint):
2022-10-18 12:33:21.712180649 +0000 UTC m=+1047.610174444 - ip-10-0-137-100.ec2.internal - MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
2022-10-18 12:33:39.11806612 +0000 UTC m=+1065.016059955 - ip-10-0-129-193.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.525574893 +0000 UTC m=+1065.423568712 - dns-default rdy: 10.128.0.2 rdy: 10.128.10.4 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.131.8.4
2022-10-18 12:33:39.526424974 +0000 UTC m=+1065.424418833 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
2022-10-18 12:33:39.528532869 +0000 UTC m=+1065.426526744 - ip-10-0-129-193.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.729859144 +0000 UTC m=+1065.627852917 - ip-10-0-150-205.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.936928994 +0000 UTC m=+1065.834922825 - ip-10-0-150-205.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:44.749587947 +0000 UTC m=+1070.647581767 - ip-10-0-188-175.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:44.952196646 +0000 UTC m=+1070.850190469 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.130.8.4 ntrdy: 10.131.8.4
2022-10-18 12:33:44.954865089 +0000 UTC m=+1070.852858965 - ip-10-0-188-175.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:45.159460169 +0000 UTC m=+1071.057454007 - ip-10-0-150-205.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:48.641412229 +0000 UTC m=+1074.539406059 - ip-10-0-188-175.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:48.846438064 +0000 UTC m=+1074.744431900 - ip-10-0-129-193.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:54.068542745 +0000 UTC m=+1079.966536563 - ip-10-0-150-205.ec2.internal - MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:34:31.752294563 +0000 UTC m=+1117.650288381 - ip-10-0-253-198.ec2.internal - MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
2022-10-18 12:34:39.531848219 +0000 UTC m=+1125.429842032 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
2022-10-18 12:34:39.736866622 +0000 UTC m=+1125.634860439 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4
2022-10-18 12:34:39.941934912 +0000 UTC m=+1125.839928742 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3
So we can observe that the node goes into 'Unknown' at 12:33:39, and the endpoint goes into Not Ready soon after.
Not sure if this is a logic problem of draining a node or an issue with the autoscaler at this point in time, but it fixes itself at the next lookup 5 seconds later.
—
Detailed breakdown of how this was reproduced:
1. A cluster with autoscaling enabled is required.
2. Deploy a DaemonSet that attempts to use DNS / HTTP in a loop, e.g. the following DaemonSet was used to test this:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dns-tester
  labels:
    app: dns-tester
spec:
  selector:
    matchLabels:
      app: dns-tester
  template:
    metadata:
      labels:
        app: dns-tester
    spec:
      containers:
      - name: dns-tester
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:72f2f7e906c321da6d6a00ce610780e8766e8432f7c553c5d03492f65fe5416c
        command: ["/bin/sh", "-c"]
        args: ['while true; do CURL=$(curl redhat.com 2>&1); if [[ "$CURL" == *"not resolve"* ]]; then echo `date` - "DNS failure"; else echo `date` - "Lookup success"; fi; sleep 5; done']
        resources:
          limits:
            cpu: 100m
            memory: 200Mi
3. Run the following go program against the same cluster (this is what watches the node and endpoint events for the dns-default endpoints):
package main

import (
	"context"
	"fmt"
	"path/filepath"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	_ "k8s.io/client-go/plugin/pkg/client/auth"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

const dnsNamespace = "openshift-dns"
const dnsEndpoint = "dns-default"

// nodeWatch logs the condition transitions of every node in the cluster.
// The WaitGroup is passed by pointer so Done/Wait operate on the same instance.
func nodeWatch(clientset *kubernetes.Clientset, waitGroup *sync.WaitGroup) {
	ctx := context.Background()
	defer waitGroup.Done()
	var nodes = clientset.CoreV1().Nodes()
	watcher, err := nodes.Watch(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	ch := watcher.ResultChan()
	for {
		event := <-ch
		node, ok := event.Object.(*corev1.Node)
		if !ok {
			fmt.Printf("%v", event)
			panic("Could not cast to nodes")
		}
		fmt.Printf("%v - %s - ", time.Now(), node.Name)
		for _, condition := range node.Status.Conditions {
			fmt.Printf(" %v - %v,", condition.Type, condition.Status)
		}
		fmt.Println()
	}
}

// dnsWatch logs the ready and not-ready addresses of the dns-default endpoints.
func dnsWatch(clientset *kubernetes.Clientset, waitGroup *sync.WaitGroup) {
	ctx := context.Background()
	defer waitGroup.Done()
	var api = clientset.CoreV1().Endpoints(dnsNamespace)
	watcher, err := api.Watch(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	ch := watcher.ResultChan()
	for {
		event := <-ch
		endpoints, ok := event.Object.(*corev1.Endpoints)
		if !ok {
			fmt.Printf("%v", event)
			panic("Could not cast to Endpoint")
		}
		fmt.Printf("%v - %v", time.Now(), endpoints.ObjectMeta.Name)
		for _, endpoint := range endpoints.Subsets {
			for _, address := range endpoint.Addresses {
				fmt.Printf(" rdy: %v", address.IP)
			}
			for _, address := range endpoint.NotReadyAddresses {
				fmt.Printf(" ntrdy: %v", address.IP)
			}
		}
		fmt.Println()
	}
}

func main() {
	// AUTHENTICATE
	var home = homedir.HomeDir()
	var kubeconfig = filepath.Join(home, ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err.Error())
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}
	wg := sync.WaitGroup{}
	wg.Add(2)
	go dnsWatch(clientset, &wg)
	go nodeWatch(clientset, &wg)
	wg.Wait()
}
4. Create simulated pressure on the nodes to force a scaleup - e.g. use the following deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resource-eater
spec:
  replicas: 4
  selector:
    matchLabels:
      app: resource-eater
  template:
    metadata:
      labels:
        app: resource-eater
    spec:
      containers:
      - name: resource-eater
        image: busybox:latest
        command: ["/bin/sh", "-c"]
        args: ["sleep 3600"]
        resources:
          requests:
            memory: "8Gi"
            cpu: "1000m"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store # note: this value does not match the app=resource-eater label above, so as written the anti-affinity has no spreading effect
            topologyKey: "kubernetes.io/hostname"
5. Wait for the scale up to happen.
6. Delete the deployment that created the node pressure, so the scale down can happen (this can easily take 15 minutes).
7. Observe the events in the watcher program and the logs for the daemonset - this should show the same behavior as detailed above.
we found a few logged bugs that seemed related to this issue affecting clusters on 4.8 through 4.10. Those bugs are as follows:
https://issues.redhat.com/browse/OCPBUGS-647
https://issues.redhat.com/browse/OCPBUGS-488
https://bugzilla.redhat.com/show_bug.cgi?id=2061244
Using the above-mentioned steps, we have been able to reliably reproduce the issue of DNS failures during autoscale-down in 4.10 clusters.
+++ This bug was initially created as a clone of Bug #2102632 +++
Version:
4.11.0-0.nightly-2022-06-28-160049
$ ./openshift-install version
./openshift-install 4.11.0-0.nightly-2022-06-28-160049
built from commit 6daed68b9863a9b2ecebdf8a4056800aa5c60ad3
release image registry.ci.openshift.org/ocp/release@sha256:b79b1be6aa4f9f62c691c043e0911856cf1c11bb81c8ef94057752c6e5a8478a
release architecture amd64
Platform:
GCP
IPI (automated install with `openshift-install`).
What happened?
During the cluster uninstall I received:
E0630 13:17:58.830361 271713 runtime.go:78] Observed a panic: runtime.boundsError{x:22, y:21, signed:true, code:0x1} (runtime error: slice bounds out of range [:22] with length 21)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x41d43c0?, 0xc0010637e8})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x18?})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x41d43c0, 0xc0010637e8})
/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).formatClusterIDForStorage(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:25
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageIDFilter(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:29
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageLabelOrClusterIDFilter(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:39 +0x1fe
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).listDisks(0xc0015dc900?)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:43 +0x1e
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyDisks(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:116 +0x36
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyCluster(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:174 +0x78e
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc000700000})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x19f06638?, 0xc0000721c0?}, 0xc00047d888?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x19f06638, 0xc0000721c0}, 0xc8?, 0x1108485?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x19f06638, 0xc0000721c0}, 0x40d687?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:566 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x19f06670?, 0xc00008b8c0?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:555 +0x46
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).Run(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:130 +0x519
main.runDestroyCmd({0x7fffe6a88d87, 0x9}, 0x0)
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:67 +0x92
main.newDestroyClusterCmd.func1(0xc000536780?, {0xc000906100?, 0x2?, 0x2?})
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:53 +0x7f
github.com/spf13/cobra.(*Command).execute(0xc000536780, {0xc0009060c0, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc00098db80)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:60 +0x29e
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
panic: runtime error: slice bounds out of range [:22] with length 21 [recovered]
panic: runtime error: slice bounds out of range [:22] with length 21
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x18?})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x41d43c0, 0xc0010637e8})
/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).formatClusterIDForStorage(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:25
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageIDFilter(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:29
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageLabelOrClusterIDFilter(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:39 +0x1fe
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).listDisks(0xc0015dc900?)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:43 +0x1e
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyDisks(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:116 +0x36
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyCluster(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:174 +0x78e
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc000700000})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x19f06638?, 0xc0000721c0?}, 0xc00047d888?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x19f06638, 0xc0000721c0}, 0xc8?, 0x1108485?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x19f06638, 0xc0000721c0}, 0x40d687?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:566 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x19f06670?, 0xc00008b8c0?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:555 +0x46
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).Run(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:130 +0x519
main.runDestroyCmd({0x7fffe6a88d87, 0x9}, 0x0)
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:67 +0x92
main.newDestroyClusterCmd.func1(0xc000536780?, {0xc000906100?, 0x2?, 0x2?})
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:53 +0x7f
github.com/spf13/cobra.(*Command).execute(0xc000536780, {0xc0009060c0, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc00098db80)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:60 +0x29e
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
Anything else we need to know?
Uninstall with openshift-install binary from OCP 4.10.16 worked fine.
— Additional comment from jmencak@redhat.com on 2022-06-30 12:15:49 UTC —
Created attachment 1893636 [details]Install/uninstall directory tar ball.
Adding install/uninstall directory tar ball.
— Additional comment from padillon@redhat.com on 2022-07-01 14:11:22 UTC —
Can we get an install config for the failing destroy?
— Additional comment from padillon@redhat.com on 2022-07-01 14:13:30 UTC —
Sorry. I see the install config is in the attachment. I thought that was only the destroy log.
— Additional comment from padillon@redhat.com on 2022-07-01 14:28:10 UTC —
Marking this as blocker+. It looks like https://github.com/openshift/installer/pull/5976 introduced a regression when destroying disks. We should have a PR to fix up today.
— Additional comment from eparis@redhat.com on 2022-07-01 15:00:11 UTC —
This bug sets blocker+ without setting a Target Release. This is an invalid state as it is impossible to determine what is being blocked. Please be sure to set Priority, Severity, and Target Release before you attempt to set blocker+
— Additional comment from padillon@redhat.com on 2022-07-01 17:56:37 UTC —
For QE: This error would occur after installing and provisioning PV.
— Additional comment from aos-team-art-private@redhat.com on 2022-07-05 04:27:36 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.
— Additional comment from jmencak@redhat.com on 2022-07-07 08:38:59 UTC —
I still see the same issue with the latest nightly 4.11.0-0.nightly-2022-07-06-145812. Is the fix included there?
— Additional comment from padillon@redhat.com on 2022-07-07 12:51:25 UTC —
We didn't cherry-pick this fix into 4.11 so it is not in the nightlies. You should be able to check it against a master build. We will cherry-pick to 4.11 now.
$ oc adm release extract --tools registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-06-145812
$ tar -xvf openshift-install-linux-4.11.0-0.nightly-2022-07-06-145812.tar.gz
README.md
openshift-install
$ ./openshift-install version
./openshift-install 4.11.0-0.nightly-2022-07-06-145812
built from commit b2e7be726e400022e71ef3b8bd01a2093e53bc5a
release image registry.ci.openshift.org/ocp/release@sha256:616c5fefa87d116dd2440c75d9832c462078d635ed155c8d6cd486dd09540184
release architecture amd64
$ git show b2e7be726e400022e71ef3b8bd01a2093e53bc5a
commit b2e7be726e400022e71ef3b8bd01a2093e53bc5a (upstream/release-4.11)
Merge: 6daed68b9 2426260d5
Author: openshift-ci[bot] <75433959+openshift-ci[bot]@users.noreply.github.com>
Date: Thu Jun 30 22:11:44 2022 +0000
Merge pull request #6060 from mike-nguyen/dnm_411_test
Bug 2093126: bump RHCOS 4.11 boot image metadata
This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:
—
Frequently we see the loading state of the topology view, even when there aren't many resources in the project.
Including an example
topology will sometimes hang with the loading indicator showing indefinitely
topology should load consistently without fail
intermittent
4.9
This bug card represents work done in https://issues.redhat.com/browse/CCO-257 to set STS endpoints to regional in AWS credentials secrets and is created to facilitate backporting the change to previous releases as required by the backport process [1].
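For illustration, the kind of AWS credentials secret payload this work targets, with the regional STS setting added. This is a sketch; the secret name, namespace, and role ARN are assumptions, and the sts_regional_endpoints line is the relevant change:
apiVersion: v1
kind: Secret
metadata:
  name: installer-cloud-credentials
  namespace: openshift-image-registry
stringData:
  credentials: |
    [default]
    sts_regional_endpoints = regional
    role_arn = arn:aws:iam::123456789012:role/example-installer-role
    web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token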
This is a clone of issue OCPBUGS-11489. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11348. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11329. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-11158. The following is the description of the original issue:
—
The Mailer Plugin (mailer) version 435.438.v5b_81173f5b_a_1 is not compatible with the Pipeline: Basic Steps (workflow-basic-steps) plugin version 2.20.
Both plugins need to be updated to newer versions at the same time per https://github.com/jenkinsci/mailer-plugin/releases/tag/435.v79ef3972b_5c7
This is a clone of issue OCPBUGS-2077. The following is the description of the original issue:
—
Description of problem:
The Pipeline list page fetches all the PipelineRuns to find the last pipeline run, which results in longer load times. This performance issue needs to be addressed in all the pipelines list pages wherever applicable.
Version-Release number of selected component (if applicable):
4.9
How reproducible:
Always
Steps to Reproduce:
1. Create 10+ pipelines in a namespace
2. Create more number of pipelineruns under each pipeline
3. Navigate to the pipelines list page.
Actual results:
Pipelines list will take a long time to load the list.
Expected results:
The pipeline list should not take a long time to load.
Additional info:
Reduce the amount of data fetched to find the last PipelineRun; maybe use PartialMetadata to find the latest pipeline run and improve the performance.
This is a clone of issue OCPBUGS-1786. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-1677. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)
This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
OCPBUGS-1678 is about updating the code so that the test uses a mock response instead of the latest registry content, OR checks some specific attributes instead of comparing the full JSON response.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
This is a request to back port the fix in OCPBUGS-1718 to Openshift 4.10.
Description cloned from that bug:
Description of problem:
prometheus-k8s-0 ends in CrashLoopBackOff with level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0" on SNO after hard reboot tests
Version-Release number of selected component (if applicable):
4.11.6
How reproducible:
Not always, after ~10 attempts
Steps to Reproduce:
1. Deploy SNO with Telco DU profile applied
2. Hard reboot node via out of band interface
3. oc -n openshift-monitoring get pods prometheus-k8s-0
Actual results:
NAME               READY   STATUS             RESTARTS          AGE
prometheus-k8s-0   5/6     CrashLoopBackOff   125 (4m57s ago)   5h28m
Expected results:
Running
Additional info:
Attaching must-gather. The pod recovers successfully after deleting/re-creating.
[kni@registry.kni-qe-0 ~]$ oc -n openshift-monitoring logs prometheus-k8s-0
ts=2022-09-26T14:54:01.919Z caller=main.go:552 level=info msg="Starting Prometheus Server" mode=server version="(version=2.36.2, branch=rhaos-4.11-rhel-8, revision=0d81ba04ce410df37ca2c0b1ec619e1bc02e19ef)"
ts=2022-09-26T14:54:01.919Z caller=main.go:557 level=info build_context="(go=go1.18.4, user=root@371541f17026, date=20220916-14:15:37)"
ts=2022-09-26T14:54:01.919Z caller=main.go:558 level=info host_details="(Linux 4.18.0-372.26.1.rt7.183.el8_6.x86_64 #1 SMP PREEMPT_RT Sat Aug 27 22:04:33 EDT 2022 x86_64 prometheus-k8s-0 (none))"
ts=2022-09-26T14:54:01.919Z caller=main.go:559 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-26T14:54:01.919Z caller=main.go:560 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-26T14:54:01.921Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090
ts=2022-09-26T14:54:01.922Z caller=main.go:989 level=info msg="Starting TSDB ..."
ts=2022-09-26T14:54:01.924Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-26T14:54:01.926Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:862 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:899 level=info msg="Stopping scrape manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:858 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:891 level=info msg="Scrape manager stopped"
ts=2022-09-26T14:54:01.926Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:844 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:1120 level=info msg="Notifier manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:1129 level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0"
Description of problem:
Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"
Version-Release number of selected component (if applicable): 4.10.22
How reproducible: Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Steps to Reproduce:
Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Actual results: master nodes do not come up.
Expected results: master nodes should come up despite that the text "master" is not in their hostname.
Additional info:
The code for the cluster-baremetal-operator at the following link:
The following condition is concerning:
if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0
The packages reveal that bmh.Name references the name inside the metadata of the BMH object.
Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so the following slice is never created:
macs = append(macs, bmh.Spec.BootMACAddress)
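For illustration, a BareMetalHost for a control-plane machine whose name does not contain "master" would never satisfy that condition even though BootMACAddress is set. A minimal sketch using the metal3.io/v1alpha1 API; the host name, MAC, and BMC details are illustrative:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: ctrl-plane-0
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: 52:54:00:00:00:01
  bmc:
    address: ipmi://192.168.111.1:6230
    credentialsName: ctrl-plane-0-bmc-secret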
This is a clone of issue OCPBUGS-9986. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7445. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7207. The following is the description of the original issue:
—
At some point in the mtu-migration development a configuration file was generated at /etc/cno/mtu-migration/config which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on behalf of it.
But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information while the procedure is still in progress, making it use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small", blocking the procedure itself, or causing network disruption until the procedure is over.
However, this was not a problem for the CI job, as it doesn't use the migration procedure as documented for the sake of saving the limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress, which hides this issue.
This was probably not detected in QE as well for the same reason as CI.
This is a clone of issue OCPBUGS-11163. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10976. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10934. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10917. The following is the description of the original issue:
—
Description of problem:
Product security has set a required Jenkins version to 2.387.1 for June 6th, 2023
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-15645. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-15643. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-15606. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-15497. The following is the description of the original issue:
—
I am using a BuildConfig with git source and the Docker strategy. The git repo contains a large zip file via LFS and that zip file is not getting downloaded. Instead, just the ASCII metadata is getting downloaded. I've created a simple reproducer (https://github.com/selrahal/buildconfig-git-lfs) on my personal GitHub. If you clone the repo
git clone git@github.com:selrahal/buildconfig-git-lfs.git
and apply the bc.yaml file with
oc apply -f bc.yaml
Then start the build with
oc start-build test-git-lfs
You will see the build fails at the unzip step in the docker file
STEP 3/7: RUN unzip migrationtoolkit-mta-cli-5.3.0-offline.zip
End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive.
I've attached the full build logs to this issue.
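For reference, a minimal sketch of the kind of BuildConfig described above (git source plus Docker strategy). The repository URL and build name come from the report; the output image stream is an illustrative assumption:
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: test-git-lfs
spec:
  source:
    type: Git
    git:
      uri: https://github.com/selrahal/buildconfig-git-lfs
  strategy:
    type: Docker
    dockerStrategy: {}
  output:
    to:
      kind: ImageStreamTag
      name: test-git-lfs:latest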
+++ This bug was initially created as a clone of Bug #2118717 +++
https://bugzilla.redhat.com/show_bug.cgi?id=2118717
Description of problem:
This BZ is a spin-off of BZ-2114945 so we can track possible issues with new TCP connections from pods failing to be created on the nodes, leading to pods being unable to start or crashing.
Version-Release number of selected component (if applicable):
OCP 4.10.24 with OVN-Kubernetes
How reproducible:
Periodically, and only at the customer so far.
— Additional comment from Andre Costa on 2022-08-10 16:30:00 UTC —
There are 3 must-gathers here that were gathered during the issues and after the restart of the OVNK masters, which makes all these issues go away and pods start connections immediately.
This must-gather was taken at 11 AM today when they received a report from one of the customers:
Customer reported the issue again and this time we also got sosreport and inspect from the project.
In the pod they get errors like this (the same we saw on the call last week with them, where it seems no TCP connection entries are created at all. First we thought it was DNS, but even with IPs used directly there were issues like this):
-----------------
mx-toni-dev toni-dev-build 0/1 Error 0 18m 10.195.80.253 demchdc5vvx <none> <none>
[z0003rbj-z07@stuart ~]$ oc logs toni-dev-build
time="2022-08-10T10:59:02Z" level=info msg="Start building app with registry type openshift"
time="2022-08-10T10:59:02Z" level=info msg="Adding ssl certificate /etc/ssl/certs/ca-bundle.crt"
time="2022-08-10T10:59:02Z" level=info msg="Certificate /etc/ssl/certs/ca-bundle.crt has been added successfully"
time="2022-08-10T10:59:02Z" level=info msg="Updating docker config with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Docker config has been updated with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Downloading MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042"
time="2022-08-10T10:59:32Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout"
-----------------------
This keeps happening if they continue to run the builds, which they did, and created the must-gather and sosreport:
And as we have seen so far, restarting the ovnk-master pods makes these connections work immediately again:
— Additional comment from Andre Costa on 2022-08-10 16:30:43 UTC —
— Additional comment from Tim Rozet on 2022-08-10 22:36:06 UTC —
Thanks for the must-gathers. From Flavio and I examining them, there is definitely a bug here in ovn-kube. The toni-dev-build pod is deleted/recreated multiple times, and during this time it moves to different nodes. However, due to a bug in OVNK, this port is updated with the new IP address and information as if it were moving to the new node, but it stays on the previous logical switch. So for example, this is what happens:
1. The pod is originally assigned to node demchdc6zax. This node's cluster subnet is 10.195.79.0/24:
2022-08-09T09:22:57.078688727+00:00 stderr F I0809 09:22:57.078632 2239319 cni.go:248] [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1] ADD finished CNI request [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1], result "{\"interfaces\":[
{\"name\":\"b1a4fb0be20ff71\",\"mac\":\"a6:68:38:ad:66:c8\"},
{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:4f:52\",\"sandbox\":\"/var/run/netns/e354d2d5-83cb-406f-a2d9-c5f3e786bae4\"}],\"ips\":[
{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.79.82/24\",\"gateway\":\"10.195.79.1\"}],\"dns\":{}}", err <nil
2. Over time this pod is completed, deleted, recreated many times. Until eventually it lands on demchdc5vvx the next day:
2022-08-10T08:43:58.428759111Z I0810 08:43:58.428719 1837017 cni.go:248] [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a] ADD finished CNI request [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a], result "{\"interfaces\":[
,
{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:50:1e\",\"sandbox\":\"/var/run/netns/bae8f77a-b368-4b3a-86dc-df925330fa26\"}],\"ips\":[
{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.80.30/24\",\"gateway\":\"10.195.80.1\"}],\"dns\":{}}", err <nil>
3. Although it lands on a new node, OVNK updates the old port (somehow the old port is not removed), which is still attached to the old switch:
[root@fedora ~]# ovn-nbctl list logical_switch_port c345cc07-8a89-4e70-beff-d8d9f4dac46a
_uuid : c345cc07-8a89-4e70-beff-d8d9f4dac46a
addresses : ["0a:58:0a:c3:50:1e 10.195.80.30"]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids :
ha_chassis_group : []
name : mx-toni-dev_toni-dev-build
options :
parent_name : []
port_security : ["0a:58:0a:c3:50:1e 10.195.80.30"]
tag : []
tag_request : []
type : ""
up : false
[root@fedora ~]# ovn-nbctl lsp-list demchdc6zax | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
c345cc07-8a89-4e70-beff-d8d9f4dac46a (mx-toni-dev_toni-dev-build)
[root@fedora ~]# ovn-nbctl lsp-list demchdc5vvx | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
[root@fedora ~]#
This will cause the pod not to be able to send any traffic as its IP is in the wrong subnet for this switch.
4. Additionally the default node SNAT for this pod is in the right place:
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc6zax | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5vvx | grep 10.195.80.30
snat 139.25.144.25 10.195.80.30
5. But there is no egress IP reroute or SNAT entry for this pod:
Egress IP:
status:
items:
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 139.25.144.72
snat 139.25.144.72 10.195.77.156
snat 139.25.144.72 10.195.80.40
snat 139.25.144.72 10.195.76.65
snat 139.25.144.72 10.195.76.184
snat 139.25.144.72 10.195.80.42
snat 139.25.144.72 10.195.80.156
6. We see in the ovnkube-master logs that ovnk attempts to delete this pod, but it fails because we try to delete a logical switch port that is still bound to the wrong logical switch:
2022-08-10T09:07:32.033303027Z I0810 09:07:32.033270 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.80.30]}}] Timeout:<nil> Where:[where column _uuid == ] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:c345cc07-8a89-4e70-beff-d8d9f4dac46a}]}}] Timeout:<nil> Where:[where column _uuid == {f5073eaa-3f72-4ec2-94c3-3744d412864a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c345cc07-8a89-4e70-beff-d8d9f4dac46a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
The transaction fails over and over with the same error:
referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
2022-08-11T08:26:24.264204783Z time="2022-08-11T08:26:24Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout"
— Additional comment from Tim Rozet on 2022-08-11 21:19:12 UTC —
We were able to reproduce the issue locally. There are two potential paths that can cause a stale logical switch port to be re-used on the wrong node:
Scenario 1: pod is created, deleted, and recreated on another node extremely quickly. This plays out like this:
Events:
1. pod toni is created on node A
2. pod toni is deleted on node A
3. pod toni is recreated on node B
What happens in ovnk:
1. pod toni is created on node B
2. pod toni is deleted on node A (fails)
This happens because we grab the latest version of the pod to add in event 1, and by the time we grab it, it is actually the value from event 3. Then, after processing event 1, we move to event 2, and when we delete the pod we use the object given to us in the event, which is now incorrect for removing the pod from node A, because the pod is actually on node B. The fix is to ignore the value in the pod spec on deletion and use what we store in our internal port cache. If there is no entry in the cache, we search the OVN NBDB to find the right switch (this is an expensive operation, so we want to avoid it when possible).
Flavio is working on a fix for this.
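For illustration, a rough sketch of how the stale binding from scenario 1 can be confirmed on a live cluster (the pod, namespace, and node names are reused from this report; ovn-nbctl is run from an ovnkube-master pod):
~~~
# Which node does Kubernetes think the pod is on right now?
oc -n mx-toni-dev get pod toni-dev-build -o wide
# Which logical switch actually holds the pod's logical switch port?
ovn-nbctl --bare --columns=_uuid find logical_switch_port name=mx-toni-dev_toni-dev-build
ovn-nbctl lsp-list demchdc5vvx | grep mx-toni-dev_toni-dev-build   # expected: present on the current node's switch
ovn-nbctl lsp-list demchdc6zax | grep mx-toni-dev_toni-dev-build   # bug: the LSP is still listed on the old node's switch
~~~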
Scenario 2: pod is created, runs to completion, is deleted very quickly, and then is recreated on another node:
Events:
1. pod toni is created on node A
2. pod toni runs to completion
3. pod toni immediately is deleted
4. pod toni is recreated on node B
What happens in ovnk:
1. pod toni is created on node A
2. completion causes an update event, ovnk does not delete the pod (bug)
3. deletion event is processed, but since it is a completed pod, we ignore it (since we should have already deleted it in step 2)
4. pod toni is recreated on node B; ovnk sees there is already a switch port for toni on the wrong switch, and just updates that port with the new information
This happens when a pod goes to completed but is deleted very quickly. In step 2, during the update event, we try to grab the latest version of the pod, but it doesn't exist anymore since it was deleted. In this case we skip the update instead of tearing down the pod. The delete code in step 3 assumes that if the pod is completed we must have already handled it in the update, so the stale port stays around until a later add re-uses it. Patryk already has a fix for this:
https://github.com/ovn-org/ovn-kubernetes/pull/3071
There may still be more egress IP issues to investigate (tracked in other bugs) after fixing this, and we will look into those after fixing these fundamental pod issues.
— Additional comment from Tim Rozet on 2022-08-11 22:19:30 UTC —
Andre, re: comment 5. This looks like the same issue. If you have a must gather we can confirm. I think the sosreport from the worker node is not enough. The referential integrity violation occurs because we attempt to do these operations:
1. remove the logical switch port from the logical switch
2. delete the logical switch port
In this case:
1. we remove the logical switch port from the logical switch, but the switch port actually exists on a different switch (node), so this is a no-op
2. we try to delete the logical switch port; NBDB complains that this is a violation and refuses to delete it, because another switch still holds a reference to this object
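For reference, a hedged sketch of how to find which switch still holds the stale LSP (the loop is mine; the UUID is the one from the violation above); a plain lsp-del on the port would then remove it from whichever switch contains it:
~~~
lsp=d300fdb7-d337-4c64-8e31-7ff02889d9fb   # LSP UUID reported in the referential integrity violation
for sw in $(ovn-nbctl --bare --columns=name list logical_switch); do
  ovn-nbctl lsp-list "$sw" | grep -q "$lsp" && echo "still referenced by switch: $sw"
done
# 'ovn-nbctl lsp-del <port-name>' deletes the port together with its switch reference in one step.
~~~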
— Additional comment from Anurag saxena on 2022-08-12 21:16:56 UTC —
@rbrattai@redhat.com Can you help verify this and the backports? Feel free to re-assign to someone else if needed while I am on PTO. Thanks!
Description of problem:
clone of https://bugzilla.redhat.com/show_bug.cgi?id=2076307
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1104. The following is the description of the original issue:
—
Description of problem:
In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 is upgraded to 4.9, the packageserver stays stuck at v0.17.0 (the version in OCP 4.8) and v0.18.3 (the version in OCP 4.9) does not roll out.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.8
2. Upgrade to OCP 4.9
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-08-31-160214   True        True          50m     Working towards 4.9.47: 619 of 738 done (83% complete)
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.47    True        False         4m26s   Cluster version is 4.9.47
Actual results:
Check the packageserver CSV. It is still at v0.17.0:
$ oc get csv
NAME            DISPLAY          VERSION   REPLACES   PHASE
packageserver   Package Server   0.17.0               Succeeded
Expected results:
packageserver CSV is at 0.18.3
Additional info:
packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12
packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8
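As an aside, a quick post-upgrade check might look like this (a sketch, not from the original report):
~~~
oc -n openshift-operator-lifecycle-manager get csv packageserver -o jsonpath='{.spec.version}{"\n"}'
# should print 0.18.3 after the 4.9 upgrade; the package-server-manager logs are also worth a look:
oc -n openshift-operator-lifecycle-manager logs deployment/package-server-manager --tail=20
~~~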
Description of problem:
To avoid any potential bugs, the oVirt CSI driver should use the latest go-ovirt-client, preferably the tagged 1.0.0 version.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We need to include the `openshift_apps_deploymentconfigs_strategy_total` metrics to the IO archive file.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a cluster
2. Download the IO archive
3. Check the file `config/metrics`
4. You must find `openshift_apps_deploymentconfigs_strategy_total` inside of it
Actual results:
Expected results:
You should see the `openshift_apps_deploymentconfigs_strategy_total` at the `config/metrics` file.
Additional info:
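One possible way to run steps 2-4 (a sketch; the in-pod archive location is an assumption and may differ between versions):
~~~
pod=$(oc -n openshift-insights get pods -o name | grep insights-operator | head -n1)
oc -n openshift-insights exec "$pod" -- ls /var/lib/insights-operator/
# copy one of the listed archives locally, extract it, and grep for the metric, e.g.:
#   oc -n openshift-insights cp "${pod#pod/}:/var/lib/insights-operator/<archive>" ./archive.tar.gz
#   tar -xzf archive.tar.gz && grep openshift_apps_deploymentconfigs_strategy_total config/metrics
~~~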
Description of problem:
Upgrade to 4.10 is stuck looping in syncEgressFirewall. We see transacting operations failing with "context deadline exceeded"; it looks to be trying to process 2.8 million records in one go.
2023-02-21T19:55:06.514097513Z I0221 19:55:06.435220 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:6a3ad543-a77d-4700-83b8-5ccae6b2d067} {GoUUID:1c5297ff-8588-467a-93f4-22f22d609563} {GoUUID:f6288ed3-3928-45a8-ae57-40ed94cfa249} {GoUUID:04bf90c2-fde1-4a10-baaa-6a3f1d8e2931} {GoUUID:c6609536-857c-48ae-9125-9505753180a8} {GoUUID:c79b4398-d7cc-4dcf-8c1d-11484f318324} {GoUUID:4323ac2c-033e-43c3-885b-e951cd7a4159} {GoUUID:7b316a80-076f-4266-b7d2-bd69b1d4b874} {GoUUID:57dfecb2-2f94-4cd8-a277-8b28205e1048} {GoUUID:2c039f15-ff11-4ceb-aa82-bcbe82fc86d1} {GoUUID:063c4121-73c3-4d53-a89d-1063e775146b} {GoUUID:25c788e3-6146-4571-98bf-61010100a22a} {GoUUID:3d3c150f-1296-4d91-b334-506f28bff4bd}]}}] Timeout:<nil> Where:[where column _uuid == {ba9652de-5aae-4a74-a512-29f775e38c19}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded
2023-02-21T19:55:18.739739417Z E0221 19:55:18.643127 1 master.go:1369] Failed (will retry) in syncing syncEgressFirewall: failed to remove reject acl from node logical switches: error while removing ACLS: [6a3ad543-a77d-4700-83b8-5ccae6b2d067 8e004991-0382-455f-9901-33ef724acbc2
Everything is built into one operation via: https://github.com/openshift/ovn-kubernetes/blob/release-4.10/go-controller/pkg/libovsdbops/switch.go#L243
TransactAndCheck is being called with a 10s timeout and this operation never completes.
Version-Release number of selected component (if applicable):
4.10.50
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Upgrade completes
Additional info:
Description of problem:
The customer uses the `clusterautoscaler` in their cluster to scale the nodes up and down automatically according to requirements and load.
However, it has been noticed that when the test load Pods are removed, the nodes do not scale down.
Version-Release number of selected component (if applicable):
OpenShift Version: 4.10.20
How reproducible:
Easily reproducible (see the ClusterAutoscaler API specification: https://docs.openshift.com/container-platform/4.10/rest_api/autoscale_apis/clusterautoscaler-autoscaling-openshift-io-v1.html#specification)
Expected results:
When there is no load, the nodes should scale down. I waited a very long time for the ClusterAutoscaler to scale down, but this never occurs. The ClusterAutoscaler controller Pod's logs keep saying that no nodes are eligible to scale down, despite the nodes being quite idle.
Additional info:
This Bugzilla is looking similar:
https://bugzilla.redhat.com/show_bug.cgi?id=2053343
Interestingly, I could not find the OLM and redhat-operator pod scheduled on any node other than the master.
Kindly have a look at the attached file for pod details.
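For context, scale-down has to be explicitly enabled in the ClusterAutoscaler resource; a minimal sketch (the threshold values shown are illustrative, not the customer's):
~~~
cat << EOF | oc apply -f -
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true          # without this, nodes are never scaled down
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
    unneededTime: 5m
EOF
~~~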
*USER STORY:*
As a customer or OpenShift engineer, I want to see the user agent for anything calling from OpenShift -> vSphere to eliminate troubleshooting guesswork.
*DESCRIPTION:*
A question in #forum-vmware was raised where we identified that the user-agent may not be configured for all OpenShift components calling to vSphere API.
https://coreos.slack.com/archives/CH06KMDRV/p1627368902058800
*Required:*
Audit of OpenShift components calling to vSphere API to make sure user agent strings are set appropriately.
*Nice to have:*
How can this be prevented in the future? How can we minimize maintenance costs added by new PRs/bugs reported from this spike?
*ACCEPTANCE CRITERIA:*
New PRs or bug reports for each affected component.
This bug is a backport clone of [Bugzilla Bug 2019564](https://bugzilla.redhat.com/show_bug.cgi?id=2019564). The following is the description of the original bug:
—
Description of problem:
On OpenShift Dev Sandbox, users are automatically deleted after a month. On other clusters we should also clean up resources when a user is deleted.
Version-Release number of selected component (if applicable):
4.8+
How reproducible:
Always
Steps to Reproduce:
1. Create a user
2. Use the console (this should automatically create 3 resources in the namespace "openshift-console-user-settings": a Role, a RoleBinding, and a ConfigMap); see the sketch after these steps
3. Delete the user
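A quick way to observe the leftovers after step 3 (a minimal sketch, no specific user name assumed):
~~~
oc -n openshift-console-user-settings get role,rolebinding,configmap
oc get users   # the deleted user is gone, yet the three per-user resources above remain
~~~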
Actual results:
The three created resources in namespace "openshift-console-user-settings" stay forever
Expected results:
The three created resources in namespace "openshift-console-user-settings" should be automatically removed
Additional info:
None
copy of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2053622
Description of problem:
PodDisruptionBudgetAtLimit Warning alert when CR replica count is zero.
Version-Release number of selected component (if applicable):
4.7
How reproducible: Every time
Steps to Reproduce:
1. oc new-project test
2. oc new-app httpd
3. oc create -f pdb
$ cat pdb.yaml
~~~
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      deployment: httpd
~~~
$ oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
my-pdb N/A 0 0 3h27m
4. oc scale deployment httpd --replicas=0
5. Wait for some time alert will be triggered at the console.
Actual results: unexpected warning alert
Expected results: As we are intentionally scaling the replicas down, it should not generate an alert.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4851. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4850. The following is the description of the original issue:
—
Description of problem:
Kuryr might take a while to create Pods because it has to create Neutron ports for the pods. If a pod gets deleted while this is being processed, a warning Event will be generated, causing the "[sig-network] pods should successfully create sandboxes by adding pod to network" test to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1942. The following is the description of the original issue:
—
Description of problem:
Bump the Jenkins version to 2.361.1 and also test the built images by running the verify-jenkins.sh script, which verifies the Jenkins version and plugins in an image. The verify script is available at https://gist.githubusercontent.com/coreydaley/fbf11d3b1a7a567f8c494da6a07bad41/raw/80e569131479c212d5e023bc41ce26fb15a17752/verify-jenkins.sh
Version-Release number of selected component (if applicable):
2.361.1
Additional info:
Description of problem:
Add the ability to run unit test and linter jobs in downstream ovn-kubernetes
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
[ovn] [ocp 4.10.z] Service `spec.externalTrafficPolicy` does not trigger rules update in ovnkube-node pod handlers on edit, even though it does successfully update the rules if deployed explicitly with that spec value set, or if you delete the handler pods for ovn (forces a refresh).
Version-Release number of selected component (if applicable):
observed in 4.10.32 and 4.10.40, tested on azure platform.
How reproducible:
every time
Steps to Reproduce:
1. Deploy a test pod with a curlable resource in a test namespace
2. Create a service from yaml exposing the pod at an internal clusterIP (example yaml provided by the customer below)
~~~
apiVersion: v1
kind: Service
metadata:
  labels:
    run: test
  name: test
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: paas1
spec:
  allocateLoadBalancerNodePorts: true
  externalTrafficPolicy: Cluster ##MODIFY THIS SPEC VALUE AND OBSERVE FAIL CONDITION
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    run: test
  sessionAffinity: None
  type: LoadBalancer
~~~
3. curl against the service succeeds
4. Edit the service to change `spec.externalTrafficPolicy: Local`
5. Observe the externalIP does not change, but the healthz port updates
6. curl against the same externalIP:port times out indefinitely, no response.
//workaround: delete the service and redeploy with the spec line already set to `Local`, or delete the ovnkube-node pod serving the pod(s) to force a refresh of the local ruleset and allow traffic (subsequent curls will succeed).
Actual results:
spec change appears to update properly in the database but does not send a notification to update the ovnkube-node pod handlers (or similar) to allow traffic through once the externalTrafficPolicy spec value is changed.
Expected results:
A spec change to the service yaml should be immediately updated in the DB AND should update the ovnkube-node handlers accordingly.
Additional info:
Attachments available and case number with specifics in next internal comment.
This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:
—
The vSphere CSI cloud.conf lists the single datacenter from the platform workspace config, but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918 ) there may be more than one datacenter.
This issue is resulting in PVs failing to attach because the virtual machines can't be found in any other datacenter. For example:
0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found
The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.
$ oc get cm vsphere-csi-config -o yaml -n openshift-cluster-csi-drivers | grep datacenters
datacenters = "IBMCloud"
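For comparison, in a multi-zone setup one would expect the generated config to list every datacenter that hosts cluster nodes; a hypothetical sketch (the datacenter names and the comma-separated form are assumptions, not taken from this cluster):
~~~
oc get cm vsphere-csi-config -n openshift-cluster-csi-drivers -o yaml | grep datacenters
# hypothetically expected with multiple zones:
#   datacenters = "IBMCloud, datacenter-2"
~~~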
Description of problem:
NodePort port not accessible
Version-Release number of selected component (if applicable):
OCP 4.8.20
How reproducible:
$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry
$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
In a newly created namespace the same deployment works:
[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000
[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000
[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s
[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1523. The following is the description of the original issue:
—
Description of problem:
In a completely disconnected cluster, the developer catalog takes too much time to load
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. A completely disconnected cluster
2. In the Add page, go to the All services page
3.
Actual results:
Taking too much time to load
Expected results:
Time taken should be reduced
Additional info:
Attached a gif for reference
Description of problem:
When a custom machineConfigPool is created and no node is associated with it, the mcp remains at 0% progress.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a custom mcp:
~~~
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: custom
spec:
  machineConfigSelector:
    matchExpressions:
    - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,custom]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/custom: ""
EOF
~~~
Actual results:
The mcp is visible from "Administrator view > Cluster Settings > Details" at 0% progress
Expected results:
It shouldn't be stuck at 0%
Additional info:
This is a clone of issue OCPBUGS-2451. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-2181. The following is the description of the original issue:
—
Description of problem:
The E2E test "Installs Red Hat Integration - 3scale operator" is failing due to a change of the Operator name
Description of problem:
For some reason, the LSP of a pod is not properly added to the port group where the ACL of a NetworkPolicy is applied. This results in the NetworkPolicy not being applied to the pod, and communication is not possible.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always with a concrete pod at customer environment.
Steps to Reproduce:
(not known exactly yet)
Actual results:
LSP not in port group. ACL not applied. Netpol not in effect.
Expected results:
LSP in port group. ACL applied. Netpol in effect.
Additional info:
Details in private comments, as they involve sensitive data. Deleting the pod does nothing, but it is possible that this has something to do with the pod being recreated with the same name (although the LSP UUIDs are different in each incarnation).
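For reference, a sketch of the kind of check used here to see whether a pod's LSP is a member of the NetworkPolicy's port group (namespace/pod names are placeholders; run from an ovnkube-master pod):
~~~
ns=my-namespace; pod=my-pod    # placeholders for the affected pod
lsp_uuid=$(ovn-nbctl --bare --columns=_uuid find logical_switch_port "name=${ns}_${pod}")
echo "LSP uuid: $lsp_uuid"
# the NetworkPolicy's ACLs are applied to a port group; the LSP UUID should appear in its ports column
ovn-nbctl --columns=name,ports list port_group | grep -B1 "$lsp_uuid" || echo "LSP not in any port_group"
~~~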
Description of problem:
With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with 730 network policies affecting a pod, issuing 7 pod updates would result in over 5k transactions (roughly 730 × 7 ≈ 5,100 mutate operations).
This is a clone of issue OCPBUGS-1556. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-78. The following is the description of the original issue:
—
Copied from an upstream issue: https://github.com/operator-framework/operator-lifecycle-manager/issues/2830
What did you do?
When attempting to reinstall an operator that uses conversion webhooks, the resulting InstallPlan enters a failed state with a message similar to:
error validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found
When the original CSVs are deleted, the operator's main deployment and service are removed, but CRDs are left in-cluster. However, since the service/CA bundle/deployment that serve the conversion webhook are removed, conversion webhooks are broken at that point. Eventually this impacts garbage collection on the cluster as well.
This can be reproduced by installing the DevWorkspace Operator from the Red Hat catalog. (I can provide yamls/upstream images that reproduce as well, if that's helpful). It may be necessary to create a DevWorkspace in the cluster before deletion, e.g. by oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/main/samples/plain.yaml
What did you expect to see?
Operator is able to be reinstalled without removing CRDs and all instances.
What did you see instead? Under which circumstances?
It's necessary to completely remove the operator including CRDs. For our operator (DevWorkspace), this also makes uninstall especially complicated as finalizers are used (so CRDs cannot be deleted if the controller is removed, and the controller cannot be restored by reinstalling)
Environment
operator-lifecycle-manager version: 4.10.24
Kubernetes version information: Kubernetes Version: v1.23.5+012e945 (OpenShift 4.10.24)
Kubernetes cluster kind: OpenShift
This is a clone of issue OCPBUGS-3882. The following is the description of the original issue:
—
This bug is a backport clone of [Bugzilla Bug 2034883](https://bugzilla.redhat.com/show_bug.cgi?id=2034883). The following is the description of the original bug:
—
Description of problem:
Situation (starting point):
Problem:
Version-Release number of MCO (Machine Config Operator) (if applicable):
4.7.21
Platform (AWS, VSphere, Metal, etc.): (not relevant)
Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y
How reproducible:
Always if the said conditions are met.
Steps to Reproduce:
1. Have some nodes not ready
2. Force a change that requires machine-config-daemon daemonset rollout (I think that changing proxy settings would work for this)
3. Wait until a new kube-apiserver-to-kubelet-client-ca is rolled out by kube-apiserver-operator
Actual results:
New kube-apiserver-to-kubelet-client-ca not forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca not deployed on nodes
Expected results:
kube-apiserver-to-kubelet-client-ca forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca deployed to nodes.
Additional info:
In comments
Description of problem:
Intended to backport the corresponding https://bugzilla.redhat.com/show_bug.cgi?id=2095852 which has been fixed already for this version.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4607. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4422. The following is the description of the original issue:
—
This bug is a backport clone of [Bugzilla Bug 2050230](https://bugzilla.redhat.com/show_bug.cgi?id=2050230). The following is the description of the original bug:
—
Description of problem:
In a large cluster, the sdn daemonset can DoS the kube-apiserver with unpaginated LIST calls on high-count resources.
Version-Release number of selected component (if applicable):
How reproducible:
NA
Steps to Reproduce:
NA
Actual results:
Kube API Server and Openshift API Server in one of the cluster keeps restarting, without proper exception. The cluster is not accessible.
Expected results:
Kube API Server and Openshift API Server should be stable.
Additional info:
Description of problem:
Customer-facing issue with Jenkins on OpenShift: openshift4/ose-jenkins-agent-base node images v4.10 are not able to communicate with openshift4/ose-jenkins v4.10.0
Version-Release number of selected component (if applicable):
4.10.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
During agent connection received : Exception in thread "main" java.lang.UnsupportedClassVersionError: hudson/remoting/jnlp/Main has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
Expected results:
OPENSHIFT_JENKINS_JVM_ARCH='', CONTAINER_MEMORY_IN_MB='4096', using /usr/lib/jvm/java-11-openjdk-11.0.19.0.7-1.el8_7.x86_64/bin/java
Downloading http://controller.ocp050014.svc:80/jnlpJars/remoting.jar ...
+ cd
+ exec java -Duser.home=/home/jenkins -Dcom.redhat.fips=false -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -Xmx2048m -cp /home/jenkins/remoting.jar hudson.remoting.jnlp.Main -headless -url http://controller.ocp050014.svc:80/ -tunnel controller-jnlp.ocp050014.svc:50000 XXX ci-automation-tools-j529v
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: ci-automation-tools-j529v
Apr 25, 2023 11:20:42 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3044.vb_940a_a_e4f72e
Apr 25, 2023 11:20:42 AM hudson.remoting.Engine startEngine
WARNING: No Working Directory. Using the legacy JAR Cache location: /home/jenkins/.jenkins/cache/jars
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://controller.ocp050014.svc:80/]
Apr 25, 2023 11:20:42 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Apr 25, 2023 11:20:42 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: controller-jnlp.ocp050014.svc
Agent port: 50000
Identity: 20:1a:e9:55:d5:a6:a1:91:b9:7a:43:de:e0:0b:c9:04
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to controller-jnlp.ocp050014.svc:50000
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Apr 25, 2023 11:20:42 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
INFO: Waiting for ProtocolStack to start.
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: 20:1a:e9:55:d5:a6:a1:91:b9:7a:43:de:e0:0b:c9:04
Apr 25, 2023 11:20:42 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Additional info:
From the customer: for now we've tested the v4.10, v4.11, and v4.12 agent base images with this image. With 4.10 we encounter the Java version issue; 4.11 and 4.12 generate issues with the template compatibility we are using.
Description of problem:
Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.
Customer is performing an OCP 4.10.16 IPI BareMetal install and bootstrap node provisions just fine, but when master nodes are booted for provisioning, they are not getting an ipv4 address via dhcp. As such, the install is not moving forward at this point.
Version-Release number of selected component (if applicable):
OCP 4.10.16
How reproducible:
Perform OCP 4.10.16 IPI BareMetal install.
Actual results:
provisioning interface comes up (as evidenced by ipv6 address) but is not getting an ipv4 address via dhcp. OCP install / provisioning fails at this point.
Expected results:
provisioning interface successfully received an ipv4 ip address and successfully provisioned master nodes (and subsequently worker nodes as well.)
Additional info:
As a troubleshooting measure, manually adding an ipv4 ip address did allow the coreos image on the bootstrap node to be reached via curl.
Further, the kernel boot line for the first master node was updated with a static IP address assignment to confirm that the master node would successfully image this way, further confirming that the issue is the provisioning interface not receiving an IPv4 address from the DHCP server.
This is a clone of issue OCPBUGS-501. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable): 4.10.16
How reproducible: Always
Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field
$ oc get apiserver cluster -o yaml
spec:
  audit:
    customRules:
2. Allow the kube-apiserver pods to roll out a new revision.
3. Once the kube-apiserver pods are in new revision execute $ oc get dc
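For illustration only, an entry of the shape that step 1 refers to (the group and profile values here are hypothetical examples, not the reporter's configuration):
~~~
oc patch apiserver cluster --type=merge -p '
spec:
  audit:
    customRules:
    - group: system:authenticated:oauth
      profile: WriteRequestBodies
'
~~~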
Actual results:
Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)
Expected results: The command "oc get dc" should display the deploymentconfig without any error.
Additional info:
This is a clone of issue OCPBUGS-6907. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6517. The following is the description of the original issue:
—
Description of problem:
When the cluster is configured with a proxy, the swift client in the image registry operator is not using the proxy to authenticate with OpenStack, so it's unable to reach the OpenStack API. This issue became evident since support was recently added to not fall back to Cinder when Swift is available [1].
[1]https://github.com/openshift/cluster-image-registry-operator/pull/819
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a cluster with proxy and restricted installation 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If we use a macvlan with the configuration:
spec:
  config: '{ "cniVersion": "0.3.1", "name": "ran-bh-macvlan-test", "plugins": [ {"type": "macvlan","master": "vlan306", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "2001:1b74:480:603d:0304:0403:000:0000-2001:1b74:480:603d:0304:0403:0000:0004/64","gateway": "2001:1b74:480:603d::1" } } ]}'
there is an error creating the pod:
Warning FailedCreatePodSandBox 17s (x3 over 55s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test31_test-ecoloma-01_a593bd0a-83e7-4d31-857e-0c31491e849e_0(5cf36bd99ffa532fd34735e68caecfbc69d820ba6cb04e348c9f9f168498022f): error adding pod test-ecoloma-01_test31 to CNI network "multus-cni-network": [test-ecoloma-01/test31:ran-bh-macvlan-test]: error adding container to network "ran-bh-macvlan-test": Error at storage engine: OverlappingRangeIPReservation.whereabouts.cni.cncf.io "2001-1b74-480-603d-304-403--" is invalid: metadata.name: Invalid value: "2001-1b74-480-603d-304-403--": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
If we change the start IP address to 2001:1b74:480:603d:0304:0403:000:0001, it works ok.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always reproducible
Steps to Reproduce:
1. See description of problem.
Actual results:
Unable to create pod
Expected results:
IP range should be valid and pod should get created
Additional info:
Description of problem:
Insights Operator keeps restarting on OpenShift 4.10.18 in a disconnected environment
Version-Release number of selected component (if applicable):
Openshift Container Platform 4.10.18
How reproducible:
Install Openshift 4.10.18 in disconnected environment
Steps to Reproduce:
1. Download the global cluster pull secret to your local file system.
2. In a text editor, edit the .dockerconfigjson file that was downloaded.
3. Remove the cloud.openshift.com JSON entry.
4. Observe that the Insights Operator keeps restarting.
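Roughly, the steps above translate to the following commands (a sketch; file names are illustrative):
~~~
oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > pull-secret.json
# edit pull-secret.json and remove the "cloud.openshift.com" auth entry, then re-apply it:
oc set data secret/pull-secret -n openshift-config --from-file=.dockerconfigjson=pull-secret.json
# watch the operator:
oc get pods -n openshift-insights -w
~~~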
Actual results:
Insights Operator keeps restarting
Expected results:
Insights Operator should be stable
Additional info:
Disconnected Environment
This bug is a backport clone of [Bugzilla Bug 2056519](https://bugzilla.redhat.com/show_bug.cgi?id=2056519). The following is the description of the original bug:
—
Version: 4.9
Platform:
azure
Please specify:
What happened?
Issue: Customer reports being unable to install an IPI PRIVATE OpenShift cluster in Azure. They have an organization policy that does not allow them to create storage accounts with public access; it should be disallowed.
What did you expect to happen?
Installer completes successfully.
Description of problem: Issue described in following issue: https://github.com/openshift/multus-admission-controller/issues/40
Fixed in: https://github.com/openshift/cluster-network-operator/pull/1515
Version-Release number of selected component (if applicable): OCP 4.10
Official Red Hat tracker. The fix has already been merged.
This is a clone of issue OCPBUGS-262. The following is the description of the original issue:
—
GitHub rate limit failures for the UPI image when downloading govc.
Currently, Telemeter is not equipped with a configurable request limit for the receive endpoint (for full context see: https://github.com/openshift/cluster-monitoring-operator/pull/1416). It is using the default limit defined in the code base; however, it seems this limit might not be suitable for our usage.
As a part of this ticket, it should be:
1) Understood what is the appropriate limit for request size for our use cases
2) Make the limit configurable in Telemeter via a flag
3) Deploy the changes, initially to the staging environment, to enable our team to test it.
This is a clone of issue OCPBUGS-7800. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-266. The following is the description of the original issue:
—
Description of problem: I am working with a customer who uses the web console. From the Developer Perspective's Project Access tab, they cannot differentiate between users and groups and furthermore cannot add groups from this web console. This has led to confusion whether existing resources were in fact users or groups, and furthermore they have added users when they intended to add groups instead. What we really need is a third column in the Project Access tab that says whether a resource is a user or group.
Version-Release number of selected component (if applicable): This is an issue in OCP 4.10 and 4.11, and I presume future versions as well
How reproducible: Every time. My customer is running on ROSA, but I have determined this issue to be general to OpenShift.
Steps to Reproduce:
From the oc cli, I create a group and add a user to it.
$ oc adm groups new techlead
group.user.openshift.io/techlead created
$ oc adm groups add-users techlead admin
group.user.openshift.io/techlead added: "admin"
$ oc get groups
NAME USERS
cluster-admins
dedicated-admins admin
techlead admin
I create a new namespace so that I can assign a group project level access:
$ oc new-project my-namespace
$ oc adm policy add-role-to-group edit techlead -n my-namespace
I then went to the web console -> Developer perspective -> Project -> Project Access. I verified the rolebinding named 'edit' is bound to a group named 'techlead'.
$ oc get rolebinding
NAME ROLE AGE
admin ClusterRole/admin 15m
admin-dedicated-admins ClusterRole/admin 15m
admin-system:serviceaccounts:dedicated-admin ClusterRole/admin 15m
dedicated-admins-project-dedicated-admins ClusterRole/dedicated-admins-project 15m
dedicated-admins-project-system:serviceaccounts:dedicated-admin ClusterRole/dedicated-admins-project 15m
edit ClusterRole/edit 2m18s
system:deployers ClusterRole/system:deployer 15m
system:image-builders ClusterRole/system:image-builder 15m
system:image-pullers ClusterRole/system:image-puller 15m
$ oc get rolebinding edit -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
creationTimestamp: "2022-08-15T14:16:56Z"
name: edit
namespace: my-namespace
resourceVersion: "108357"
uid: 4abca27d-08e8-43a3-b9d3-d20d5c294bbe
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: edit
subjects:
Now back to the CLI, I view the newly created rolebinding named 'developer-view-c15b720facbc8deb', and find that the "View" role is assigned to a user named 'developer', rather than a group.
$ oc get rolebinding
NAME ROLE AGE
admin ClusterRole/admin 17m
admin-dedicated-admins ClusterRole/admin 17m
admin-system:serviceaccounts:dedicated-admin ClusterRole/admin 17m
dedicated-admins-project-dedicated-admins ClusterRole/dedicated-admins-project 17m
dedicated-admins-project-system:serviceaccounts:dedicated-admin ClusterRole/dedicated-admins-project 17m
edit ClusterRole/edit 4m25s
developer-view-c15b720facbc8deb ClusterRole/view 90s
system:deployers ClusterRole/system:deployer 17m
system:image-builders ClusterRole/system:image-builder 17m
system:image-pullers ClusterRole/system:image-puller 17m
[10:21:21] kechung:~ $ oc get rolebinding developer-view-c15b720facbc8deb -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:19:51Z"
  name: developer-view-c15b720facbc8deb
  namespace: my-namespace
  resourceVersion: "113298"
  uid: cc2d1b37-922b-4e9b-8e96-bf5e1fa77779
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
So in conclusion, from the Project Access tab, we're unable to add groups and unable to differentiate between users and groups. This is in essence our ask for this RFE.
Actual results:
Developer perspective -> Project -> Project Access tab shows a list of resources which can be users or groups, but does not differentiate between them. Furthermore, when we add resources, they are only users and there is no way to add a group from this tab in the web console.
Expected results:
Should have the ability to add groups and differentiate between users and groups. Ideally, we're looking at a third column for user or group.
Additional info:
This is a clone of issue OCPBUGS-1890. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-1765. The following is the description of the original issue:
—
Description of problem:
If a customer creates a machine with a networks section like this:
networks:
- filter: {}
  noAllowedAddressPairs: false
  subnets:
  - filter: {}
    uuid: primary-subnet-uuid
- filter: {}
  noAllowedAddressPairs: true
  subnets:
  - filter: {}
    uuid: other-subnet-uuid
primarySubnet: primary-subnet-uuid
then all the ports are created without the allowed address pairs.
Doing some research in the source code, I have found that:
- For each entry in the networks: section, networks are filtered as per its filter: section [1].
- Then, if the subnets: section of the network entry is not empty, for each of the network IDs found above [2], two things are done that are relevant for this situation:
  - The net ID is saved in netsWithoutAllowedAddressPairs [3]. That map is later checked while creating any port [4].
  - For each subnet entry that matches the network ID, a port is created [5].
So, the problematic behavior happens due to the following:
- Both entries in the networks array have empty filters. This means that both entries selected all the neutron networks.
- This configuration results in one port per subnet as expected because, in the later traversal of the subnets array of each entry [5], it is filtering by subnet and creating a single port as expected.
- However, the entry with "noAllowedAddressPairs: true" is selecting all the neutron networks, so it adds all of them to the netsWithoutAllowedAddressPairs map [3], regardless of the subnets filtering.
- As all the networks are in the noAllowedAddressPairs map, all the ports created for the VM have their allowed address pairs removed [4].
Why do we consider this behavior undesired? I understand that, if we create a port for a network that has no allowed pairs, we create all the other ports in the same network without the pairs. However, it is surprising that a port in a network has its allowed address pairs removed due to a setting in an entry that yielded no port on that network. In other words, one would expect that the same subnet filtering that happens on each network entry with regard to yielding ports for the VM would also apply to the noAllowedAddressPairs parameter.
Version-Release number of selected component (if applicable):
4.10.30
How reproducible:
Always
Steps to Reproduce:
1. Create a machineset like in the description 2. 3.
Actual results:
All ports have no address pairs
Expected results:
Only the port on the secondary subnet has no address pairs.
Additional info:
A simple workaround would be to just fill in the filter so that a single network is selected for each network entry (see the sketch below).
References:
[1] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L576
[2] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L580
[3] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L581-L583
[4] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L658-L660
[5] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L610-L625
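A hedged sketch of that workaround, with each entry's filter narrowed to a single network (the network names and the exact filter field are assumptions for illustration, not the customer's values):
~~~
networks:
- filter:
    name: primary-network        # hypothetical: selects exactly one neutron network
  noAllowedAddressPairs: false
  subnets:
  - filter: {}
    uuid: primary-subnet-uuid
- filter:
    name: secondary-network      # hypothetical: only this network lands in netsWithoutAllowedAddressPairs
  noAllowedAddressPairs: true
  subnets:
  - filter: {}
    uuid: other-subnet-uuid
~~~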
Description of problem:
In the WebUI for the Virtualization Overview, the details of "Service name", "Provider", and "Update Channel" have no value displayed. The "OpenShift Virtualization version" is showing "Cannot update CatalogSource not found".
Version-Release number of selected component (if applicable):
v4.10.4
How reproducible:
All 3 environments that have recently been deployed show the same thing. 100%
Steps to Reproduce:
1. Install the OpenShift Virtualization operator from the WebUI
2. Use the suggested options
3.
Actual results:
The Details card is showing a warning that it cannot update
Expected results:
The Details card should have all values provided.
Additional info:
This bug is a backport clone of [Bugzilla Bug 2117324](https://bugzilla.redhat.com/show_bug.cgi?id=2117324). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2101357 +++
Description of problem:
message: "her.go:105 +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
+0x130\n\ngoroutine 5545 [select, 7 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc00240a780,
0xc003321a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0xc002efea80?,
0xc0009cc7a0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5678 [select, 3 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc000b70480,
0xc0035d4500)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc002999e90?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5836 [select, 1 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc003b00180,
0xc003ff8a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc003a1c8d0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5905 [chan receive, 1 minutes]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel.func1()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:51
+0x85\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:50
+0x231\n"
reason: Error
startedAt: "2022-06-27T00:00:59Z"
Version-Release number of selected component (if applicable):
mac:~ jianzhang$ oc exec catalog-operator-66cb8fd8c5-j7vkx -- olm --version
OLM version: 0.19.0
git commit: 8c2bd46147a90d58e98de73d34fd79477769f11f
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-25-081133 True False 10h Cluster version is 4.11.0-0.nightly-2022-06-25-081133
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.11
2. Check OLM pods
Actual results:
mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-66cb8fd8c5-j7vkx 1/1 Running 2 (8h ago) 10h
collect-profiles-27605340-wgsvf 0/1 Completed 0 42m
collect-profiles-27605355-ffgxd 0/1 Completed 0 27m
collect-profiles-27605370-w7ds7 0/1 Completed 0 12m
olm-operator-6cfd444b8f-r5q4t 1/1 Running 0 10h
package-server-manager-66589d4bf8-csr7j 1/1 Running 0 10h
packageserver-59977db6cf-nkn5w 1/1 Running 0 10h
packageserver-59977db6cf-nxbnx 1/1 Running 0 10h
mac:~ jianzhang$ oc get pods catalog-operator-66cb8fd8c5-j7vkx -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.26"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.26"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-26T23:12:45Z"
generateName: catalog-operator-66cb8fd8c5-
labels:
app: catalog-operator
pod-template-hash: 66cb8fd8c5
name: catalog-operator-66cb8fd8c5-j7vkx
namespace: openshift-operator-lifecycle-manager
ownerReferences:
Expected results:
catalog-operator works well.
Additional info:
Operators can be subscribed successfully.
mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
jian learn learn qe-app-registry beta
openshift-logging cluster-logging cluster-logging qe-app-registry stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator qe-app-registry stable
mac:~ jianzhang$
mac:~ jianzhang$ oc get pods -n jian
NAME READY STATUS RESTARTS AGE
552b4660850a7fe1e1f142091eb5e4305f18af151727c56f70aa5dffc1dg8cg 0/1 Completed 0 54m
learn-operator-666b687bfb-7qppm 1/1 Running 0 54m
qe-app-registry-hbzxg 1/1 Running 0 58m
mac:~ jianzhang$ oc get csv -n jian
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.v5.5.0 OpenShift Elasticsearch Operator 5.5.0 Succeeded
learn-operator.v0.0.3 Learn Operator 0.0.3 learn-operator.v0.0.2 Succeeded
— Additional comment from jiazha@redhat.com on 2022-06-27 09:58:18 UTC —
Created attachment 1892927
olm must-gather
— Additional comment from jiazha@redhat.com on 2022-06-27 09:59:01 UTC —
Created attachment 1892928
marketplace project must-gather
— Additional comment from jiazha@redhat.com on 2022-06-28 02:05:39 UTC —
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-25-132614 True False 145m Cluster version is 4.11.0-0.nightly-2022-06-25-132614
mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-869fb4bd4d-lbhgj 1/1 Running 3 (9m25s ago) 170m
collect-profiles-27606330-4wg5r 0/1 Completed 0 33m
collect-profiles-27606345-lmk4q 0/1 Completed 0 18m
collect-profiles-27606360-mksv6 0/1 Completed 0 3m17s
olm-operator-5f485d9d5f-wczjc 1/1 Running 0 170m
package-server-manager-6cf996b4cc-79lrw 1/1 Running 2 (156m ago) 170m
packageserver-5f668f98d7-2vjdn 1/1 Running 0 165m
packageserver-5f668f98d7-mb2wc 1/1 Running 0 165m
mac:~ jianzhang$ oc get pods catalog-operator-869fb4bd4d-lbhgj -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.34"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.34"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-27T23:13:12Z"
generateName: catalog-operator-869fb4bd4d-
labels:
app: catalog-operator
pod-template-hash: 869fb4bd4d
name: catalog-operator-869fb4bd4d-lbhgj
namespace: openshift-operator-lifecycle-manager
ownerReferences:
)\n\t/build/vendor/golang.org/x/net/http2/pipe.go:76 +0xeb\ngolang.org/x/net/http2.transportResponseBody.Read(
{0x0?},
)\n\t/build/vendor/golang.org/x/net/http2/transport.go:2407
+0x85\nencoding/json.(*Decoder).refill(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:165
+0x17f\nencoding/json.(*Decoder).readValue(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:140
+0xbb\nencoding/json.(*Decoder).Decode(0xc002fc0640,
)\n\t/usr/lib/golang/src/encoding/json/stream.go:63
+0x78\nk8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc003127770,
)\n\t/build/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152
+0x19c\nk8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc003502aa0,
0xc001f9bf10?,
)\n\t/build/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77
+0xa7\nk8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00059f700)\n\t/build/vendor/k8s.io/client-go/rest/watch/decoder.go:49
+0x4f\nk8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0044dcd40)\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105
+0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
+0x130\n"
reason: Error
startedAt: "2022-06-28T01:06:59Z"
name: catalog-operator
ready: true
restartCount: 3
started: true
state:
running:
startedAt: "2022-06-28T01:53:53Z"
hostIP: 10.0.190.130
phase: Running
podIP: 10.130.0.34
podIPs:
— Additional comment from jiazha@redhat.com on 2022-06-28 02:09:23 UTC —
mac:~ jianzhang$ oc get pods package-server-manager-6cf996b4cc-79lrw -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.13"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.13"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-27T23:13:10Z"
generateName: package-server-manager-6cf996b4cc-
labels:
app: package-server-manager
pod-template-hash: 6cf996b4cc
name: package-server-manager-6cf996b4cc-79lrw
namespace: openshift-operator-lifecycle-manager
ownerReferences:
— Additional comment from jiazha@redhat.com on 2022-06-28 02:10:02 UTC —
preemptionPolicy: PreemptLowerPriority
priority: 2000000000
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
runAsNonRoot: true
runAsUser: 65534
seLinuxOptions:
level: s0:c20,c0
seccompProfile:
type: RuntimeDefault
serviceAccount: olm-operator-serviceaccount
serviceAccountName: olm-operator-serviceaccount
terminationGracePeriodSeconds: 30
tolerations:
\nsigs.k8s.io/controller-runtime/pkg/cluster.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:322\nmain.run\n\t/build/cmd/package-server-manager/main.go:67\ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\n1.6563723963631017e+09\tERROR\tsetup\tfailed
to setup manager instance\t
\ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\nError:
Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
connect: connection refused\nencountered an error while executing the binary:
Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
connect: connection refused\n"
reason: Error
startedAt: "2022-06-27T23:26:11Z"
name: package-server-manager
ready: true
restartCount: 2
started: true
state:
running:
startedAt: "2022-06-27T23:26:54Z"
hostIP: 10.0.190.130
phase: Running
podIP: 10.130.0.13
podIPs:
— Additional comment from jiazha@redhat.com on 2022-06-29 08:43:51 UTC —
Observed the error restarts:
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-28-160049 True False 5h57m Cluster version is 4.11.0-0.nightly-2022-06-28-160049
mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-7b88dddfbc-rsfhz 1/1 Running 6 (26m ago) 5h51m
collect-profiles-27608160-6m7r6 0/1 Completed 0 37m
collect-profiles-27608175-94n56 0/1 Completed 0 22m
collect-profiles-27608190-nbzcf 0/1 Completed 0 7m55s
olm-operator-5977ffb855-lgfn8 1/1 Running 0 9h
package-server-manager-75db6dcfc-hql4v 1/1 Running 0 9h
packageserver-5955fb79cd-9n56n 1/1 Running 0 9h
packageserver-5955fb79cd-xf6f6 1/1 Running 0 9h
mac:~ jianzhang$ oc get pods catalog-operator-7b88dddfbc-rsfhz -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.121"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.121"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-29T02:46:23Z"
generateName: catalog-operator-7b88dddfbc-
labels:
app: catalog-operator
pod-template-hash: 7b88dddfbc
name: catalog-operator-7b88dddfbc-rsfhz
namespace: openshift-operator-lifecycle-manager
ownerReferences:
, 0xc000cee360)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287
+0x57c fp=0xc003c11f70 sp=0xc003c11648 pc=0x1a3ca7c\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x10000c0008fd6e0?,
, 0xc0004837b8?)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231
+0x45 fp=0xc003c11fb0 sp=0xc003c11f70 pc=0x1a3c4a5\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
+0x32 fp=0xc003c11fe0 sp=0xc003c11fb0 pc=0x1a3c152\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571
+0x1 fp=0xc003c11fe8 sp=0xc003c11fe0 pc=0x4719c1\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
+0x557\n"
reason: Error
startedAt: "2022-06-29T07:56:16Z"
name: catalog-operator
ready: true
restartCount: 6
started: true
state:
running:
startedAt: "2022-06-29T08:11:55Z"
hostIP: 10.0.130.83
phase: Running
podIP: 10.130.0.121
podIPs:
— Additional comment from jiazha@redhat.com on 2022-07-04 03:50:38 UTC —
Please ignore comments 4 and 5; they have nothing to do with this issue.
— Additional comment from jiazha@redhat.com on 2022-07-04 06:57:24 UTC —
Check the `previous` log.
mac:~ jianzhang$ oc logs catalog-operator-f8ddcb57b-j5rf2 --previous
time="2022-07-03T23:49:00Z" level=info msg="log level info"
...
...
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
fatal error: concurrent map writes
fatal error: concurrent map writes
goroutine 559 [running]:
runtime.throw(
)
/usr/lib/golang/src/runtime/panic.go:992 +0x71 fp=0xc001f9c508 sp=0xc001f9c4d8 pc=0x43e9f1
runtime.mapassign_faststr(0x1d09880, 0xc0031847b0,
)
/usr/lib/golang/src/runtime/map_faststr.go:295 +0x38b fp=0xc001f9c570 sp=0xc001f9c508 pc=0x419b4b
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.Pod(0xc001f4a900,
,
{0xc00132ccc0, 0x38},
{0xc003582d50, 0x13}, 0xc00452c1e0, 0xc0031847b0, 0x5, ...))
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:125 +0xf9 fp=0xc001f9cc30 sp=0xc001f9cbb0 pc=0x1a42c99
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).currentPodsWithCorrectImageAndSpec(0xc001f9ce68?,
,
{0xc003582d50, 0x13})
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:190 +0x198 fp=0xc001f9ce48 sp=0xc001f9cc30 pc=0x1a437b8
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).CheckRegistryServer(0xc000bcbf80?, 0x493b77?)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:453 +0x4c fp=0xc001f9ce88 sp=0xc001f9ce48 pc=0x1a45fcc
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).healthy(0x38ca8453?, 0xc001f4a900)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:196 +0x7e fp=0xc001f9ced0 sp=0xc001f9ce88 pc=0x1a4ae1e
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).health(0x1bc37c0?, 0xc003e7e7e0, 0x8?)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:159 +0x2a fp=0xc001f9cf10 sp=0xc001f9ced0 pc=0x1a4ac8a
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).catalogHealth(0xc000a59a90,
)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:137 +0x387 fp=0xc001f9d040 sp=0xc001f9cf10 pc=0x1a4a827
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).Reconcile(0xc000a59a90,
,
{0x7f9f6e5b3328?, 0xc0050f6490?})
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128 +0xc3 fp=0xc001f9d180 sp=0xc001f9d118 pc=0x1a36603
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc0004dfd50,
, 0xc000954720)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287 +0x57c fp=0xc001f9df70 sp=0xc001f9d648 pc=0x1a3ca7c
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x0?,
, 0x0?)
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231 +0x45 fp=0xc001f9dfb0 sp=0xc001f9df70 pc=0x1a3c4a5
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x32 fp=0xc001f9dfe0 sp=0xc001f9dfb0 pc=0x1a3c152
runtime.goexit()
/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc001f9dfe8 sp=0xc001f9dfe0 pc=0x4719c1
created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start
/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x557
Seems like it failed at: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227
— Additional comment from agreene@redhat.com on 2022-07-05 16:01:22 UTC —
As Jian pointed out, the catalog operator is failing due to a concurrent write at https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227.
This is happening because:
Line 227 in the reconciler.go directly mutates the catalogSource's annotations. The grpcCatalogSourceDecorator's Annotations function should be returning a copy of the annotations or it should be created with a deepcopy of the catalogSource to avoid mutating an object in the lister cache.
This doesn't seem to be a blocker, but we should get a fix in swiftly.
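A minimal sketch of the copy-on-read direction described above (copyAnnotations is an illustrative helper, not the actual OLM code; the real decorator type and signatures may differ):

package reconciler

// copyAnnotations returns a fresh map, so callers can add pod-template
// annotations without mutating the annotations of a CatalogSource that
// lives in the shared lister cache (the concurrent map write seen above).
func copyAnnotations(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

Alternatively, constructing the decorator from catalogSource.DeepCopy() gives the same isolation from the lister cache.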
— Additional comment from jiazha@redhat.com on 2022-07-13 05:02:04 UTC —
1, Create a cluster with the fixed PR via the Cluster-bot.
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest True False 126m Cluster version is 4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest
2, Subscribe some operators.
mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
default etcd etcd community-operators singlenamespace-alpha
openshift-logging cluster-logging cluster-logging redhat-operators stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator redhat-operators stable
mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
default etcd etcd community-operators singlenamespace-alpha
openshift-logging cluster-logging cluster-logging redhat-operators stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator redhat-operators stable
mac:~ jianzhang$
mac:~ jianzhang$
mac:~ jianzhang$ oc get csv -n openshift-operators-redhat
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
mac:~ jianzhang$ oc get csv -n openshift-logging
NAME DISPLAY VERSION REPLACES PHASE
cluster-logging.5.4.2 Red Hat OpenShift Logging 5.4.2 Succeeded
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
mac:~ jianzhang$ oc get csv -n default
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
etcdoperator.v0.9.4 etcd 0.9.4 etcdoperator.v0.9.2 Succeeded
3, Check OLM catalog-operator pods status.
mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-546db7cdf5-7pldg 1/1 Running 0 145m
collect-profiles-27628110-lr2nv 0/1 Completed 0 30m
collect-profiles-27628125-br8b8 0/1 Completed 0 15m
collect-profiles-27628140-m64gp 0/1 Completed 0 38s
olm-operator-754d7f6f56-26qhw 1/1 Running 0 145m
package-server-manager-77d5cbf696-v9w4p 1/1 Running 0 145m
packageserver-6884994d98-2smtw 1/1 Running 0 143m
packageserver-6884994d98-5d7jg 1/1 Running 0 143m
mac:~ jianzhang$ oc logs catalog-operator-546db7cdf5-7pldg --previous
Error from server (BadRequest): previous terminated container "catalog-operator" in pod "catalog-operator-546db7cdf5-7pldg" not found
No terminated container. catalog-operator works well. Verify it.
— Additional comment from aos-team-art-private@redhat.com on 2022-07-13 22:50:04 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.
— Additional comment from jiazha@redhat.com on 2022-07-18 07:23:08 UTC —
Changed the status to VERIFIED based on comment 10.
Acceptance criteria:
This is a clone of issue OCPBUGS-8205. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7960. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7780. The following is the description of the original issue:
—
Description of problem:
4.9 and 4.10 oc calls to oc adm upgrade channel ... for 4.11+ clusters would clear spec.capabilities. Not all that many clusters try to restrict capabilities, but folks will need to bump their channel at least every other minor (if they're using EUS channels), and while we recommend folks use an oc from the 4.y they're heading towards, we don't have anything in place to enforce that.
Version-Release number of selected component (if applicable):
4.9 and 4.10 oc are exposed to this via the new-in-4.11 spec.capabilities. A newer oc could theoretically be exposed in the same way by any newly added ClusterVersion spec properties.
How reproducible:
100%
Steps to Reproduce:
1. Install a 4.11+ cluster with None capabilities.
2. Set the channel with a 4.10.51 oc, like oc adm upgrade channel fast-4.11.
3. Check the capabilities with oc get -o json clusterversion version | jq -c .spec.capabilities.
Actual results:
null
Expected results:
{"baselineCapabilitySet":"None"}
This is a clone of issue OCPBUGS-10622. The following is the description of the original issue:
—
Description of problem:
Unit test failing:
=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
--- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648
Actual results:
unit tests fail
Expected results:
TestNewAppRunAll unit test should pass
Additional info:
Description of problem:
Jenkins and Plugin versions need to be updated to mitigate pending CVEs
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
+++ This bug was initially created as a clone of Bug #2070318 +++
Description of problem:
In an OCP VRRP deployment (using OCP cluster networking), we have an additional data interface which is configured along with the regular management interface on each control node. In some deployments, the kubernetes address 172.30.0.1:443 is NAT'ed to the data interface instead of the management interface (10.40.1.4:6443 vs 10.30.1.4:6443, as we configure the bootstrap node) even though the default route is set to the 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 failed. After 10-15 minutes, OCP magically fixes it and NATs correctly to 10.30.1.4:6443.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Provision an OCP cluster using cluster networking for DNS & Load Balancer instead of an external DNS & Load Balancer. Provision the host with 1 management interface and an additional interface for the data network. Along with the OCP manifests, add a manifest to create a pod which will trigger communication with kube-apiserver.
2. Start cluster installation.
3. Check the custom pod log in the cluster while the first 2 master nodes are installing to see the GET operation to kube-apiserver time out. Check the nft table and chase the IP chains to see that the data IP address was NAT'ed to the kubernetes service IP address instead of the management IP. This does not happen all the time; we have seen roughly a 50:50 chance.
Actual results:
After 10-15 minutes OCP will correct that by itself.
Expected results:
Wrong natting should not happen.
Additional info:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
ClusterVersion: Stable at "4.8.29"
ClusterOperators:
clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/baremetal is degraded because metal3 deployment inaccessible
clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
— Additional comment from bnemec@redhat.com on 2022-03-30 20:00:25 UTC —
This is not managed by runtimecfg, but in order to route the bug correctly I need to know which CNI plugin you're using - OpenShiftSDN or OVNKubernetes. Thanks.
— Additional comment from lpbinh@gmail.com on 2022-03-31 08:09:11 UTC —
Hi Ben,
We were deploying Contrail CNI with OCP. However, this issue happens at very early deployment time, right after the bootstrap node is started and there's no SDN/CNI there yet.
— Additional comment from bnemec@redhat.com on 2022-03-31 15:26:23 UTC —
Okay, I'm just going to send this to the SDN team then. They'll be able to provide more useful input than I can.
— Additional comment from trozet@redhat.com on 2022-04-04 15:22:21 UTC —
Can you please provide the iptables rules causing the DNAT as well as the routes on the host? Might be easiest to get a sosreport during initial bring up during that 10-15 min when the problem occurs.
— Additional comment from lpbinh@gmail.com on 2022-04-05 16:45:13 UTC —
All nodes have two interfaces:
eth0: 10.30.1.0/24
eth1: 10.40.1.0/24
machineNetwork is 10.30.1.0/24
default route points to 10.30.1.1
The kubeapi service ip is 172.30.0.1:443
all Kubernetes services are supposed to be reachable via machineNetwork (10.30.1.0/24)
To make the kubeapi service ip reachable in hostnetwork, something (openshift installer?) creates a set of nat rules which translates the service ip to the real ip of the nodes which have kubeapi active.
Initially kubeapi is only active on the bootstrap node so there should be a nat rule like
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)
However, what we see is
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)
The rule is configured on the controller nodes and leads to asymmetrical routing, as the controller sends a packet FROM the machineNetwork (10.30.1.x) to 172.30.0.1, which is then translated and forwarded to 10.40.1.10, which then tries to reply back on the 10.40.1.0 network; this fails because the request came from the 10.30.1.0 network.
So, we want to understand why openshift installer picks the 10.40.1.x ip address rather than the 10.30.1.x ip for the nat rule. What's the mechanism for getting the ip in case the system has multiple interfaces with ips configured.
Note: after a while (10-20 minutes) the bootstrap process resets itself and then it picks the correct ip address from the machineNetwork and things start to work.
— Additional comment from smerrow@redhat.com on 2022-04-13 13:55:04 UTC —
Note from Juniper regarding requested SOS report:
In reference to https://bugzilla.redhat.com/show_bug.cgi?id=2070318 that @Binh Le has been working on. The must-gather was too big to upload for this Bugzilla. Can you access this link?
https://junipernetworks-my.sharepoint.com/:u:/g/personal/sleigon_juniper_net/ETOrHMqao1tLm10Gmq9rzikB09H5OUwQWZRAuiOvx1nZpQ
— Additional comment from smerrow@redhat.com on 2022-04-21 12:24:33 UTC —
Can we please get an update on this BZ?
Do let us know if there is any other information needed.
— Additional comment from trozet@redhat.com on 2022-04-21 14:06:00 UTC —
Can you please provide another link to the sosreport? Looks like the link is dead.
— Additional comment from smerrow@redhat.com on 2022-04-21 19:01:39 UTC —
See mustgather here:
https://drive.google.com/file/d/16y9IfLAs7rtO-SMphbYBPgSbR4od5hcQ
— Additional comment from trozet@redhat.com on 2022-04-21 20:57:24 UTC —
Looking at the must-gather I think your iptables rules are most likely coming from the fact that kube-proxy is installed:
[trozet@fedora must-gather.local.288458111102725709]$ omg get pods -n openshift-kube-proxy
NAME READY STATUS RESTARTS AGE
openshift-kube-proxy-kmm2p 2/2 Running 0 19h
openshift-kube-proxy-m2dz7 2/2 Running 0 16h
openshift-kube-proxy-s9p9g 2/2 Running 1 19h
openshift-kube-proxy-skrcv 2/2 Running 0 19h
openshift-kube-proxy-z4kjj 2/2 Running 0 19h
I'm not sure why this is installed. Is it intentional? I don't see the configuration in CNO to enable kube-proxy. Anyway the node IP detection is done via:
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/cmd/kube-proxy/app/server.go#L844
Which just looks at the IP of the node. During bare metal install a VIP is chosen and used with keepalived for kubelet to have kapi access. I don't think there is any NAT rule for services until CNO comes up. So I suspect what really is happening is your node IP is changing during install, and kube-proxy is getting deployed (either intentionally or unintentionally) and that is causing the behavior you see. The node IP is chosen via the node ip configuration service:
https://github.com/openshift/machine-config-operator/blob/da6494c26c643826f44fbc005f26e0dfd10513ae/templates/common/_base/units/nodeip-configuration.service.yaml
This service will determine the node IP via which interfaces have a default route and which one has the lowest metric. With your 2 interfaces, do they both have default routes? If so, are they using DHCP, and is it perhaps random which route gets installed with a lower metric?
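A rough sketch of that selection rule (an illustration only, not the actual nodeip-configuration implementation):

// With two DHCP-managed default routes whose metrics can flip between
// boots, the chosen node IP can flip too, matching the symptom above.
package main

import "fmt"

type defaultRoute struct {
	Interface string
	Gateway   string
	Metric    int
}

// pickNodeInterface returns the interface behind the lowest-metric default route.
func pickNodeInterface(routes []defaultRoute) (string, bool) {
	best := -1
	for i, r := range routes {
		if best == -1 || r.Metric < routes[best].Metric {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return routes[best].Interface, true
}

func main() {
	routes := []defaultRoute{
		{Interface: "eth0", Gateway: "10.30.1.1", Metric: 100},
		{Interface: "eth1", Gateway: "10.40.1.1", Metric: 101},
	}
	iface, _ := pickNodeInterface(routes)
	fmt.Println("node IP interface:", iface) // eth0
}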
— Additional comment from trozet@redhat.com on 2022-04-21 21:13:15 UTC —
Correction: looks like standalone kube-proxy is installed by default when the provider is not SDN, OVN, or kuryr so this looks like the correct default behavior for kube-proxy to be deployed.
— Additional comment from lpbinh@gmail.com on 2022-04-25 04:05:14 UTC —
Hi Tim,
You are right, kube-proxy is deployed by default and we don't change that behavior.
There is only 1 default route, configured for the management interface (10.30.1.x); we used to have a default route for the data/VRRP interface (10.40.1.x) with a higher metric before. As said, we don't have the default route for the second interface anymore but still encounter the issue pretty often.
— Additional comment from trozet@redhat.com on 2022-04-25 14:24:05 UTC —
Binh, can you please provide a sosreport for one of the nodes that shows this behavior? Then we can try to figure out what is going on with the interfaces and the node ip service. Thanks.
— Additional comment from trozet@redhat.com on 2022-04-25 16:12:04 UTC —
Actually Ben reminded me that the invalid endpoint is actually the boostrap node itself:
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)
vs
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)
So maybe a sosreport off that node is necessary? I'm not as familiar with the bare metal install process, moving back to Ben.
— Additional comment from lpbinh@gmail.com on 2022-04-26 08:33:45 UTC —
Created attachment 1875023 [details]sosreport
— Additional comment from lpbinh@gmail.com on 2022-04-26 08:34:59 UTC —
Created attachment 1875024 [details]sosreport-part2
Hi Tim,
We observe this issue when deploying clusters using OpenStack instances as our infrastructure is based on OpenStack.
I followed the steps here to collect the sosreport:
https://docs.openshift.com/container-platform/4.8/support/gathering-cluster-data.html
Got the sosreport, which is 22MB and exceeds the permitted size (19MB), so I split it into 2 files (xaa and xab); if you can't join them, we will need to put the collected sosreport on a shared drive like we did with the must-gather data.
Here are some notes about the cluster:
The first two control nodes are below; ocp-binhle-8dvald-ctrl-3 is the bootstrap node.
[core@ocp-binhle-8dvald-ctrl-2 ~]$ oc get node
NAME STATUS ROLES AGE VERSION
ocp-binhle-8dvald-ctrl-1 Ready master 14m v1.21.8+ed4d8fd
ocp-binhle-8dvald-ctrl-2 Ready master 22m v1.21.8+ed4d8fd
We see the behavior that wrong nat'ing was done at the beginning, then corrected later:
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 3 bytes 180 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
sh-4.4#
sh-4.4#
<....after a while....>
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-X33IBTDFOZRR6ONM
table ip nat {
chain KUBE-SEP-X33IBTDFOZRR6ONM
}
sh-4.4#
— Additional comment from lpbinh@gmail.com on 2022-05-12 17:46:51 UTC —
@trozet@redhat.com May we have an update on the fix, or the plan for the fix? Thank you.
— Additional comment from lpbinh@gmail.com on 2022-05-18 21:27:45 UTC —
Created support Case 03223143.
— Additional comment from vkochuku@redhat.com on 2022-05-31 16:09:47 UTC —
Hello Team,
Any update on this?
Thanks,
Vinu K
— Additional comment from smerrow@redhat.com on 2022-05-31 17:28:54 UTC —
This issue is causing delays in Juniper's CI/CD pipeline and makes for a less than ideal user experience for deployments.
I'm getting a lot of pressure from the partner on this for an update and progress. I've had them open a case [1] to help progress.
Please let us know if there is any other data needed by Juniper or if there is anything I can do to help move this forward.
[1] https://access.redhat.com/support/cases/#/case/03223143
— Additional comment from vpickard@redhat.com on 2022-06-02 22:14:23 UTC —
@bnemec@redhat.com Tim mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2070318#c14 that this issue appears to be at BM install time. Is this something you can help with, or do we need help from the BM install team?
— Additional comment from bnemec@redhat.com on 2022-06-03 18:15:17 UTC —
Sorry, I missed that this came back to me.
(In reply to Binh Le from comment #16)
> We observe this issue when deploying clusters using OpenStack instances as
> our infrastructure is based on OpenStack.
This does not match the configuration in the must-gathers provided so far, which are baremetal. Are we talking about the same environments?
I'm currently discussing this with some other internal teams because I'm unfamiliar with this type of bootstrap setup. I need to understand what the intended behavior is before we decide on a path forward.
— Additional comment from rurena@redhat.com on 2022-06-06 14:36:54 UTC —
(In reply to Ben Nemec from comment #22)
> Sorry, I missed that this came back to me.
>
> (In reply to Binh Le from comment #16)
> > We observe this issue when deploying clusters using OpenStack instances as
> > our infrastructure is based on OpenStack.
>
> This does not match the configuration in the must-gathers provided so far,
> which are baremetal. Are we talking about the same environments?
>
> I'm currently discussing this with some other internal teams because I'm
> unfamiliar with this type of bootstrap setup. I need to understand what the
> intended behavior is before we decide on a path forward.
I spoke to the CU; they tell me that all work should be on baremetal. They were probably just testing on OSP and pointing out that they saw the same behavior.
— Additional comment from bnemec@redhat.com on 2022-06-06 16:19:37 UTC —
Okay, I see now that this is an assisted installer deployment. Can we get the cluster ID assigned by AI so we can take a look at the logs on our side? Thanks.
— Additional comment from lpbinh@gmail.com on 2022-06-06 16:38:56 UTC —
Here is the cluster ID, copied from the bug description:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
In regard to your earlier question about OpenStack & baremetal (2022-06-03 18:15:17 UTC):
We had an issue with platform validation in OpenStack earlier. Host validation was failing with the error message “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
It turns out that there is no platform type "OpenStack" available in [https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29], so we set "baremetal" as the platform type on our computes. That's the reason why you are seeing baremetal as the platform type.
Thank you
— Additional comment from ercohen@redhat.com on 2022-06-08 08:00:18 UTC —
Hey, first you are correct: when you set 10.30.1.0/24 as the machine network, the bootstrap process should use the IP on that subnet on the bootstrap node.
I'm trying to understand how exactly this cluster was installed.
You are using on-prem deployment of assisted-installer (podman/ACM)?
You are trying to form a cluster from OpenStack VMs?
You set the platform to Baremetal where?
Did you set user-managed-networking?
Some more info: when using the OpenStack platform you should install the cluster with user-managed-networking.
And that's what the failing validation is for.
— Additional comment from bnemec@redhat.com on 2022-06-08 14:56:53 UTC —
Moving to the assisted-installer component for further investigation.
— Additional comment from lpbinh@gmail.com on 2022-06-09 07:37:54 UTC —
@Eran Cohen:
Please see my response inline.
You are using on-prem deployment of assisted-installer (podman/ACM)?
--> Yes, we are using on-prem deployment of assisted-installer.
You are trying to form a cluster from OpenStack Vms?
--> Yes.
You set the platform to Baremetal where?
--> It was set in the Cluster object, Platform field when we model the cluster.
Did you set user-managed-networking?
--> Yes, we set it to false for VRRP.
— Additional comment from itsoiref@redhat.com on 2022-06-09 08:17:23 UTC —
@lpbinh@gmail.com can you please share the assisted-installer logs that you can download when the cluster is failed or installed? That will help us see the full picture.
— Additional comment from ercohen@redhat.com on 2022-06-09 08:23:18 UTC —
OK, as noted before, when using the OpenStack platform you should install the cluster with user-managed-networking (set to true).
Can you explain how you worked around this failing validation? “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
What does this mean exactly? 'we set "baremetal" as the platform type on our computes'
To be honest I'm surprised that the installation was completed successfully.
@oamizur@redhat.com I thought installing on OpenStack VMs with baremetal platform (user-managed-networking=false) will always fail?
— Additional comment from lpbinh@gmail.com on 2022-06-10 16:04:56 UTC —
@itsoiref@redhat.com: I will reproduce and collect the logs. Is that supposed to be included in the provided must-gather?
@ercohen@redhat.com:
— Additional comment from itsoiref@redhat.com on 2022-06-13 13:08:17 UTC —
@lpbinh@gmail.com you will have a download_logs link in the UI. Those logs are not part of must-gather.
— Additional comment from lpbinh@gmail.com on 2022-06-14 18:52:02 UTC —
Created attachment 1889993 [details]cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506
Attached is the cluster log per need info request.
Cluster ID: caa475b0-df04-4c52-8ad9-abfed1509506
In this reproduction the issue was not resolved by OpenShift itself; the wrong NAT remained and the cluster deployment eventually failed.
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#
— Additional comment from itsoiref@redhat.com on 2022-06-15 15:59:22 UTC —
@lpbinh@gmail.com just for the record, we don't support baremetal OCP on OpenStack; that's why the validation is failing.
— Additional comment from lpbinh@gmail.com on 2022-06-15 17:47:39 UTC —
@itsoiref@redhat.com as explained, it's just a workaround on our side to make OCP work in our lab, and from my understanding, from the OCP perspective it will see that the deployment is on baremetal only, not related to OpenStack (please correct me if I am wrong).
We have been doing thousands of OCP cluster deployments in our automation so far; if that's why the validation is failing, it should be failing every time. However, it only occurs occasionally when nodes have 2 interfaces and use the OCP internal DNS and load balancer, and it is sometimes resolved by itself and sometimes not.
— Additional comment from itsoiref@redhat.com on 2022-06-19 17:00:01 UTC —
For now I can assume that this endpoint is causing the issue:
{
"apiVersion": "v1",
"kind": "Endpoints",
"metadata": {
"creationTimestamp": "2022-06-14T17:31:10Z",
"labels":
,
"name": "kubernetes",
"namespace": "default",
"resourceVersion": "265",
"uid": "d8f558be-bb68-44ac-b7c2-85ca7a0fdab3"
},
"subsets": [
{
"addresses": [
],
"ports": [
{
"name": "https",
"port": 6443,
"protocol": "TCP"
}
]
}
]
},
— Additional comment from itsoiref@redhat.com on 2022-06-21 17:03:51 UTC —
The issue is that the kube-api service advertises the wrong IP, but it does so because kubelet chooses one arbitrarily, and we currently have no mechanism to set the kubelet IP, especially in the bootstrap flow.
— Additional comment from lpbinh@gmail.com on 2022-06-22 16:07:29 UTC —
@itsoiref@redhat.com how do you perform OCP deployments in setups that have multiple interfaces if kubelet chooses an interface arbitrarily instead of being configured with a specific IP address to listen on? With what you describe above, the chance of deployment failure on systems with multiple interfaces would be high.
— Additional comment from dhellard@redhat.com on 2022-06-24 16:32:26 UTC —
I set the Customer Escalation flag = Yes, per ACE EN-52253.
The impact is noted by the RH Account team: "Juniper is pressing and this impacts the Unica Next Project at Telefónica Spain. Unica Next is a critical project for Red Hat. We go live the 1st of July and this issue could impact the go-live dates. We need clear information about the status and its possible resolution."
— Additional comment from itsoiref@redhat.com on 2022-06-26 07:28:44 UTC —
I have sent an image with a possible fix to Juniper and am waiting for their feedback; once they confirm it works for them, we will proceed with the PRs.
— Additional comment from pratshar@redhat.com on 2022-06-30 13:26:26 UTC —
=== In Red Hat Customer Portal Case 03223143 ===
— Comment by Prateeksha Sharma on 6/30/2022 6:56 PM —
//EMT note//
Update from our consultant Manuel Martinez Briceno -
====
On 28th June 2022, the last feedback from the Juniper Project Manager and our Partner Manager was that they are testing the fix. They didn't give an estimated time to finish, but we will be tracking this closely and will share any news.
====
Thanks & Regards,
Prateeksha Sharma
Escalation Manager | RHCSA
Global Support Services, Red Hat
This is a clone of issue OCPBUGS-1678. The following is the description of the original issue:
—
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)
OCPBUGS-1677 is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
This one is about updating the code so that the test uses a mock response instead of the latest registry content, OR checks some specific attributes instead of comparing the full JSON response (a sketch of the mock approach is under Additional info below).
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
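As a sketch of the mock-response approach mentioned above (fetchSampleNames is an assumed helper for illustration; the real console test code is structured differently), an httptest server can replace the live registry:

package devfile

import (
	"encoding/json"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// fetchSampleNames is a hypothetical helper that reads a registry index and
// returns the sample names it finds.
func fetchSampleNames(registryURL string) ([]string, error) {
	resp, err := http.Get(registryURL + "/index")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	var samples []struct {
		Name string `json:"name"`
	}
	if err := json.Unmarshal(body, &samples); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(samples))
	for _, s := range samples {
		names = append(names, s.Name)
	}
	return names, nil
}

func TestSampleNamesAgainstMockRegistry(t *testing.T) {
	// A canned response stands in for the live registry content.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`[{"name":"nodejs-basic"},{"name":"go-basic"}]`))
	}))
	defer srv.Close()

	names, err := fetchSampleNames(srv.URL)
	if err != nil {
		t.Fatal(err)
	}
	// Check a specific attribute instead of comparing the full JSON payload.
	if len(names) != 2 || names[0] != "nodejs-basic" {
		t.Fatalf("unexpected samples: %v", names)
	}
}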
Description of problem:
When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.
Version-Release number of selected component (if applicable):
How reproducible:
with big exclude ranges, 100%
Steps to Reproduce:
1. create network-attachment-definition with a large range:
$ cat <<EOF| oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: nad-w-excludes
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "macvlan-net",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "fd43:01f1:3daa:0baa::/64",
        "exclude": [ "fd43:01f1:3daa:0baa::/100" ],
        "log_file": "/tmp/whereabouts.log",
        "log_level" : "debug"
      }
    }
EOF
2. create a pod with the network attached:
$ cat <<EOF|oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-exclude-range
  annotations:
    k8s.v1.cni.cncf.io/networks: nad-w-excludes
spec:
  containers:
  - name: pod-1
    image: openshift/hello-openshift
EOF
3. check pod status, event log and whereabouts logs after a while:
$ oc get pods
NAME READY STATUS RESTARTS AGE
pod-with-exclude-range 0/1 ContainerCreating 0 2m23s
$ oc get events
<...>
6m39s Normal Scheduled pod/pod-with-exclude-range Successfully assigned default/pod-with-exclude-range to <worker-node>
6m37s Normal AddedInterface pod/pod-with-exclude-range Add eth0 [10.129.2.49/23] from openshift-sdn
2m39s Warning FailedCreatePodSandBox pod/pod-with-exclude-range Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
$ oc debug node/<worker-node> - tail /host/tmp/whereabouts.log
Starting pod/<worker-node>-debug ...
To use host binaries, run `chroot /host`
2022-10-27T14:14:50Z [debug] Finished leader election
2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil>
2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf
2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS: {Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes: {KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:PodName:pod-with-exclude-range PodNamespace:default}
2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82
2022-10-27T14:14:59Z [debug] Started leader election
2022-10-27T14:14:59Z [debug] OnStartedLeading() called
2022-10-27T14:14:59Z [debug] Elected as leader, do processing
2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID:f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range
2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff
Actual results:
Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Expected results:
additional network gets attached to the pod
Additional info:
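For illustration, a hedged sketch of how an excluded CIDR could be skipped in a single jump rather than iterated address by address; nextAfterExclude is a hypothetical helper, not the whereabouts implementation:

package main

import (
	"fmt"
	"math/big"
	"net"
)

// nextAfterExclude returns the candidate IP unchanged if it is outside the
// excluded subnet, otherwise the first address just past that subnet.
func nextAfterExclude(ip net.IP, exclude *net.IPNet) net.IP {
	if !exclude.Contains(ip) {
		return ip
	}
	ones, bits := exclude.Mask.Size()
	base := new(big.Int).SetBytes(exclude.IP)
	size := new(big.Int).Lsh(big.NewInt(1), uint(bits-ones)) // addresses in the excluded block
	next := new(big.Int).Add(base, size)
	out := next.Bytes()
	// Left-pad to the full address length (16 bytes for IPv6).
	buf := make([]byte, bits/8)
	copy(buf[len(buf)-len(out):], out)
	return net.IP(buf)
}

func main() {
	_, exclude, _ := net.ParseCIDR("fd43:01f1:3daa:0baa::/100")
	candidate := net.ParseIP("fd43:1f1:3daa:baa::1")
	// Prints the first address after the /100 block instead of walking
	// through every one of its 2^28 addresses.
	fmt.Println(nextAfterExclude(candidate, exclude))
}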
Description of problem:
The customer is facing this problem in OCP 4.10.40: https://github.com/coredns/coredns/issues/5593
Its root cause seems to be this bug in the k8s code:
https://github.com/kubernetes/kubernetes/issues/109115
https://github.com/kubernetes/kubernetes/pull/109137
The issue seems to be fixed in OpenShift 4.11, but the customer can't update at this moment. Can this fix be backported to OpenShift 4.10?
Version-Release number of selected component (if applicable):
4.10.40
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The console-operator codebase contains a lot of inline manifests. Instead, we should put those manifests into a `/bindata` folder, from which they will be read and then updated per purpose.
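A minimal sketch of the idea, assuming go:embed and a hypothetical MustAsset helper rather than the console-operator's actual loader (the bindata directory is assumed to exist alongside this file):

package bindata

import "embed"

//go:embed bindata/*.yaml
var manifests embed.FS

// MustAsset returns the raw bytes of a manifest by file name, panicking on
// a missing name so wiring mistakes surface at operator startup.
func MustAsset(name string) []byte {
	data, err := manifests.ReadFile("bindata/" + name)
	if err != nil {
		panic(err)
	}
	return data
}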
Description of problem:
Users on a fully-disconnected cluster could not see Devfiles in the developer catalog or import a Devfile. That's fine.
But the API calls /api/devfile/samples/ and /api/devfile/ takes 30 seconds until they fail with a 504 Gateway timeout error.
If possible they should fail immediately.
Version-Release number of selected component (if applicable):
This might happen since 4.8
Tested this yet only on 4.12.0-0.nightly-2022-09-07-112008
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The console Pod log contains this error:
E0909 10:28:18.448680 1 devfile-handler.go:74] Failed to parse devfile: failed to populateAndParseDevfile: Get "https://registry.devfile.io/devfiles/go": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
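A minimal sketch of the fail-fast idea, assuming a hypothetical fetchDevfileSamples helper and an illustrative /index path rather than the console's actual handler:

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// fetchDevfileSamples caps the outbound call well below the gateway's 30s
// timeout so a disconnected cluster gets a quick, explicit error.
func fetchDevfileSamples(ctx context.Context, registryURL string) (*http.Response, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, registryURL+"/index", nil)
	if err != nil {
		return nil, err
	}
	return http.DefaultClient.Do(req)
}

func main() {
	_, err := fetchDevfileSamples(context.Background(), "https://registry.devfile.io")
	if err != nil {
		fmt.Println("devfile registry unreachable:", err)
	}
}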
+++ This bug was initially created as a clone of OCPBUGSM-46761 +++
Description of problem:
When using the admin console, under "Cluster Settings" and choosing Upstream Configuration, the "window" that appears has a dead link to documentation for how to create a local (disconnected) update server.
The link in the window is
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/updating_clusters/installing-update-service
Assume the right link should be something like:
https://docs.openshift.com/container-platform/4.10/updating/updating-restricted-network-cluster.html#update-restricted-network-cluster-update-service
Version-Release number of selected component (if applicable):
4.10.*
— Additional comment from plarsen@redhat.com on 2022-06-17 16:33:53 UTC —
Created attachment 1890947
Screenshot showing the "window" with the 404 link
— Additional comment from rhamilto@redhat.com on 2022-06-28 19:56:17 UTC —
Thank you, Peter, for wonderfully documenting the bug!
This bug is a backport clone of [Bugzilla Bug 2076646](https://bugzilla.redhat.com/show_bug.cgi?id=2076646). The following is the description of the original bug:
—
openshift-install destroy unable to delete PVC disks in GCP if cluster identifier is longer than 22 characters
Version:
$ openshift-install version
$ ./openshift-install 4.8.18
built from commit bd366e3cdcf892e1bddd841c702738f5254a0188
release image quay.io/openshift-release-dev/ocp-release@sha256:321aae3d3748c589bc2011062cee9fd14e106f258807dc2d84ced3f7461160ea
Platform: GCP
Installation Type: IPI
What happened?
When running the openshift-install destroy cluster command, it is observed that PVC disks are not getting deleted if the metadata.name is more than 22 characters.
What did you expect to happen?
All resources should get deleted successfully with openshift-installer destroy command.
How to reproduce it (as minimally and precisely as possible)?
$ Setup IPI GCP cluster
$ Provide cluster name with 22 chars.
$ Use standard (default) storage class, create pvc and pv.
$ Once done, destroy the cluster
$ Check on the backend platform if the storage disk for PVC is getting deleted or not.
Anything else we need to know?
We deployed an OpenShift 4 cluster in GCP, the `.metadata.name` field in the install config was gcpuser-a.ocp.redhat. The installer adds a unique identifier to the name for the InfraID, in our case, it resulted in `gcpusc1-a-ops-xpaas-nkp6w`.
After the cluster was provisioned, we created a PVC. The corresponding Google cloud disk followed the name `gcpuser-a.ocp.redhat-nk-pvc-<UID>`. Because the disk name did not exactly match the InfraID, when we ran the openshift-install destroy for this cluster, none of the disks for PVCs were deleted.
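A toy illustration of the mismatch (the prefix filter below is a simplified stand-in for the installer's destroy logic; the names come from the report above, with the PVC UID left elided):

package main

import (
	"fmt"
	"strings"
)

func main() {
	infraID := "gcpusc1-a-ops-xpaas-nkp6w"
	disks := []string{
		"gcpusc1-a-ops-xpaas-nkp6w-master-0",    // matches the infra ID prefix, deleted
		"gcpuser-a.ocp.redhat-nk-pvc-<UID>",     // PVC disk named from the cluster name, skipped
	}
	for _, d := range disks {
		if strings.HasPrefix(d, infraID) {
			fmt.Println("would delete:", d)
		} else {
			fmt.Println("left behind:", d)
		}
	}
}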
This bug is a backport clone of [Bugzilla Bug 1983056](https://bugzilla.redhat.com/show_bug.cgi?id=1983056). The following is the description of the original bug:
—
Description of problem:
During an upgrade from 4.5.40 to 4.6.31, the CNI is restarting because it is unable to plug the provided VIF, as it is already being used by another Pod.
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params:
{'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/var/run/netns/0420f2a3-d2fe-40e6-86f0-9a38a17c933a', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin:/usr/libexec/cni', 'CNI_COMMAND': 'ADD', 'CNI_CONTAINERID': '73eee9240ae6bcfec8b539fa2b12c8e82f51f8a95f29aaaedc95e4e05f7cb734', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-monitoring;K8S_POD_NAME=prometheus-k8s-0;K8S_POD_INFRA_CONTAINER_ID=73eee9240ae6bcfec8b539fa2b12c8e82f51f8a95f29aaaedc95e4e05f7cb734'}: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 82, in add
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service vif = self.plugin.add(params)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 75, in add
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service vifs = self._do_work(params, b_base.connect, timeout)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 184, in _do_work
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service container_id=params.CNI_CONTAINERID)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py", line 156, in connect
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service driver.connect(vif, ifname, netns, container_id)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 126, in connect
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service iface.net_ns_fd = utils.convert_netns(netns)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/transactional.py", line 209, in _exit_
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service self.commit()
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 650, in commit
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service raise newif
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 589, in commit
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service self.nl.link('add', **request)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/iproute/linux.py", line 1163, in link
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service msg_flags=msg_flags)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 373, in nlm_request
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service return tuple(self._genlm_request(*argv, **kwarg))
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 864, in nlm_request
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service callback=callback):
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 376, in get
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service return tuple(self._genlm_get(*argv, **kwarg))
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 701, in get
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service raise msg['header']['error']
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service
2021-07-16 10:55:02.585 232 INFO werkzeug [-] 127.0.0.1 - - [16/Jul/2021 10:55:02] "POST /addNetwork HTTP/1.1" 500 -
2021-07-16 10:55:02.656 251 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:c1:cd:25,has_traffic_filtering=False,id=88bdb7f9-65e6-4c54-83d1-73341876da08,network=Network(cc5c0761-5f89-42b8-a4fc-0d829eba818d),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap88bdb7f9-65',vlan_id=2482)
The prometheus Pod is configured to use the same IP as the alertmanager Pod, and the alertmanager Pod is using an IP different from the one specified in its annotation:
[stack@undercloud-0 ~]$ oc get po prometheus-k8s-0 -n openshift-monitoring -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
openshift.io/scc: anyuid
openstack.org/kuryr-pod-label: '
'
openstack.org/kuryr-vif: '{"versioned_object.changes": ["default_vif"], "versioned_object.data":
{"additional_vifs": {}, "default_vif": {"versioned_object.changes": ["has_traffic_filtering",
"plugin", "active", "vif_name", "preserve_on_delete", "network", "id", "address",
"vlan_id"], "versioned_object.data": {"active": true, "address": "fa:16:3e:c1:cd:25",
"has_traffic_filtering": false, "id": "88bdb7f9-65e6-4c54-83d1-73341876da08",
"network": {"versioned_object.changes": ["mtu", "multi_host", "subnets", "label",
"id", "should_provide_bridge", "should_provide_vlan"], "versioned_object.data":
{"id": "cc5c0761-5f89-42b8-a4fc-0d829eba818d", "label": "ns/openshift-monitoring-net",
"mtu": 1442, "multi_host": false, "should_provide_bridge": false, "should_provide_vlan":
false, "subnets": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["ips", "gateway", "routes", "cidr",
"dns"], "versioned_object.data": {"cidr": "10.128.8.0/23", "dns": [], "gateway":
"10.128.8.1", "ips": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["address"], "versioned_object.data":
, "versioned_object.name": "FixedIP", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}]}, "versioned_object.name": "FixedIPList",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"},
"routes": {"versioned_object.changes": ["objects"], "versioned_object.data":
, "versioned_object.name": "RouteList", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}}, "versioned_object.name": "Subnet",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"}]},
"versioned_object.name": "SubnetList", "versioned_object.namespace": "os_vif",
"versioned_object.version": "1.0"}}, "versioned_object.name": "Network", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.1"}, "plugin": "noop", "preserve_on_delete":
false, "vif_name": "tap88bdb7f9-65", "vlan_id": 2482}, "versioned_object.name":
"VIFVlanNested", "versioned_object.namespace": "os_vif", "versioned_object.version":
"1.0"}}, "versioned_object.name": "PodState", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}'
creationTimestamp: "2021-07-15T12:24:52Z"
generateName: prometheus-k8s-
labels:
app: prometheus
controller-revision-hash: prometheus-k8s-5949f47544
prometheus: k8s
statefulset.kubernetes.io/pod-name: prometheus-k8s-0
name: prometheus-k8s-0
namespace: openshift-monitoring
ownerReferences:
[stack@undercloud-0 ~]$ oc get po -A -o wide |grep 10.128.9.175
openshift-monitoring alertmanager-main-2 5/5 Running 0 22h 10.128.9.175 ostest-f57bt-worker-vprrk <none> <none>
[stack@undercloud-0 ~]$ oc get po alertmanager-main-2 -n openshift-monitoring -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "kuryr",
"interface": "eth0",
"ips": [
"10.128.9.175"
],
"mac": "fa:16:3e:c1:cd:25",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "kuryr",
"interface": "eth0",
"ips": [
"10.128.9.175"
],
"mac": "fa:16:3e:c1:cd:25",
"default": true,
"dns": {}
}]
openshift.io/scc: anyuid
openstack.org/kuryr-pod-label: '
'
openstack.org/kuryr-vif: '{"versioned_object.changes": ["default_vif"], "versioned_object.data":
{"additional_vifs": {}, "default_vif": {"versioned_object.changes": ["active",
"has_traffic_filtering", "network", "address", "id", "preserve_on_delete", "vlan_id",
"plugin", "vif_name"], "versioned_object.data": {"active": true, "address":
"fa:16:3e:77:a3:12", "has_traffic_filtering": false, "id": "f6dd52db-40e1-4339-a7e6-1e2bd2f6f772",
"network": {"versioned_object.changes": ["multi_host", "label", "should_provide_vlan",
"should_provide_bridge", "mtu", "id", "subnets"], "versioned_object.data": {"id":
"cc5c0761-5f89-42b8-a4fc-0d829eba818d", "label": "ns/openshift-monitoring-net",
"mtu": 1442, "multi_host": false, "should_provide_bridge": false, "should_provide_vlan":
false, "subnets": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["routes", "dns", "cidr", "gateway",
"ips"], "versioned_object.data": {"cidr": "10.128.8.0/23", "dns": [], "gateway":
"10.128.8.1", "ips": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["address"], "versioned_object.data":
, "versioned_object.name": "FixedIP", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}]}, "versioned_object.name": "FixedIPList",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"},
"routes": {"versioned_object.changes": ["objects"], "versioned_object.data":
, "versioned_object.name": "RouteList", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}}, "versioned_object.name": "Subnet",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"}]},
"versioned_object.name": "SubnetList", "versioned_object.namespace": "os_vif",
"versioned_object.version": "1.0"}}, "versioned_object.name": "Network", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.1"}, "plugin": "noop", "preserve_on_delete":
false, "vif_name": "tapf6dd52db-40", "vlan_id": 3914}, "versioned_object.name":
"VIFVlanNested", "versioned_object.namespace": "os_vif", "versioned_object.version":
"1.0"}}, "versioned_object.name": "PodState", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}'
creationTimestamp: "2021-07-15T12:23:41Z"
generateName: alertmanager-main-
labels:
alertmanager: main
app: alertmanager
controller-revision-hash: alertmanager-main-5548759bbd
statefulset.kubernetes.io/pod-name: alertmanager-main-2
name: alertmanager-main-2
namespace: openshift-monitoring
(shiftstack) [stack@undercloud-0 ~]$ openstack port list |grep 10.128.9.175
88bdb7f9-65e6-4c54-83d1-73341876da08 | fa:16:3e:c1:cd:25 | ip_address='10.128.9.175', subnet_id='a4ee6044-8ddd-4dbf-bcd3-22f95ec4ce16' | ACTIVE |
(shiftstack) [stack@undercloud-0 ~]$ openstack port list |grep 10.128.9.238
f6dd52db-40e1-4339-a7e6-1e2bd2f6f772 | fa:16:3e:77:a3:12 | ip_address='10.128.9.238', subnet_id='a4ee6044-8ddd-4dbf-bcd3-22f95ec4ce16' | ACTIVE |
(shiftstack) [stack@undercloud-0 ~]$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.31 True False False 22h
cloud-credential 4.6.31 True False False 26h
cluster-autoscaler 4.6.31 True False False 25h
config-operator 4.6.31 True False False 25h
console 4.6.31 True False False 22h
csi-snapshot-controller 4.6.31 True False False 25h
dns 4.5.40 True False False 25h
etcd 4.6.31 True False False 25h
image-registry 4.6.31 True False False 25h
ingress 4.6.31 True False False 22h
insights 4.6.31 True False False 25h
kube-apiserver 4.6.31 True False False 25h
kube-controller-manager 4.6.31 True False False 25h
kube-scheduler 4.6.31 True False False 25h
kube-storage-version-migrator 4.6.31 True False False 25h
machine-api 4.6.31 True False False 25h
machine-approver 4.6.31 True False False 25h
machine-config 4.5.40 True False False 23h
marketplace 4.6.31 True False False 22h
monitoring 4.5.40 False True True 22h
network 4.5.40 True True False 25h
node-tuning 4.6.31 True False False 22h
openshift-apiserver 4.6.31 True False False 25h
openshift-controller-manager 4.6.31 True False False 22h
openshift-samples 4.6.31 True False False 22h
operator-lifecycle-manager 4.6.31 True False False 25h
operator-lifecycle-manager-catalog 4.6.31 True False False 25h
operator-lifecycle-manager-packageserver 4.6.31 True False False 22h
service-ca 4.6.31 True False False 25h
storage 4.6.31 True False False 22h
(shiftstack) [stack@undercloud-0 ~]$ oc get po -A -o wide |grep 10.128.9.238 |wc -l
0
The same issue would also be possible on 3.11, as it is likewise based on annotations.
Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.6 GA
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
This bug is a backport clone of [Bugzilla Bug 2075091](https://bugzilla.redhat.com/show_bug.cgi?id=2075091). The following is the description of the original bug:
—
Symptom Detection.Undiagnosed panic detected in pod
is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Symptom%20Detection.Undiagnosed%20panic%20detected%20in%20pod
This problem seemed existing before. But number of cases surged and caused two nightly payloads to be rejected:
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-150057
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-185124
After that, it mysteriously disappeared.
Here is a specific case:
Message from the test case:
{ pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log.gz:E0412 15:52:33.358619 1 runtime.go:78] Observed a panic: runtime.boundsError
{x:4, y:4, signed:true, code:0x0}(runtime error: index out of range [4] with length 4)}
E0412 15:52:33.358619 1 runtime.go:78] Observed a panic: runtime.boundsError
{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)
goroutine 77 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(
)
/usr/lib/golang/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.createPodContainerInfoFamilyGenerator.func1(0xc003422c00)
/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:134 +0x375
k8s.io/kube-state-metrics/v2/internal/store.wrapPodFunc.func1(
)
/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc0000c13c0,
)
/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:87 +0x25
k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc000192fc0,
,
{0x1a373f8, 0xc0011c24c0}, 0xc000623d60, 0xc0005ff380, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:506 +0xa55
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000192fc0, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:429 +0x696
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02ffada1d0)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00036a2c0,
, 0x1, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000192fc0, 0xc0002cc480)
/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:508 +0x2c8
panic: runtime error: index out of range [4] with length 4 [recovered]
panic: runtime error: index out of range [4] with length 4
It points to https://github.com/openshift/kube-state-metrics/blob/6efa87f858ee53028fd2de40941b61c09e9ee049/internal/store/pod.go#L134 where the lengths of p.Status.ContainerStatuses and p.Spec.Containers appear to diverge.
Unfortunately the condition is ephemeral, and the state that caused the panic is no longer present in the must-gather data.
The ask is to safeguard the code to avoid the panic and to log useful debugging information to track down offenders.
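A minimal sketch of what such a guard could look like, using a hypothetical helper rather than the actual kube-state-metrics change: look containers up in the status slice by name instead of by positional index, so a shorter ContainerStatuses slice can never cause an out-of-range read.
```go
package store

import (
	v1 "k8s.io/api/core/v1"
)

// containerStatusByName is a hypothetical helper, not the real fix: it finds
// a container's status by name rather than by positional index, so the
// metric generator cannot index past len(p.Status.ContainerStatuses) when it
// diverges from len(p.Spec.Containers).
func containerStatusByName(p *v1.Pod, name string) (v1.ContainerStatus, bool) {
	for _, cs := range p.Status.ContainerStatuses {
		if cs.Name == name {
			return cs, true
		}
	}
	return v1.ContainerStatus{}, false
}
```
A caller that previously indexed p.Status.ContainerStatuses[i] would then skip, or log, containers whose status is not present yet instead of panicking.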
This is a clone of issue OCPBUGS-2083. The following is the description of the original issue:
—
Description of problem:
Currently we are running the VMware CSI Operator in OpenShift 4.10.33. Vulnerability scans discovered that the operator accepts the known weak 3DES cipher. We are attempting to upgrade or modify the operator to customize the available ciphers. We looked at performing a manual upgrade via Quay.io but cannot pull the image, and we want to avoid a custom install from scratch. We are looking for suggestions on mitigating the weak cipher in the kube-rbac-proxy under the VMware CSI Operator.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Tracker bug for bootimage bump in 4.10. This bug should block bugs which need a bootimage bump to fix.
Description of problem:
We observed that a dual-stack cluster deployed with the AI GUI only fails. This cluster uses DHCP for IPv4 and RA/RS autoconfiguration for IPv6. It fails with the following error in the ovnkube container:
```
I0906 07:45:43.044090 87450 gateway_init.go:261] Initializing Gateway Functionality
I0906 07:45:43.046398 87450 gateway_localnet.go:152] Node local addresses initialized to: map[10.131.31.214:{10.131.31.208 fffffff0} 10.255.0.2:{10.255.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 2001:1b74:480:613a:f6e9:d4ff:fef1:6f26:{2001:1b74:480:613a:: ffffffffffffffff0000000000000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fd01:0:0:1::2:{fd01:0:0:1:: ffffffffffffffff0000000000000000} fe80::8ce9:b4ff:fe1a:1208:{fe80:: ffffffffffffffff0000000000000000} fe80::c8ef:ecff:fee3:64c7:{fe80:: ffffffffffffffff0000000000000000} fe80::f6e9:d4ff:fef1:6f26:{fe80:: ffffffffffffffff0000000000000000}]
I0906 07:45:43.047759 87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
I0906 07:45:43.048045 87450 helper_linux.go:97] Found default gateway interface br-ex 10.131.31.209
I0906 07:45:43.048152 87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
F0906 07:45:43.048318 87450 ovnkube.go:133] failed to get default gateway interface
```
On the node we observed a multipath default route entry:
```
default proto ra metric 48 pref medium
	nexthop via fe80::e2f6:2d01:ab14:ec71 dev br-ex weight 1
	nexthop via fe80::e2f6:2d01:ab11:c271 dev br-ex weight 1
```
If I manually remove one of the entries (`ip route delete`) and then delete the ovnkube-node pod, the installation continues and the container works. Every time there are multiple entries when ovnkube-node starts, it fails.
Version-Release number of selected component (if applicable):
4.10.30
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
There might be a side issue: the node's interface takes time after boot to complete IPv6 autoconfiguration; no RS packets seemed to be sent out (zero observed on all routers).
Description of problem:
Removing the self-provisioner role from system:authenticated users as per https://access.redhat.com/solutions/4040541 should stop users from being able to create new projects, but the customer has found this only partially works. In the cluster web console Administrator view the "Create Project" button is not available, but after switching to the Developer view a default user can still create a project.
Version-Release number of selected component (if applicable):
How reproducible:
Follow https://access.redhat.com/solutions/1529893
Steps to Reproduce:
1. oc adm policy remove-cluster-role-from-group self-provisioner system:authenticated:oauth
2. Log back in as the user and switch between the Admin and Developer views
3. The user still has the link showing in the Developer console
Actual results:
Create new project link still exists
Expected results:
Create new project link should be removed, similar to Admin Console
Additional info:
Although the link still exists, the user gets a correct permission-denied message.
Description of problem:
Sometimes we see VMs fail to power on when they land on a host that does not have enough resources. The current power-on call does not retry or leverage DRS to power on the VM on a suitable host.
https://github.com/vmware/govmomi/issues/1026
Our code still calls PowerOnVM_Task which, according to the vSphere docs, is deprecated; we should use PowerOnMultiVM_Task instead.
PowerOnVM_Task does not return a DRS ClusterRecommendation, no vmotion nor host power operations will be done as part of a DRS-facilitated power on. To have DRS consider such operations use PowerOnMultiVM_Task.
https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.VirtualMachine.html#powerOn:
As of vSphere API 5.1, use of this method with vCenter Server is deprecated; use PowerOnMultiVM_Task instead.
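A minimal sketch of that direction, assuming govmomi's object.Datacenter.PowerOnVM wrapper (which issues PowerOnMultiVM_Task); the helper name powerOnWithDRS is made up for illustration and this is not the actual machine-api code:
```go
package main

import (
	"context"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

// powerOnWithDRS powers on a VM through the datacenter-level
// PowerOnMultiVM_Task API so vCenter/DRS can pick (or free up) a suitable
// host, instead of the per-VM PowerOnVM_Task call.
func powerOnWithDRS(ctx context.Context, dc *object.Datacenter, vmRef types.ManagedObjectReference) error {
	task, err := dc.PowerOnVM(ctx, []types.ManagedObjectReference{vmRef})
	if err != nil {
		return err
	}
	// Against standalone ESX the wrapper powers VMs on serially and may
	// return a nil task.
	if task == nil {
		return nil
	}
	return task.Wait(ctx)
}
```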
Version-Release number of selected component (if applicable):
4.8.x
How reproducible:
Always
Steps to Reproduce:
1.
2.
3.
Actual results:
Sometimes power-on fails, requiring manual intervention.
Expected results:
PowerOn should use DRS to ensure it's always successful.
Additional info:
Description of problem:
Similar to OCPBUGS-11636, ccoctl needs to be updated to account for the S3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/. These changes have rolled out to us-east-2 and the China regions as of today and will roll out to additional regions in the near future. See OCPBUGS-11636 for additional information.
Version-Release number of selected component (if applicable):
How reproducible:
Reproducible in affected regions.
Steps to Reproduce:
1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.
Actual results:
./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs
	status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=
Expected results:
"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.
Additional info:
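A minimal sketch of one possible workaround, under the assumption that the bucket should keep using ACLs (this is not the actual ccoctl change): switch the new bucket's object-ownership setting away from the BucketOwnerEnforced default so ACL-based uploads are accepted again. Public access would still have to be allowed separately, for example via the bucket's public access block settings or a bucket policy.
```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// enableBucketACLs relaxes the ObjectOwnership setting on a freshly created
// bucket so that ACLs (e.g. public-read on the OIDC documents) are accepted
// again. This is an illustrative workaround sketch, not the ccoctl fix.
func enableBucketACLs(bucket, region string) error {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion(region)))
	svc := s3.New(sess)
	_, err := svc.PutBucketOwnershipControls(&s3.PutBucketOwnershipControlsInput{
		Bucket: aws.String(bucket),
		OwnershipControls: &s3.OwnershipControls{
			Rules: []*s3.OwnershipControlsRule{
				{ObjectOwnership: aws.String(s3.ObjectOwnershipBucketOwnerPreferred)},
			},
		},
	})
	return err
}
```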
Description of problem:
Install fails if an sts service endpoint is specified in the install-config.yaml file:
platform:
aws:
region: us-east-2
serviceEndpoints:
- name: sts
url: https://sts.us-east-2.amazonaws.com
Errors:
level=error msg=Error: error configuring Terraform AWS Provider: error validating provider credentials: error calling sts:GetCallerIdentity: SignatureDoesNotMatch: Credential should be scoped to a valid region.
level=error msg= status code: 403, request id: f4e877fe-9e90-4cba-a455-2538d489a8d0
level=error
level=error msg= on ../../../../../../../../tmp/openshift-install-cluster-1552617948/main.tf line 11, in provider "aws":
level=error msg= 11: provider "aws" {
level=error
level=error
level=error msg=Failed to read tfstate: open /tmp/openshift-install-cluster-1552617948/terraform.cluster.tfstate: no such file or directory
Version-Release number of selected component (if applicable):
4.10.z
How reproducible:
* Always
Steps to Reproduce:
1. Create an install-config.yaml and specify the sts endpoint
2. Create an STS cluster
Actual results:
Error: error configuring Terraform AWS Provider: error validating provider credentials: error calling sts:GetCallerIdentity: SignatureDoesNotMatch: Credential should be scoped to a valid region.
Expected results:
No errors, create cluster successfully
Additional info:
No such issues on 4.8, 4.9, 4.11, 4.12
This is a clone of issue OCPBUGS-8399. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7474. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6714. The following is the description of the original issue:
—
Description of problem:
Traffic from egress IPs was interrupted after the cluster was patched to OpenShift 4.10.46.
A customer cluster was patched; it is an OpenShift 4.10.46 cluster with SDN.
More detail about the issue is available in a private comment below, since it contains customer data.
Description of problem:
Currently, when installing OpenShift on OpenStack, the cluster name length is limited to 14 characters. The customer wants to know whether it is possible to override this validation when installing OpenShift on OpenStack and create a cluster name longer than 14 characters.
Version: OCP 4.8.5 UPI, disconnected environment; OpenStack 16
Issue: the user reports getting an error for an OCP UPI cluster on OpenStack where the cluster name is longer than 14 characters.
Error events:
~~~
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/openshift-install", "create", "manifests", "--dir=/home/gitlab-runner/builds/WK8mkokN/0/CPE/SKS/pipelines/non-prod/ocp4-openstack-build/ocpinstaller/install-upi"], "delta": "0:00:00.311397", "end": "2022-09-03 21:38:41.974608", "msg": "non-zero return code", "rc": 1, "start": "2022-09-03 21:38:41.663211", "stderr": "level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters", "stderr_lines": ["level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters"], "stdout": "", "stdout_lines": []}
~~~
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Users get the error "cluster name is too long" when the cluster name contains more than 14 characters for OCP on OpenStack.
Expected results:
The 14-character limit should be relaxed for the OCP cluster name on OpenStack (see the illustrative sketch under Additional info below).
Additional info:
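For illustration only (an assumption about how such a check behaves, not the installer's actual code), the validation amounts to a fixed length limit on metadata.name for OpenStack installs:
```go
package main

import "fmt"

// maxOpenStackClusterNameLen mirrors the 14-character limit reported in the
// error message; the constant and function names here are made up.
const maxOpenStackClusterNameLen = 14

func validateOpenStackClusterName(name string) error {
	if len(name) > maxOpenStackClusterNameLen {
		return fmt.Errorf("cluster name is too long, please restrict it to %d characters", maxOpenStackClusterNameLen)
	}
	return nil
}

func main() {
	// "sks-osp-inf-cpe-1-cbr1a" is 23 characters, so this prints the error.
	fmt.Println(validateOpenStackClusterName("sks-osp-inf-cpe-1-cbr1a"))
}
```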
+++ This bug was initially created as a clone of Bug #2081562 +++
Description of problem:
lifecycle.postStart hooks do not have network connectivity with the OpenShiftSDN CNI (OVNKubernetes does not have the issue).
Version-Release number of selected component (if applicable):
4.10
How reproducible:
always
Steps to Reproduce:
1. create statefulset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ oc create -f statefulset.yaml
$ cat statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: httpd
spec:
  serviceName: "httpd"
  replicas: 1
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
Actual results:
PostStartHook fails
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
36s Normal Killing pod/httpd-0 FailedPostStartHook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expected results:
PostStartHook should not fail.
Additional info:
By adding a dummy initContainer, you can work around the issue.
something like this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
spec:
initContainers:
— Additional comment from rphillips@redhat.com on 2022-05-11 19:48:10 UTC —
crio's contract with networking is to have networking up when the container starts. Moving to the openshift-sdn team to help triage what is going on.
— Additional comment from hyoskim@redhat.com on 2022-06-09 00:40:33 UTC —
Hello,
Is there any update on this issue?
— Additional comment from npinaeva@redhat.com on 2022-06-09 07:53:38 UTC —
Hello, yeah we found the root cause and working on the fix now - PR should be ready by the end of the week
— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-07-24 15:21:48 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.
— Additional comment from errata-xmlrpc@redhat.com on 2022-07-27 00:18:40 UTC —
This bug has been added to advisory RHSA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)
— Additional comment from swasthan@redhat.com on 2022-07-27 05:38:55 UTC —
Hello Team, thank you for the help so far!
May we know if this is going to backport in v4.10.z as well?
Regards,
Swadeep
— Additional comment from zzhao@redhat.com on 2022-07-27 06:40:30 UTC —
this fixed PR is merged to build 4.12.0-0.nightly-2022-07-24-180529
So I update the target version to 4.12 version.
— Additional comment from zzhao@redhat.com on 2022-07-27 06:48:33 UTC —
still failed on build 4.12.0-0.nightly-2022-07-26-131732
Creating the above statefulset and pod still fails with the same error:
27s Warning FailedPostStartHook pod/httpd-0 Exec lifecycle hook ([/bin/sh -c curl -k https://<IP:PORT> > /tmp/urltest.txt]) for Container "httpd" in Pod "httpd-0_default(7e519841-7092-4513-928b-03c7783ddc7d)" failed - error: command '/bin/sh -c curl -k https://<IP:PORT> > /tmp/urltest.txt' exited with 1: /bin/sh: -c: line 0: syntax error near unexpected token `>'...
85s Normal Killing pod/httpd-0 FailedPostStartHook
— Additional comment from npinaeva@redhat.com on 2022-07-27 12:50:53 UTC —
Hey @zzhao@redhat.com can you share full statefulset yaml you're running?
Doesn't "line 0: syntax error near unexpected token `>'..." mean bash command is wrong?
— Additional comment from zzhao@redhat.com on 2022-07-27 13:36:43 UTC —
(In reply to Nadia Pinaeva from comment #9)
> Hey @zzhao@redhat.com can you share full statefulset yaml you're running?
> Doesn't "line 0: syntax error near unexpected token `>'..." mean bash
> command is wrong?
I'm using the statefulset from comment 0
$ cat statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: httpd
spec:
  serviceName: "httpd"
  replicas: 1
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
— Additional comment from npinaeva@redhat.com on 2022-07-27 14:02:54 UTC —
Did you replace <IP:PORT> here "curl -k https://<IP:PORT> > /tmp/urltest.txt"?
— Additional comment from zzhao@redhat.com on 2022-07-28 07:43:25 UTC —
(In reply to Nadia Pinaeva from comment #11)
> Did you replace <IP:PORT> here "curl -k https://<IP:PORT> >
> /tmp/urltest.txt"?
oh my bad
Tested again after replacing the IP and port with the following:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: httpd
spec:
  serviceName: "httpd"
  replicas: 1
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
—
on 4.12.0-0.nightly-2022-07-27-133042
$ oc get pod
NAME READY STATUS RESTARTS AGE
httpd-0 1/1 Running 0 2m28s
— Additional comment from npinaeva@redhat.com on 2022-07-29 13:02:53 UTC —
@swasthan@redhat.com yes, we are going to backport it to 4.10 (hopefully it will be faster than the fix itself )
Description of problem:
When running node-density (245 pods/node) on a 120-node cluster, we see a huge spike (~22s) in average pod latency. When the spike occurs, all the ovnkube-master pods go through a restart.
The restarts are caused by a panic in the ovnkube-master pods:
2022-08-10T04:04:44.494945179Z panic: reflect: call of reflect.Value.Len on ptr Value
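For illustration only (a generic example, not the ovnkube-master code): reflect.Value.Len panics with exactly this message when it is called on a pointer Value that has not been dereferenced first.
```go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	items := []string{"a", "b"}
	v := reflect.ValueOf(&items) // Kind() == reflect.Ptr

	// Calling v.Len() here would panic with:
	// "reflect: call of reflect.Value.Len on ptr Value"

	// Guard: dereference pointer Values before calling Len.
	if v.Kind() == reflect.Ptr {
		v = v.Elem()
	}
	fmt.Println(v.Len()) // prints 2
}
```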
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-09-114621
How reproducible:
Steps to Reproduce:
1. Run node-density on a 120 node cluster
Actual results:
Spike observed in pod-latency graph ~22s
Expected results:
Steady pod-latency graph ~4s
Additional info: