Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Problem:
Certain Insights Advisor features differentiate between RHEL and OCP advisor
Goal:
Address top priority UI misalignments between RHEL and OCP advisor. Address UI features dropped from Insights ADvisor for OCP GA.
Scope:
Specific tasks and priority of them tracked in https://issues.redhat.com/browse/CCXDEV-7432
This contains all the Insights Advisor widget deliverables for the OCP release 4.11.
Scope
It covers only minor bug fixes and improvements:
Show the error message (mocked in CCXDEV-5868) if the Prometheus metrics `cluster_operator_conditions{name="insights"}` contain two true conditions: UploadDegraded and Degraded at the same time. This state occurs if there was an IO archive upload error = problems with the pipeline.
Expected for 4.11 OCP release.
Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis Given: OCP WebConsole UI and the cluster dashboard is accessible And: CCX external data pipeline is in a working state And: administrator A1 has access to his cluster's dashboard And: Insights Operator for this cluster is sending archives When: administrator A1 clicks on the Insights Advisor widget Then: the results of the last analysis are showed in the Insights Advisor widget And: the time of the last analysis is shown in the Insights Advisor widget
Acceptance criteria:
max_over_time(timestamp(changes(insightsclient_request_send_total\{status_code="202"}[1m]) > 0)[24h:1m])
Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.
Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.
CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.
DoD
CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.
DoD
Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy prometheus with PVs and have retention based an on a desired size would enable easier management of these volumes across the fleet.
The prometheus-operator exposes retentionSize.
Field | Description |
---|---|
retentionSize | Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB. |
This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.
Today, all configuration for setting individual, for example, routing configuration is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they are using to notify teams in case of an issue, then someone needs to file a request w/ an admin to add the required settings.
That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.
We would like to introduce a more self service approach whereas individual teams can create their own configuration for their needs w/o the administrators involvement.
Last but not least, since Monitoring is deployed as a Core service of OpenShift there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction is the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. They can't give access to the central managed secret due to security concerns so that users can add their own routing information.
Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.
Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.
Team A wants to send all their important notifications to a specific Slack channel.
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Now that upstream supports AlertmanagerConfig v1beta1 (see MON-2290 and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.
DoD:
DoD
DoD
Copy/paste from [_https://github.com/openshift-cs/managed-openshift/issues/60_]
Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS
What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.
Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.
Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/
Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.
The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.
We will extend CMO's configuration API to support the following authentications with remote write:
Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).
Prometheus and Prometheus operator already support custom Authorization for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
Authorization:
type: Bearer
credentials:
name: credentials
key: token
DoD:
Prometheus and Prometheus operator already support sigv4 authentication for remote write. This should be possible to configure the same in the CMO configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
prometheusK8s:
remoteWrite:
- url: "https://remote-write.endpoint"
sigv4:
accessKey:
name: aws-credentialss
key: access
secretKey:
name: aws-credentials
key: secret
profile: "SomeProfile"
roleArn: "SomeRoleArn"
DoD:
As WMCO user, I want to make sure containerd logging information has been updated in documents and scripts.
Configure audit logging to capture login, logout and login failure details
TODO(PM): update this
Customer who needs login, logout and login failure details inside the openshift container platform.
I have checked for this on my test cluster but the audit logs do not contain any user name specifying login or logout details. For successful logins or logout, on CLI and openshift console as well we can see 'Login successful' or 'Invalid credentials'.
Expected results: Login, logout and login failures should be captured in audit logging.
The apiserver pods today have ´/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver specific, but as oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server to create /var/log/oauth-server/audit.log files on the master nodes using that format.
When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.
Right now there's no way to generate audit logs from this.
Let the Cluster Authentication Operator deliver the policy to OAuthServer.
In order to know if authn events should be logged, OAuthServer needs to be aware of it.
* Stanislav LázničkaCreate an observer to deliver the audit policy to the oauth server
Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field, such an observer will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.
Right now there's no way to generate audit logs from this.
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Early customer feedback is that they see SNO as a great solution covering smaller footprint deployment, but are wondering what is the evolution story OpenShift is going to provide where more capacity or high availability are needed in the future.
While migration tooling (moving workload/config to new cluster) could be a mid-term solution, customer desire is not to include extra hardware to be involved in this process.
For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This is a ticket meant to track all the all the OCP PRs that are involved in the implementation of the SNO + workers enhancement
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Rebase openshift/builder to k8s 1.24
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
As a deployer, I want to be able to:
so that I can achieve
Currently the Assisted Service generates the credentials by running the ignition generation step of the oepnshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.
In the BILLI usage, which takes down assisted service before the installation is complete there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
The AWS-specific code added in OCPPLAN-6006 needs to become GA and with this we want to introduce a couple of Day2 improvements.
Currently the AWS tags are defined and applied at installation time only and saved in the infrastructure CRD's status field for further operator use, which in turn just add the tags during creation.
Saving in the status field means it's not included in Velero backups, which is a crucial feature for customers and Day2.
Thus the status.resourceTags field should be deprecated in favour of a newly created spec.resourceTags with the same content. The installer should only populate the spec, consumers of the infrastructure CRD must favour the spec over the status definition if both are supplied, otherwise the status should be honored and a warning shall be issued.
Being part of the spec, the behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the AWS resources should be updated accordingly.
On AWS this can be done without re-creating any resources (the behaviour is basically an upsert by tag key) and is possible without service interruption as it is a metadata operation.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
After that, we can remove the experimental flag and make this a GA feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
RFE-1101 described user defined tags for AWS resources provisioned by an OCP cluster. Currently user can define tags which are added to the resources during creation. These tags cannot be updated subsequently. The propagation of the tags is controlled using experimental flag. Before this feature goes GA we should define and implement a mechanism to exclude any experimental flags. Day2 operations and deletion of tags is not in the scope.
RFE-2012 aims to make the user-defined resource tags feature GA. This means that user defined tags should be updatable.
Currently the user-defined tags during install are passed directly as parameters of the Machine and Machineset resources for the master and worker. As a result these tags cannot be updated by consulting the Infrastructure resource of the cluster where the user defined tags are written.
The MCO should be changed such that during provisioning the MCO looks up the values of the tags in the Infrastructure resource and adds the tags during creation of the EC2 resources. The MCO should also watch the infrastructure resource for changes and when the resource tags are updated it should update the tags on the EC2 instances without restarts.
Acceptance Criteria:
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Feature --->
<--- Remove the descriptive text as appropriate --->
Problem
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Running the OPCT with the latest version (v0.1.0) on OCP 4.11.0, the openshift-tests is reporting an incorrect counter for the "total" field.
In the example below, after the 1127th test, the total follows the same counter of executed. I also would assume that the total is incorrect before that point as the test continues the execution increases both counters.
openshift-tests output format: [failed/executed/total]
started: (0/1126/1127) "[sig-storage] PersistentVolumes-expansion loopback local block volume should support online expansion on node [Suite:openshift/conformance/parallel] [Suite:k8s]" passed: (38s) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]" started: (0/1127/1127) "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]" passed: (6.6s) 2022-08-09T17:12:21 "[sig-storage] Downward API volume should provide container's memory request [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]" started: (0/1128/1128) "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (immediate binding)] topology should fail to schedule a pod which has topologies that conflict with AllowedTopologies [Suite:openshift/conformance/parallel] [Suite:k8s]" skip [k8s.io/kubernetes@v1.24.0/test/e2e/storage/framework/testsuite.go:116]: Driver local doesn't support GenericEphemeralVolume -- skipping Ginkgo exit error 3: exit with code 3 skipped: (400ms) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]" started: (0/1129/1129) "[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Dynamic PV (default fs)] capacity provides storage capacity information [Suite:openshift/conformance/parallel] [Suite:k8s]"
OPCT output format [executed/total (failed failures)]
Tue, 09 Aug 2022 14:12:13 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1112/1127 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:23 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1120/1127 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:33 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1139/1139 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:43 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1185/1185 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor... Tue, 09 Aug 2022 14:12:53 -03> Global Status: running JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE openshift-conformance-validated | running | | 1188/1188 (0 failures) | status=running openshift-kube-conformance | complete | | 352/352 (0 failures) | waiting for post-processor...
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic and data privacy.
Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.
The PR should include the following:
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard to diagnose problem across the stack. The alternative is to create a point-to-point network connectivity capability. this would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.
To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories. Two common libraries are Go's resolver and glibc's resolver. A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension. It would also make sense to test Java or other popular languages or runtimes that have their own resolvers.
Additionally, as talked about in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we also decided to add a test for testing a non-EDNS0 query for a larger than 512 byte record, as once was an issue in bug OCPBUGS-27397.
The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications.
In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".
Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.
PR https://github.com/openshift/cluster-ingress-operator/pull/663
reverted the change and made "leastconn" the default again (OCP 4.8 onwards).
The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.
The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.
It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).
Outcome: does changing the [default] weight value make it feasible to switch back to "random" as the default balancing algorithm for a future OCP release.
Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.
Requirement | Notes | isMvp? |
---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
When viewing the Installed Operators list set to 'All projects' and then selecting an operator that is available in 'All namespaces' (globally installed,) upon clicking the operator to view its details the user is taken into the details of that operator in installed namespace (project selector will switch to the install namespace.)
This can be disorienting then to look at the lists of custom resource instances and see them all blank, since the lists are showing instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.
It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240,) though that may still be some releases away. it should be considered if it's worth a "short term" fix in the meantime.
Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.
Goal
Add support for PDB (Pod Disruption Budget) to the console.
Requirements:
Designs:
During master nodes upgrade when nodes are getting drained there's currently no protection from two or more operands going down. If your component is required to be available during upgrade or other voluntary disruptions, please consider deploying PDB to protect your operands.
The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.
Example:
Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster
Note: We should consider to backport this to 4.10
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings that allows users to select the ability to update the control plane only.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.
Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes.
Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc. It is all at a pool level, they will not be able to choose specific hosts.
Requirements
Design deliverables:
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them.
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As an OpenShift engineer,
I want to initialize a validating admission webhook for the shared resource CSI driver
So that I can eventually require readOnly: true to be set on all pods that use the Shared Resource CSI Driver
None.
None.
None.
This is a prerequisite for implementing the validating admission webhook.
We need to have ART build the container image downstream so that we can add the correct image references for the CVO.
If we reference images in the CVO manifests which do not have downstream counterparts, we break the downstream build for the payload.
CI is capable of producing multiple images for a GitHub repository. For example, github.com/openshift/oc produces 4-5 images with various capabilities.
We did similar work in BUILD-234 - some of these steps are not required.
See also:
Tasks:
As a developer using SharedSecrets and ConfigMaps
I want to ensure all pods set readOnly; true on admission
So that I don't have pods stuck in the "Pending" state because of a bad volume mount
QE will need to verify the new Pod Admission behavior
Docs will need to ensure that readOnly: true is required and must be set to true.
None.
QE testing/verification of the feature - require readOnly to be true
Actions:
1. Create smoke test and submit to GitHub
2. Run script to integrate smoke test with Polarion
As an OpenShift engineer
I want the shared resource CSI Driver webhook to be installed with the cluster storage operator
So that the webhook is deployed when the CSI driver is deployed
None - no new functional capabilities will be added
None - we can verify in CI that we are deploying the webhook correctly.
None - no new functional capabilities will be added
The scope of this story is to just deploy the "hello world" webhook with the Cluster Storage Operator.
Adding the live ValidatingWebhook configuration and service will be done in a separate story.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
https://issues.redhat.com/browse/AUTH-2 revealed that, in prinicipal, Pod Security Admission is possible to integrate into OpenShift while retaining SCC functionality.
This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift
Enhancement - https://github.com/openshift/enhancements/pull/1010
ingress-operator must comply to pod security. The current audit warning is:
{ "objectRef": "openshift-ingress-operator/deployments/ingress-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.run AsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }
dns-operator must comply to restricted pod security level. The current audit warning is:
{ "objectRef": "openshift-dns-operator/deployments/dns-operator", "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unre stricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=tr ue), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.
showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects.
See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
It's based on the SERVER_FLAG controlPlaneTopology being set to External is really the driving factor here; this can be done in one of two ways:
To test work related to cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend these notifications:
For these we will need to check `ControlPlaneTopology`, if it's set to 'External' and also check if the user can edit cluster version(either by creating a hook or an RBAC call, eg. `canEditClusterVersion`)
Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need surface a message that the control plane is externally managed and add following changes:
In general, anything that changes a cluster version should be read only.
Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read only).
Summary of changes to the overview page:
Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#
Based on Cesar's comment we should be removing the `Control Plane` section, if the infrastructure.status.controlplanetopology being "External".
If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend kubeadmin notifier, from the global notifications, since it contain link for updating the cluster OAuth configuration (see attachment).
PatternFly Dark Theme Handbook: https://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit
Admin Console -> Workloads & Pods
Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run
As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.
Acceptance criteria:
As a developer, I want to be able to fix remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console.. As such, I need to make sure to either complete all remaining issues for the spreadsheet, or, create a bug or future story for any remaining issues in these two documents.
Acceptance criteria:
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
The Cluster Dashboard Details Card Protractor integration test was failing at high rate, and despite multiple attempts to fix, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.
Currently, you need to navigate to
Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins
to see and managed plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find – perhaps on the Cluster Settings page – or at least provide a more convenient link somewhere in the UI.
AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:
We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places were you don't want or cant use the component, eg. times on a chart.
This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.
AC:
In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded.
In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654
AC:
Follow up of https://issues.redhat.com/browse/CONSOLE-3159
We need to provide a base for running integration tests using the dynamic plugins. The tests should initially
Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.
https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin
https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md
https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk
Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.
The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.
AC:
Goal
Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267
High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema
Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.
The oc implementation landed via https://github.com/openshift/oc/pull/961.
Design
See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#
See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932
The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.
Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in update path visualization.
Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.
Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However this might cause situations in which all pods end up on hosts of a single zone, leading to a long recovery time of the registry if that zone is lost entirely. However due to problems in the past with the preferred setting of anti-affinity rule adherence the configuration was forced instead with required and the rules became constraints. With zones as constraints the internal registry would not have deployed anymore in environments with a single zone, e.g. internal CI environment. Pod topology constraints is a new API that is supported in OCP which can also relax constraints in case they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html
Acceptance criteria:
Open Questions:
As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.
Remove Jenkins from the OCP Payload.
See epic linking - need alternative non payload image available to provide relatively seamless migration
Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md
PARTIAL ANSWER ^^: confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.
assuming none
As maintainers of the OpenShift jenkins component, we need run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via sample operator overriding the jenkins sample imagestream with the jenkins payload image.
As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.
As maintainers of the OpenShift Jenkins component, we need Jenkins related tests outside of very basic Jenkins Pipieline Strategy Build Config verification, removed from openshift-tests in OpenShift Origin, using a non-payload, CPaas image pertinent to the branch in question.
High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)
NONE ... QE should wait until JNKS-254
NONE
NONE
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Possible staging
1) before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin by taking the image built by the image (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PRs image
2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set
3) or instead of 2) you update the Makefiles for the plugins to call a script that does the same sort of thing, see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`
https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster
After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.
DoD:
Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.
The limits that we want to look into for OCP are the following ones:
# Per-scrape limit on number of labels that will be accepted for a sample. If # more than this number of labels are present post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_limit: <int> | default = 0 ] # Per-scrape limit on length of labels name that will be accepted for a sample. # If a label name is longer than this number post metric-relabeling, the entire # scrape will be treated as failed. 0 means no limit. [ label_name_length_limit: <int> | default = 0 ] # Per-scrape limit on length of labels value that will be accepted for a sample. # If a label value is longer than this number post metric-relabeling, the # entire scrape will be treated as failed. 0 means no limit. [ label_value_length_limit: <int> | default = 0 ]
We could benefit from them by setting relatively high values that could only induce unbound cardinality and thus reject the targets completely if they happened to breach our constrainst.
DoD:
When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data send.
Technically this can be achieved today, by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum, that can be used for this purpose.
Expose a flag in the CMO configuration, that is false by default (keeps backward compatibility) and when set to true will add the _id label to a remote_write configuration. More specifically it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expect, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.
We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics send actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics send via remote write and confirm they have the expected label.
Add an optional boolean flag to CMOs definition of RemoteWriteSpec that if true adds an entry in the specs WriteRelabelConfigs list.
I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the {}tmp_openshift_cluster_id{} label, seems unlikely) and reduces overall complexity, as well as documentation complexity.
The entry should look like what is already added to the telemetry remote write config and it should be added as the first entry in the list, before any user supplied relabel configs.
The potential target ServiceMonitors are:
As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and needs inspection into each ServiceBinding resource (yaml).
See also https://docs.google.com/document/d/1OzE74z2RGO5LPjtDoJeUgYBQXBSVmD5tCC7xfJotE00/edit
As a user, I want the topology view to be less cluttered as I doom out showing only information that I can discern and still be able to get a feel for the status of my project.
This epic is mainly focused on the 4.10 Release QE activities
1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis
To the track the QE progress at one place in 4.10 Release Confluence page
Acceptance criteria:
This epic covers a number of customer requests(RFEs) as well as increases usability.
Customer satisfaction as well as improved usability.
None
As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.
Form component https://github.com/openshift/console/pull/11227
As a user, I want to use a form to create Deployments
Edit deployment form ODC-5007
Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on prem clusters.
In order to improve properly analyze usage and the user experience, we need to be able to gather as much data as possible.
// JS type
telemetry?: Record<string, string>
./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy ./bin/bridge --telemetry CONSOLE_LOG=debug
Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support
tl;dr
oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).
In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).
If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.
More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla
Rename Provider to Infrastructure Provider
Add GPU Provider
https://miro.com/app/board/uXjVOeUB2B4=/?moveToWidget=3458764514332229879&cot=14
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges
No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.
We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.
This likely warrants an OpenShift blog post, potentially?
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
Whereabouts doesn't allow the use of network interface names that are not preceded by the prefix "net", see https://github.com/k8snetworkplumbingwg/whereabouts/issues/130.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Define two Pods, one with the interface name 'port1' and the other with 'net-port1':
test-ip-removal-port1: k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "port1", "namespace": "default" } ] test-ip-removal-net-port1: k8s.v1.cni.cncf.io/networks: [ { "name": "test-sriovnd", "interface": "net-port1", "namespace": "default" } ]
2. IP allocated in the IPPool:
kind: IPPool ... spec: allocations: "16": id: ... podref: test-ecoloma-1/test-ip-removal-port1 "17": id: ... podref: test-ecoloma-1/test-ip-removal-net-port1
3. When the ip-reconciler job is run, the allocation for the port with the interface name 'port1' is removed:
[13:29][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 14m 11d
[13:29][]$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus 2001-1b70-820d-2610---64 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
...
spec:
allocations:
"17":
id: ...
podref: test-ecoloma-1/test-ip-removal-net-port1
range: 2001:1b70:820d:2610::/64
[13:30][]$ oc get cronjob -n openshift-multus
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
ip-reconciler */15 * * * * False 0 9s 11d
Actual results:
The network interface with a name that doesn't have a 'net' prefix is removed from the ip-reconciler cronjob.
Expected results:
The network interface must not be removed, regardless of the name.
Additional info:
Upstream PR @ https://github.com/k8snetworkplumbingwg/whereabouts/pull/147 master PR @ https://github.com/openshift/whereabouts-cni/pull/94
Description of problem:
The metal3 Pod of openshift-machine-api is in CrashLoopBackOff status.
Version-Release number of selected component (if applicable):
4.10.31
How reproducible:
Always reproduce in IPv6
Steps to Reproduce:
1. Preparing to configure ipi on the provisioning node - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..) 2. configuring the install-config.yaml (attached) - provisioningNetwork: disabled - machine network: only IPv6 - disconnected installation 3. deploy the cluster
Actual results:
It is not possible to add worker nodes because metal3 does not start normally.
Expected results:
metal3 starts normally in IPv6 environment
Additional info:
1. attached must-gather https://drive.google.com/file/d/1GKxj3syROIMnURx_PYzOYhJdEuXBNXVW/view?usp=sharing 2. pod status [kni@prov ~]$ oc get pod NAME READY STATUS RESTARTS AGE cluster-autoscaler-operator-6656bfd7b9-bt4j8 2/2 Running 0 35h cluster-baremetal-operator-6bbdd6758-rmxgq 2/2 Running 0 35h machine-api-controllers-55fb545b56-kl5sj 7/7 Running 0 34h machine-api-operator-845b6cf855-q7gdd 2/2 Running 0 35h metal3-574876cfdb-98fmz 6/7 CrashLoopBackOff 179 (3m17s ago) 14h metal3-image-cache-5mq2w 1/1 Running 0 14h metal3-image-cache-nftpj 1/1 Running 0 14h metal3-image-cache-t7whh 1/1 Running 0 14h metal3-image-customization-68d4d6d99b-dbqgn 1/1 Running 0 15h [kni@prov ~]$ oc logs metal3-574876cfdb-98fmz -c metal3-httpd AH00526: Syntax error on line 8 of /etc/httpd/conf.d/vmedia.conf: The port number "2001:feed:101::102" is outside the appropriate range (i.e., 1..65535).
This is a clone of issue OCPBUGS-2438. The following is the description of the original issue:
—
Description of problem:
On the alert details page and alerting rule details page, clicking on a field that has a popover help throws an uncaught JavaScript error.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to Observe > Alerting pages 2. Click on an alert (or go to the rules tab then click on a rule) 3. Click on one of the underlined fields (those that have a popover help)
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3111. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-2992. The following is the description of the original issue:
—
Description of problem:
The metal3-ironic container image in OKD fails during steps in configure-ironic.sh that look for additional Oslo configuration entries as environment variables to configure the Ironic instance. The mechanism by which it fails in OKD but not OpenShift is that the image for OpenShift happens to have unrelated variables set which match the regex, because it is based on the builder image, but the OKD image is based only on a stream8 image without these unrelated OS_ prefixed variables set. The metal3 pod created in response to even a provisioningNetwork: Disabled Provisioning object will therefore crashloop indefinitely.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Deploy OKD to a bare metal cluster using the assisted-service, with the OKD ConfigMap applied to podman play kube, as in :https://github.com/openshift/assisted-service/tree/master/deploy/podman#okd-configuration 2. Observe the state of the metal3 pod in the openshift-machine-api namespace.
Actual results:
The metal3-ironic container repeatedly exits with nonzero, with the logs ending here: ++ export IRONIC_URL_HOST=10.1.1.21 ++ IRONIC_URL_HOST=10.1.1.21 ++ export IRONIC_BASE_URL=https://10.1.1.21:6385 ++ IRONIC_BASE_URL=https://10.1.1.21:6385 ++ export IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050 ++ IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050 ++ '[' '!' -z '' ']' ++ '[' -f /etc/ironic/ironic.conf ']' ++ cp /etc/ironic/ironic.conf /etc/ironic/ironic.conf_orig ++ tee /etc/ironic/ironic.extra # Options set from Environment variables ++ echo '# Options set from Environment variables' ++ env ++ grep '^OS_' ++ tee -a /etc/ironic/ironic.extra
Expected results:
The metal3-ironic container starts and the metal3 pod is reported as ready.
Additional info:
This is the PR that introduced pipefail to the downstream ironic-image, which is not yet accepted in the upstream: https://github.com/openshift/ironic-image/pull/267/files#diff-ab2b20df06f98d48f232d90f0b7aa464704257224862780635ec45b0ce8a26d4R3 This is the line that's failing: https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/scripts/configure-ironic.sh#L57 This is the image base that OpenShift uses for ironic-image (before rewriting in ci-operator): https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.ocp#L9 Here is where the relevant environment variables are set in the builder images for OCP: https://github.com/openshift/builder/blob/973602e0e576d7eccef4fc5810ba511405cd3064/hack/lib/build/version.sh#L87 Here is the final FROM line in the OKD image build (just stream8): https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.okd#L9 This results in the following differences between the two images: $ podman run --rm -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:519ac06836d972047f311de5e57914cf842716e22a1d916a771f02499e0f235c -c 'env | grep ^OS_' OS_GIT_MINOR=11 OS_GIT_TREE_STATE=clean OS_GIT_COMMIT=97530a7 OS_GIT_VERSION=4.11.0-202210061001.p0.g97530a7.assembly.stream-97530a7 OS_GIT_MAJOR=4 OS_GIT_PATCH=0 $ podman run --rm -it --entrypoint bash quay.io/openshift/okd-content@sha256:6b8401f8d84c4838cf0e7c598b126fdd920b6391c07c9409b1f2f17be6d6d5cb -c 'env | grep ^OS_' Here is what the OS_ prefixed variables should be used for: https://github.com/metal3-io/ironic-image/blob/807a120b4ce5e1675a79ebf3ee0bb817cfb1f010/README.md?plain=1#L36 https://opendev.org/openstack/oslo.config/src/commit/84478d83f87e9993625044de5cd8b4a18dfcaf5d/oslo_config/sources/_environment.py It's worth noting that ironic.extra is not consumed anywhere, and is simply being used here to save off the variables that Oslo _might_ be consuming (it won't consume the variables that are present in the OCP builder image, though they do get caught by this regex). With pipefail set, grep returns non-zero when it fails to find an environment variable that matches the regex, as in the case of the OKD ironic-image builds.
Description of problem:
Create Loadbalancer type service within the OCP 4.11.x OVNKubernetes cluster to expose the api server endpoint, the service does not response for normal oc request. But some of them are working, like "oc whoami", "oc get --raw /api"
Version-Release number of selected component (if applicable):
4.11.8 with OVNKubernetes
How reproducible:
always
Steps to Reproduce:
1. Setup openshift cluster 4.11 on AWS with OVNKubernetes as the default network 2. Create the following service under openshift-kube-apiserver namespace to expose the api ---- apiVersion: v1 kind: Service metadata: annotations: service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1800" finalizers: - service.kubernetes.io/load-balancer-cleanup name: test-api namespace: openshift-kube-apiserver spec: allocateLoadBalancerNodePorts: true externalTrafficPolicy: Cluster internalTrafficPolicy: Cluster ipFamilies: - IPv4 ipFamilyPolicy: SingleStack loadBalancerSourceRanges: - <my_ip>/32 ports: - nodePort: 31248 port: 6443 protocol: TCP targetPort: 6443 selector: apiserver: "true" app: openshift-kube-apiserver sessionAffinity: None type: LoadBalancer 3. Setup the DNS resolution for the access xxx.mydomain.com ---> <elb-auto-generated-dns> 4. Try to access the cluster api via the service above by updating the kubeconfig to use the custom dns name
Actual results:
No response from the server side. $ time oc get node -v8 I1025 08:29:10.284069 103974 loader.go:375] Config loaded from file: bmeng.kubeconfig I1025 08:29:10.294017 103974 round_trippers.go:420] GET https://rh-api.bmeng-ccs-ovn.3o13.s1.devshift.org:6443/api/v1/nodes?limit=500 I1025 08:29:10.294035 103974 round_trippers.go:427] Request Headers: I1025 08:29:10.294043 103974 round_trippers.go:431] Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json I1025 08:29:10.294052 103974 round_trippers.go:431] User-Agent: oc/openshift (linux/amd64) kubernetes/e40bd2d I1025 08:29:10.365119 103974 round_trippers.go:446] Response Status: 200 OK in 71 milliseconds I1025 08:29:10.365142 103974 round_trippers.go:449] Response Headers: I1025 08:29:10.365148 103974 round_trippers.go:452] Audit-Id: 83b9d8ae-05a4-4036-bff6-de371d5bec12 I1025 08:29:10.365155 103974 round_trippers.go:452] Cache-Control: no-cache, private I1025 08:29:10.365161 103974 round_trippers.go:452] Content-Type: application/json I1025 08:29:10.365167 103974 round_trippers.go:452] X-Kubernetes-Pf-Flowschema-Uid: 2abc2e2d-ada3-4cb8-a86f-235df3a4e214 I1025 08:29:10.365173 103974 round_trippers.go:452] X-Kubernetes-Pf-Prioritylevel-Uid: 02f7a188-43c7-4827-af58-5ebe861a1891 I1025 08:29:10.365179 103974 round_trippers.go:452] Date: Tue, 25 Oct 2022 08:29:10 GMT ^C real 17m4.840s user 0m0.567s sys 0m0.163s However, it has the correct response if using --raw to request, eg: $ oc get --raw /api/v1 --kubeconfig bmeng.kubeconfig {"kind":"APIResourceList","groupVersion":"v1","resources":[{"name":"bindings","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"componentstatuses","singularName":"","namespaced":false,"kind":"ComponentStatus","verbs":["get","list"],"shortNames":["cs"]},{"name":"configmaps","singularName":"","namespaced":true,"kind":"ConfigMap","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["cm"],"storageVersionHash":"qFsyl6wFWjQ="},{"name":"endpoints","singularName":"","namespaced":true,"kind":"Endpoints","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ep"],"storageVersionHash":"fWeeMqaN/OA="},{"name":"events","singularName":"","namespaced":true,"kind":"Event","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ev"],"storageVersionHash":"r2yiGXH7wu8="},{"name":"limitranges","singularName":"","namespaced":true,"kind":"LimitRange","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["limits"],"storageVersionHash":"EBKMFVe6cwo="},{"name":"namespaces","singularName":"","namespaced":false,"kind":"Namespace","verbs":["create","delete","get","list","patch","update","watch"],"shortNames":["ns"],"storageVersionHash":"Q3oi5N2YM8M="},{"name":"namespaces/finalize","singularName":"","namespaced":false,"kind":"Namespace","verbs":["update"]},{"name":"namespaces/status","singularName":"","namespaced":false,"kind":"Namespace","verbs":["get","patch","update"]},{"name":"nodes","singularName":"","namespaced":false,"kind":"Node","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["no"],"storageVersionHash":"XwShjMxG9Fs="},{"name":"nodes/proxy","singularName":"","namespaced":false,"kind":"NodeProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"nodes/status","singularName":"","namespaced":false,"kind":"Node","verbs":["get","patch","update"]},{"name":"persistentvolumeclaims","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pvc"],"storageVersionHash":"QWTyNDq0dC4="},{"name":"persistentvolumeclaims/status","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["get","patch","update"]},{"name":"persistentvolumes","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pv"],"storageVersionHash":"HN/zwEC+JgM="},{"name":"persistentvolumes/status","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["get","patch","update"]},{"name":"pods","singularName":"","namespaced":true,"kind":"Pod","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["po"],"categories":["all"],"storageVersionHash":"xPOwRZ+Yhw8="},{"name":"pods/attach","singularName":"","namespaced":true,"kind":"PodAttachOptions","verbs":["create","get"]},{"name":"pods/binding","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"pods/ephemeralcontainers","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"pods/eviction","singularName":"","namespaced":true,"group":"policy","version":"v1","kind":"Eviction","verbs":["create"]},{"name":"pods/exec","singularName":"","namespaced":true,"kind":"PodExecOptions","verbs":["create","get"]},{"name":"pods/log","singularName":"","namespaced":true,"kind":"Pod","verbs":["get"]},{"name":"pods/portforward","singularName":"","namespaced":true,"kind":"PodPortForwardOptions","verbs":["create","get"]},{"name":"pods/proxy","singularName":"","namespaced":true,"kind":"PodProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"pods/status","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"podtemplates","singularName":"","namespaced":true,"kind":"PodTemplate","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"LIXB2x4IFpk="},{"name":"replicationcontrollers","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["rc"],"categories":["all"],"storageVersionHash":"Jond2If31h0="},{"name":"replicationcontrollers/scale","singularName":"","namespaced":true,"group":"autoscaling","version":"v1","kind":"Scale","verbs":["get","patch","update"]},{"name":"replicationcontrollers/status","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["get","patch","update"]},{"name":"resourcequotas","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["quota"],"storageVersionHash":"8uhSgffRX6w="},{"name":"resourcequotas/status","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["get","patch","update"]},{"name":"secrets","singularName":"","namespaced":true,"kind":"Secret","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"S6u1pOWzb84="},{"name":"serviceaccounts","singularName":"","namespaced":true,"kind":"ServiceAccount","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["sa"],"storageVersionHash":"pbx9ZvyFpBE="},{"name":"serviceaccounts/token","singularName":"","namespaced":true,"group":"authentication.k8s.io","version":"v1","kind":"TokenRequest","verbs":["create"]},{"name":"services","singularName":"","namespaced":true,"kind":"Service","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["svc"],"categories":["all"],"storageVersionHash":"0/CO1lhkEBI="},{"name":"services/proxy","singularName":"","namespaced":true,"kind":"ServiceProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"services/status","singularName":"","namespaced":true,"kind":"Service","verbs":["get","patch","update"]}]}
Expected results:
The normal oc request should be working.
Additional info:
There is no such issue for clusters with openshift-sdn with the same OpenShift version and same LoadBalancer service. We suspected that it might be related to the MTU setting, but this cannot explain why OpenShiftSDN works well. Another thing might be related is that the OpenShiftSDN is using iptables for service loadbalancing and OVN is dealing that within the OVN services.
Please let me know if any debug log/info is needed.
Just like kube proxy, ovnk should expose port 10256 on every node, so that cloud LBs can send health checks and know which nodes are available. This is relevant for services with externalTrafficPolicy=Cluster.
This bug represents a backport of CCO-222 to release-4.11.
See these threads https://coreos.slack.com/archives/G01F05P2PTL/p1645982017061749?thread_ts=1645970469.871559&cid=G01F05P2PTL for more information
This is a clone of issue OCPBUGS-501. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable): 4.10.16
How reproducible: Always
Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field
$ oc get apiserver cluster -o yaml
spec:
audit:
customRules:
2. Allow the kube-apiserver pods to rollout new revision.
3. Once the kube-apiserver pods are in new revision execute $ oc get dc
Actual results:
Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)
Expected results: The command "oc get dc" should display the deploymentconfig without any error.
Additional info:
Description of problem:
The ovn-kubernetes ovnkube-master containers are continuously crashlooping since we updated to 4.11.0-0.okd-2022-10-15-073651.
Log Excerpt:
] [] [] [{kubectl-client-side-apply Update networking.k8s.io/v1 2022-09-12 12:25:06 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:kubectl.kubernetes.io/last-applied-configuration":{}}},"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:&v1.LabelSelector{MatchLabels:map[string]string{access: true,},MatchExpressions:[]LabelSelectorRequirement{},},NamespaceSelector:nil,IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},} &NetworkPolicy{ObjectMeta:{allow-from-openshift-ingress compsci-gradcentral a405f843-c250-40d7-8dd4-a759f764f091 217304038 1 2022-09-22 14:36:38 +0000 UTC <nil> <nil> map[] map[] [] [] [{openshift-apiserver Update networking.k8s.io/v1 2022-09-22 14:36:38 +0000 UTC FieldsV1 {"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:nil,NamespaceSelector:&v1.LabelSelector{MatchLabels:map[string]string{policy-group.network.openshift.io/ingress: ,},MatchExpressions:[]LabelSelectorRequirement{},},IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},}]: cannot clean up egress default deny ACL name: error in transact with ops [{Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {ccdd01bf-3009-42fb-9672-e1df38190cd7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {10bbf229-8c1b-4c62-b36e-4ba0097722db}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {7b55ba0c-150f-4a63-9601-cfde25f29408}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {60cb946a-46e9-4623-9ba4-3cb35f018ed6}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s)
Additional info:
https://github.com/okd-project/okd/issues/1372 Issue persisted through update to 4.11.0-0.okd-2022-10-28-153352 must-gather: https://nbc9-snips.cloud.duke.edu/snips/must-gather.local.2859117512952590880.zip
OVS 2.17+ introduced an optimization of "weak references" to substantially speed up database snapshots. in some cases weak references may leak memory; to aforementioned commit fixes that and has been pulled into ovs2.17-62 and later.
The relevant code in ironic-image was not updated to support TLS, so it still uses the old port and explicit http://
Description of problem: Backport owners for 4.11 (only)
This is a clone of issue OCPBUGS-5100. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5068. The following is the description of the original issue:
—
Description of problem:
virtual media provisioning fails when iLO Ironic driver is used
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. attempt virtual media provisioning on a node configured with ilo-virtualmedia:// drivers 2. 3.
Actual results:
Provisioning fails with "An auth plugin is required to determine endpoint URL" error
Expected results:
Provisioning succeeds
Additional info:
Relevant log snippet: 3742 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector [None req-e58ac1f2-fac6-4d28-be9e-983fa900a19b - - - - - -] Unable to start managed inspection for node e4445d43-3458-4cee-9cbe-6da1de75 78cd: An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL 3743 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last): 3744 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection 3745 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector task.driver.boot.prepare_ramdisk(task, ramdisk_params=params) 3746 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic_lib/metrics.py", line 59, in wrapped 3747 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector result = f(*args, **kwargs) 3748 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/ilo/boot.py", line 408, in prepare_ramdisk 3749 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector iso = image_utils.prepare_deploy_iso(task, ramdisk_params, 3750 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 624, in prepare_deploy_iso 3751 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector return prepare_iso_image(inject_files=inject_files) 3752 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 537, in _prepare_iso_image 3753 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector image_url = img_handler.publish_image( 3754 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 193, in publish_image 3755 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector swift_api = swift.SwiftAPI() 3756 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector File "/usr/lib/python3.9/site-packages/ironic/common/swift.py", line 66, in __init__ 3757 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector endpoint = keystone.get_endpoint('swift', session=session)
Description of problem:
Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
the following KCS details steps to add a proxy.
The steps have been verified at 4.7 but do not work at 4.8, 4.9 or 4.10
https://access.redhat.com/solutions/6172402
When testing at 4.8, 4.9 and 4.10 the proxy setting where also nested under `telemeterClient`
which triggered a telemeter restart but the proxy setting do not get set in the deployment as they do in 4.7
Actual results:
4.8, 4.9 and 4.10 without the nested `telemeterClient`
does not trigger a restart of the telemeter pod
Expected results:
I think the proxy setting should be nested under telemeterClient
but should set the environment variables in the deployment
Additional info:
This is a backport of https://bugzilla.redhat.com/show_bug.cgi?id=2116382 from 4.12 to 4.11.z. Creating manually because as seen in https://github.com/openshift/cluster-monitoring-operator/pull/1743 `/cherry-pick` doesn't work for bugs originally created in bugzilla
Description of problem:
In order to understand what is going on with OCPBUGS-5379 we want to add more logs
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
For OVNK to become CNCF complaint, we need to support session affinity timeout feature and enable the e2e's on OpenShift side. This bug tracks the efforts to get this into 4.12 OCP.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This bug is a backport clone of [Bugzilla Bug 2089950](https://bugzilla.redhat.com/show_bug.cgi?id=2089950). The following is the description of the original bug:
—
Description of problem: Some upgrades failed during scale testing with messages indicating the console operator is not available. In total 5 out of 2200 clusters failed with this pattern.
These clusters are all configured with the Console operator disabled in order to reduce overall OCP cpu use in the Telecom environment. The following CR is applied:
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "false"
include.release.openshift.io/self-managed-high-availability: "false"
include.release.openshift.io/single-node-developer: "false"
release.openshift.io/create-only: "true"
ran.openshift.io/ztp-deploy-wave: "10"
name: cluster
spec:
logLevel: Normal
managementState: Removed
operatorLogLevel: Normal
From one cluster (sno01175) the ClusterVersion conditions show:
' | jq
[
,
,
,
,
,
{ "lastTransitionTime": "2022-05-24T13:57:05Z", "message": "Cluster operator kube-apiserver should not be upgraded between minor versions: KubeletMinorVersionUpgradeable: Kubelet minor version (1.22.5+5c84e52) on node sno01175 will not be supported in the next OpenShift minor version upgrade.", "reason": "KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade", "status": "False", "type": "Upgradeable" }]
Another cluster (sno01959) has very similar conditions with slight variation in the Failing and Progressing messages:
,
,
Version-Release number of selected component (if applicable): 4.9.26 upgrade to 4.10.13
How reproducible: 5 out of 2200
Steps to Reproduce:
1. Disable console with managementState: Removed
2. Starting OCP version 4.9.26
3. Initiate upgrade to 4.10.13 via ClusterVersion CR
Actual results: Cluster upgrade is stuck (no longer progressing) for 5+ hours
Expected results: Cluster upgrade completes
Additional info:
This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:
—
The vSphere CSI cloud.conf lists the single datacenter from platform workspace config but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918 ) there may be more than the one datacenter.
This issue is resulting in PVs failing to attach because the virtual machines can't be find in any other datacenter. For example:
0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found
The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.
$ oc get cm vsphere-csi-config -o yaml -n openshift-cluster-csi-drivers | grep datacenters
datacenters = "IBMCloud"
Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginX bug report). In such cases, the Installer errors with:
level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []
Listing container object suffers from the same issue as listing the containers and this one isn't fixed in latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and fixing it with https://github.com/gophercloud/gophercloud/issues/2510, however we likely won't be able to backport the bump to gophercloud master back to release-4.8 so we'll have to look for alternatives.
I'm setting the priority to critical as it's causing all our jobs to fail in master.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-10514. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10221. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5469. The following is the description of the original issue:
—
Description of problem:
When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails` but when the channel is changed to a 4.11 channel multiple new risks require evaluation and the evaluation of new risks is throttled at one every 10 minutes. This means if there are three new risks it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended because most will not wait 30 minutes, they expect immediate feedback.
Version-Release number of selected component (if applicable):
4.10.z, 4.11.z, 4.12, 4.13
How reproducible:
100%
Steps to Reproduce:
1. Install 4.10.34 2. Switch from stable-4.10 to stable-4.11 3.
Actual results:
Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them
Expected results:
Risks are computed in a timely manner for an interactive UX, lets say < 10s
Additional info:
This was intentional in the design, we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack, however we didn't anticipate that we'd have long standing pile of risks and realize how confusing the user experience would be. We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always` avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.
This is a clone of issue OCPBUGS-858. The following is the description of the original issue:
—
Description of problem:
In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 in upgraded to 4.9, the packageserver stays stuck in v0.17.0, which is the version in OCP 4.8, and v0.18.3 does not roll out, which is the version in OCP 4.9
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.8 2. Upgrade to OCP 4.9 $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.8.0-0.nightly-2022-08-31-160214 True True 50m Working towards 4.9.47: 619 of 738 done (83% complete) $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.9.47 True False 4m26s Cluster version is 4.9.47
Actual results:
Check packageserver CSV. It's in v0.17.0 $ oc get csv NAME DISPLAY VERSION REPLACES PHASE packageserver Package Server 0.17.0 Succeeded
Expected results:
packageserver CSV is at 0.18.3
Additional info:
packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12 packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8
This is a clone of issue OCPBUGS-4504. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-1557. The following is the description of the original issue:
—
Seen in an instance created recently by a 4.12.0-ec.2 GCP provider:
"scheduling": { "automaticRestart": false, "onHostMaintenance": "MIGRATE", "preemptible": false, "provisioningModel": "STANDARD" },
From GCP's docs, they may stop instances on hardware failures and other causes, and we'd need automaticRestart: true to auto-recover from that. Also from GCP docs, the default for automaticRestart is true. And on the Go provider side, we doc:
If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".
But the implementing code does not actually float the setting. Seems like a regression here, which is part of 4.10:
$ git clone https://github.com/openshift/machine-api-provider-gcp.git $ cd machine-api-provider-gcp $ git log --oneline origin/release-4.10 | grep 'migrate to openshift/api' 44f0f958 migrate to openshift/api
But that's not where the 4.9 and earlier code is located:
$ git branch -a | grep origin/release remotes/origin/release-4.10 remotes/origin/release-4.11 remotes/origin/release-4.12 remotes/origin/release-4.13
Hunting for 4.9 code:
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.48-x86_64 | grep gcp gcp-machine-controllers https://github.com/openshift/cluster-api-provider-gcp c955c03b2d05e3b8eb0d39d5b4927128e6d1c6c6 gcp-pd-csi-driver https://github.com/openshift/gcp-pd-csi-driver 48d49f7f9ef96a7a42a789e3304ead53f266f475 gcp-pd-csi-driver-operator https://github.com/openshift/gcp-pd-csi-driver-operator d8a891de5ae9cf552d7d012ebe61c2abd395386e
So looking there:
$ git clone https://github.com/openshift/cluster-api-provider-gcp.git $ cd cluster-api-provider-gcp $ git log --oneline | grep 'migrate to openshift/api' ...no hits... $ git grep -i automaticRestart origin/release-4.9 | grep -v '"description"\|compute-gen.go' origin/release-4.9:vendor/google.golang.org/api/compute/v1/compute-api.json: "automaticRestart": {
Not actually clear to me how that code is structured. So 4.10 and later GCP machine-API providers are impacted, and I'm unclear on 4.9 and earlier.
This is a clone of issue OCPBUGS-4311. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4305. The following is the description of the original issue:
—
Description of problem:
Please add an option to DISABLE debug in ironic-api. Presently it is enabled by default and there is no way to disable it or reduce log level
https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
Version-Release number of selected component (if applicable): none
How reproducible: Every time
Steps to Reproduce:
Please check source code here: https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
It is enabled by default and there is no way to disable it or reduce log level
Actual results:
Please check Case: 03371411, the log file grew to 409 GB
Expected results: Need a way to disable debug
Additional info: Case 03371411. A cluster must gather and log file can be found in the case.
This is a clone of issue OCPBUGS-6850. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6503. The following is the description of the original issue:
—
Description of problem:
While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has a ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the ugprade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not and only check while still on the base version.
Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired' Jan 6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...
Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Version-Release number of selected component (if applicable):
4.11+ openshift-tests
How reproducible:
nondeterministic, wild guess is ~30% of upgrade jobs
Steps to Reproduce:
1. Inspect the E2E test log of an upgrade jobs and compare the time of the update ("Completed upgrade") with the time of the last check ( "Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified') done by the admin ack test
Actual results:
Jan 23 00:47:43.842: INFO: Admin Ack verified Jan 23 00:57:43.836: INFO: Admin Ack verified Jan 23 01:07:43.839: INFO: Admin Ack verified Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8
No check done after the upgrade
Expected results:
Jan 23 00:57:37.894: INFO: Admin Ack verified Jan 23 01:07:37.894: INFO: Admin Ack verified Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d Jan 23 01:17:57.937: INFO: Admin Ack verified
One or more checks done after upgrade
Following the trail
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1139
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1175
https://github.com/openshift/aws-ebs-csi-driver/pull/206
Looks like the fix should be in 4.12, but it still see it being 39 vs ~24 on an m6i instance type.
It seems that the kubelet applies this capacity to the node in 4.11 and earlier and, thus, unlikely to receive this fix for attachable volumes in the upstream CSI driver. 4.12 behavior is currently unknown but it seems that the kubelet might still be setting this capacity.
The actual issue is that kube scheduler schedules pods that require PVs to nodes where those PVs can not be attached.
This is a clone of issue OCPBUGS-10372. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-10220. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7559. The following is the description of the original issue:
—
Description of problem:
When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.
Version-Release number of selected component (if applicable):
4.12.3
How reproducible:
Consistent
Steps to Reproduce:
1. On a long lived cluster, add a new machineset
Actual results:
Machines reach "Provisioned" but don't join the cluster
Expected results:
Machines join cluster as nodes
Additional info:
Tracker bug for bootimage bump in 4.11. This bug should block bugs which need a bootimage bump to fix.
This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:
—
Frequently we see the loading state of the topology view, even when there aren't many resources in the project.
Including an example
topology will sometimes hang with the loading indicator showing indefinitely
topology should load consistently without fail
intermittent
4.9
Description of problem: Issue described in following issue: https://github.com/openshift/multus-admission-controller/issues/40
Fixed in: https://github.com/openshift/cluster-network-operator/pull/1515
Version-Release number of selected component (if applicable): OCP 4.10
Official Red Hat tracker. Issue has been merged already.
1. Proposed title of this feature request
--> Alert generation when the etcd container memory consumption goes beyond 90%
2. What is the nature and description of the request?
--> When the etcd database starts growing rapidly due to some high number of objects like secrets, events, or configmap generation by application/workload, the memory and CPU consumption of APIserver and etcd container (control plane component) spikes up and eventually the control plane nodes goes to hung/unresponsive or crash due to out of memory errors as some of the critical processes/services running on master nodes get killed. Hence we request an alert/alarm when the ETCD container's memory consumption goes beyond 90% so that the cluster administrator can take some action before the cluster/nodes go unresponsive.
I see we already have a etcdExcessiveDatabaseGrowth Prometheus rule which helps when the surge in etcd writes leading to a 50% increase in database size over the past four hours on etcd instance however it does not consider the memory consumption:
$ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9
- alert: etcdExcessiveDatabaseGrowth
annotations:
description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
leading to 50% increase in database size over the past four hours on etcd
instance {{ $labels.instance }}, please check as it might be disruptive.'
expr: |
increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
for: 10m
labels:
severity: warning
3. Why does the customer need this? (List the business requirements here)
--> Once the etcd memory consumption goes beyond 90-95% of total ram as it's system critical container, the OCP cluster goes unresponsive causing revenue loss to business and impacting the productivity of users of the openshift cluster.
4. List any affected packages or components.
--> etcd
Description of problem:
The alibabacloud client "aliyun" would be used when pre-configuring some resources (e.g. VPC, bastion host, etc.) before launching an OCP cluster with customization.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-5016. The following is the description of the original issue:
—
Description of problem:
When editing any pipeline in the openshift console, the correct content cannot be obtained (the obtained information is the initial information).
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel -> Actions -> Edit Pipeline -> YAML view
Actual results:
displayed content is incorrect.
Expected results:
Get the content of the current pipeline, not the "pipeline create" content.
Additional info:
If cancel or save in the "Pipeline Builder" interface after "Edit Pipeline", can get the expected content. ~ Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel -> Actions -> Edit Pipeline -> YAML view :Display resource content normally ~
This is a clone of issue OCPBUGS-1765. The following is the description of the original issue:
—
Description of problem:
If a customer creates a machine with a networks section like this networks: - filter: {} noAllowedAddressPairs: false subnets: - filter: {} uuid: primary-subnet-uuid - filter: {} noAllowedAddressPairs: true subnets: - filter: {} uuid: other-subnet-uuid primarySubnet: primary-subnet-uuid Then all the ports are created without the allowed address pairs. Doing some research in the source code, I have found that: - For each entry on the networks: section, networks are filtered as per its filter: section[1] - Then, if the subnets: section of the network entry is not empty, for each of the network IDs found above[2], 2 things are done that are relevant for this situatoin: - The net ID is saved on a netsWithoutAllowedAddressPairs[3]. That map is later checked while creating any port[4]. - For each subnet entry that matches the network ID, a port is created[5]. So, the problematic behavior happens due to the following: - Both entries in the networks array have empty filters. This means that both entries selected all the neutron networks. - This configuration results in one port per subnet as expected because, in the later traversal of the subnets array of each entry[5], it is filtering by subnet and creating a single port as expected. - However, the entry with "noAllowedAddressPairs: true" is selecting all the neutron networks, so it adds all of them to the netsWithoutAllowedAddressPairs map[3], regardless of the subnets filtering. - As all the networks are in noAllowedAddressPairs: true array, all the ports created for the VM have their allowed address pairs removed[4]. Why do we consider this behavior undesired? I understand that, if we create a port for a network that has no allowed pairs, we create all the other ports in the same networks without the pairs. However, it is surprising that a port in a network is removed the allowed address pairs due to a setting in an entry that yielded no port on that network. In other words, one would expect that the same subnet filtering that happens on each network entry in what regards yielding ports for the VM would also work for the noAllowedPairs parameter.
Version-Release number of selected component (if applicable):
4.10.30
How reproducible:
Always
Steps to Reproduce:
1. Create a machineset like in the description 2. 3.
Actual results:
All ports have no address pairs
Expected results:
Only the port on the secondary subnet has no address pairs.
Additional info:
A simple workaround would be to just fill the filter so that a single network is selected for each network entry. References: [1] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L576 [2] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L580 [3] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L581-L583 [4] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L658-L660 [5] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L610-L625
Description of problem:
When creating a ProjectHelmChartRepository (with or without the form) and setting a display name (as `spec.name`), this value is not used in the developer catalog / Helm Charts catalog filter sidebar.
It shows (and watches) the display names of `HelmChartRepository` resources.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Switch to Developer Perspective 2. Navigate to Add > "Helm Chart repositories" 3. Enter "ibm-charts" as "Chart repository name" 4. Enter URL https://raw.githubusercontent.com/IBM/charts/master/repo/community/index.yaml as URL) 5. Press on create 6. Open the YAML editor and change the `spec.name` attribute to "IBM Charts" 7. Save the change 8. Navigate to Add > "Helm Chart"
Actual results:
The filter navigation on the left side shows "Chart Repositories" "Ibm Chart". A camel case version of the resource name.
Expected results:
It should show the "spec.name" "IBM Charts" if defined and fallback to the current implementation if the optional spec.name is not defined.
Additional info:
There is a bug discussing that the display name could not be entered directly, https://bugzilla.redhat.com/show_bug.cgi?id=2106366. This bug here is only about the catalog output.
We are in the process of moving our bug tracking to JIRA. We should update the report bug link in the help menu to use JIRA instead of Bugzilla for new bugs. Opening as a medium severity bug since this only impacts prerelease OpenShift versions. For release versions, we have users open customer cases.
Clone of https://bugzilla.redhat.com/show_bug.cgi?id=2106803 to backport the e2e fix to 4.11 and 4.10.
Description of problem: E2E: intermittent failure is seen on tests for devfile due to network call to devfile registry
Deploy git workload with devfile from topology page: A-04-TC01
Version-Release number of selected component (if applicable):
How reproducible: Intermittent
Steps to Reproduce:
1. Run test for add-flow-ci.feature to test Deploy git workload with devfile from topology page: A-04-TC01
Actual results:
Expected results: Show always pass
Additional info:
This is a clone of issue OCPBUGS-4897. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-2500. The following is the description of the original issue:
—
Description of problem:
When the Ux switches to the Dev console the topology is always blank in a Project that has a large number of components.
Version-Release number of selected component (if applicable):
How reproducible:
Always occurs
Steps to Reproduce:
1.Create a project with at least 12 components (Apps, Operators, knative Brokers) 2. Go to the Administrator Viewpoint 3. Switch to Developer Viewpoint/Topology 4. No components displayed 5. Click on 'fit to screen' 6. All components appear
Actual results:
Topology renders with all controls but no components visible (see screenshot 1)
Expected results:
All components should be visible
Additional info:
Description of problem:
Switching the spec.endpointPublishingStrategy.loadBalancer.scope of the default ingresscontroller results in a degraded ingress operator. The routes using that endpoint like the console URL become inaccessible.
Degraded operators after scope change:
$ oc get co | grep -v ' True False False' NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.11.4 False False True 72m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.kartrosa.ukld.s1.devshift.org/healthz": EOF console 4.11.4 False False False 72m RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.kartrosa.ukld.s1.devshift.org): Get "https://console-openshift-console.apps.kartrosa.ukld.s1.devshift.org": EOF ingress 4.11.4 True False True 65m The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
We have noticed that each time this happens the underlying AWS loadbalancer gets recreated which is as expected however the router pods probably do not get notified about the new loadbalancer. The instances in the new loadbalancer become 'outOfService'.
Restarting one of the router pods fixes the issue and brings back a couple of instances under the loadbalancer back to 'InService' which leads to the operators becoming happy again.
Version-Release number of selected component (if applicable):
ingress in 4.11.z however we suspect this issue to also apply to older versions
How reproducible:
Consistently reproducible
Steps to Reproduce:
1. Create a test OCP 4.11 cluster in AWS 2. Switch the spec.endpointPublishingStrategy.loadBalancer.scope of the default ingresscontroller in openshift-ingress-operator to Internal from External (or vice versa) 3. New Loadbalancer is created in AWS for the default router service, however the instances behind are not in service
Actual results:
ingress, authentication and console operators go into a degraded state. Console URL of the cluster is inaccessible
Expected results:
The ingresscontroller scope transition from internal->External (or vice versa) is smooth without any downtime or operators going into degraded state. The console is accessible.
This is a clone of issue OCPBUGS-5067. The following is the description of the original issue:
—
Description of problem:
Since coreos-installer writes to stdout, its logs are not available for us.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-6816. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6799. The following is the description of the original issue:
—
Description of problem:
The pipelines -> repositories list view in Dev Console does not show the running pipelineline as the last pipelinerun in the table.
Original BugZilla Link: https://bugzilla.redhat.com/show_bug.cgi?id=2016006
OCPBUGSM: https://issues.redhat.com/browse/OCPBUGSM-36408
This is a clone of issue OCPBUGS-533. The following is the description of the original issue:
—
Description of problem:
customer is using Azure AD as openid provider and groups synchronization from the provider.
The scenario is the following:
1)
2)
3)
The groups memberships are the same in step 2 and 3.
The cluster role bindings of the groups have never changed.
the only way to have user A again the admin rights is to delete the membership from the group and have user A login again.
I have not managed to reproduce this using RH SSO. Neither Azure AD.
But my configuration is not exactly the same yet.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Description of problem:
Created two egressIP object, egressIPs in one egressIP object cannot be applied successfully
Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-11-27-164248
How reproducible:
Frequently happen in auto case
Steps to Reproduce:
1. Label two nodes as egress nodes oc get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME huirwang-1128a-s6j6t-master-0 Ready master 154m v1.24.6+5658434 10.0.0.8 <none> Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa) 4.18.0-372.32.1.el8_6.x86_64 cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8 huirwang-1128a-s6j6t-master-1 Ready master 154m v1.24.6+5658434 10.0.0.7 <none> Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa) 4.18.0-372.32.1.el8_6.x86_64 cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8 huirwang-1128a-s6j6t-master-2 Ready master 153m v1.24.6+5658434 10.0.0.6 <none> Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa) 4.18.0-372.32.1.el8_6.x86_64 cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8 huirwang-1128a-s6j6t-worker-westus-1 Ready worker 135m v1.24.6+5658434 10.0.1.5 <none> Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa) 4.18.0-372.32.1.el8_6.x86_64 cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8 huirwang-1128a-s6j6t-worker-westus-2 Ready worker 136m v1.24.6+5658434 10.0.1.4 <none> Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa) 4.18.0-372.32.1.el8_6.x86_64 cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8 % oc get node huirwang-1128a-s6j6t-worker-westus-1 --show-labels NAME STATUS ROLES AGE VERSION LABELS huirwang-1128a-s6j6t-worker-westus-1 Ready worker 136m v1.24.6+5658434 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0 % oc get node huirwang-1128a-s6j6t-worker-westus-2 --show-labels NAME STATUS ROLES AGE VERSION LABELS huirwang-1128a-s6j6t-worker-westus-2 Ready worker 136m v1.24.6+5658434 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0 2. Created two egressIP objects 3.
Actual results:
egressip-47032 was not applied to any egress node % oc get egressip NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS egressip-47032 10.0.1.166 egressip-47034 10.0.1.181 huirwang-1128a-s6j6t-worker-westus-1 10.0.1.181 % oc get cloudprivateipconfig NAME AGE 10.0.1.130 6m25s 10.0.1.138 6m34s 10.0.1.166 6m34s 10.0.1.181 6m25s % oc get cloudprivateipconfig 10.0.1.166 -o yaml apiVersion: cloud.network.openshift.io/v1 kind: CloudPrivateIPConfig metadata: annotations: k8s.ovn.org/egressip-owner-ref: egressip-47032 creationTimestamp: "2022-11-28T10:27:37Z" finalizers: - cloudprivateipconfig.cloud.network.openshift.io/finalizer generation: 1 name: 10.0.1.166 resourceVersion: "87528" uid: 5221075a-35d0-4670-a6a7-ddfc6cbc700b spec: node: huirwang-1128a-s6j6t-worker-westus-1 status: conditions: - lastTransitionTime: "2022-11-28T10:33:29Z" message: 'Error processing cloud assignment request, err: <nil>' observedGeneration: 1 reason: CloudResponseError status: "False" type: Assigned node: huirwang-1128a-s6j6t-worker-westus-1 % oc get cloudprivateipconfig 10.0.1.138 -o yaml apiVersion: cloud.network.openshift.io/v1 kind: CloudPrivateIPConfig metadata: annotations: k8s.ovn.org/egressip-owner-ref: egressip-47032 creationTimestamp: "2022-11-28T10:27:37Z" finalizers: - cloudprivateipconfig.cloud.network.openshift.io/finalizer generation: 1 name: 10.0.1.138 resourceVersion: "87523" uid: e4604e76-64d8-4735-87a2-eb50d28854cc spec: node: huirwang-1128a-s6j6t-worker-westus-2 status: conditions: - lastTransitionTime: "2022-11-28T10:33:29Z" message: 'Error processing cloud assignment request, err: <nil>' observedGeneration: 1 reason: CloudResponseError status: "False" type: Assigned node: huirwang-1128a-s6j6t-worker-westus-2 oc logs cloud-network-config-controller-6f7b994ddc-vhtbp -n openshift-cloud-network-config-controller ....... E1128 10:30:43.590807 1 controller.go:165] error syncing '10.0.1.138': error assigning CloudPrivateIPConfig: "10.0.1.138" to node: "huirwang-1128a-s6j6t-worker-westus-2", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-2_10.0.1.138)."}], requeuing in cloud-private-ip-config workqueue I1128 10:30:44.051422 1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.1.166" will be added to node: "huirwang-1128a-s6j6t-worker-westus-1" E1128 10:30:44.301259 1 controller.go:165] error syncing '10.0.1.166': error assigning CloudPrivateIPConfig: "10.0.1.166" to node: "huirwang-1128a-s6j6t-worker-westus-1", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-1_10.0.1.166)."}], requeuing in cloud-private-ip-config workqueue ..........
Expected results:
EgressIP can be applied successfully.
Additional info:
Description of problem:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2074299 for backporting purposes.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"
Version-Release number of selected component (if applicable): 4.10.22
How reproducible: Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Steps to Reproduce:
Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Actual results: master nodes do come up.
Expected results: master nodes should come up despite that the text "master" is not in their hostname.
Additional info:
Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"
The code for the cluster-baremetal-operator at the following link:
The following condition is concerning:
if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0
The packages reveal that bmh.Name references the name inside the metadata of the BMH object.
Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so, the following slice is never created :
macs = append(macs, bmh.Spec.BootMACAddress)
This is a clone of issue OCPBUGS-6517. The following is the description of the original issue:
—
Description of problem:
When the cluster is configured with Proxy the swift client in the image registry operator is not using the proxy to authenticate with OpenStack, so it's unable to reach the OpenStack API. This issue became evident since recently the support was added to not fallback to cinder in case swift is available[1].
[1]https://github.com/openshift/cluster-image-registry-operator/pull/819
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a cluster with proxy and restricted installation 2. 3.
Actual results:
Expected results:
Additional info:
This fix contains the following changes coming from updated version of kubernetes up to v1.24.10:
Changelog:
v1.24.11: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v12410
v1.24.10: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1249
v1.24.9: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1248
v1.24.8: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1247
v1.24.7: https://github.com/kubernetes/kubernetes/blob/release-1.24/CHANGELOG/CHANGELOG-1.24.md#changelog-since-v1246
OCPBUGS-1251 landed an admin-ack gate in 4.11.z to help admins prepare for Kubernetes 1.25 API removals which are coming in OpenShift 4.12. Poking around in a 4.12.0-ec.2 cluster where APIRemovedInNextReleaseInUse is firing:
$ oc --as system:admin adm must-gather -- /usr/bin/gather_audit_logs $ zgrep -h v1beta1/poddisruptionbudget must-gather.local.1378724704026451055/quay*/audit_logs/kube-apiserver/*.log.gz | jq -r '.verb + " " + (.user | .username + " " + (.extra["authentication.kubernetes.io/pod-name"] | tostr ing))' | sort | uniq -c parse error: Invalid numeric literal at line 29, column 6 28 watch system:serviceaccount:openshift-machine-api:cluster-autoscaler ["cluster-autoscaler-default-5cf997b8d6-ptgg7"]
Finding the source for that container:
$ oc --as system:admin -n openshift-machine-api get -o json pod cluster-autoscaler-default-5cf997b8d6-ptgg7 | jq -r '.status.containerStatuses[].image' quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed $ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed | grep github SOURCE_GIT_URL=https://github.com/openshift/kubernetes-autoscaler io.openshift.build.commit.url=https://github.com/openshift/kubernetes-autoscaler/commit/1dac0311b9842958ec630273428b74703d51c1c9 io.openshift.build.source-location=https://github.com/openshift/kubernetes-autoscaler
Poking about in the source:
$ git clone --depth 30 --branch master https://github.com/openshift/kubernetes-autoscaler.git
$ cd kubernetes-autoscaler
$ find . -name vendor
./addon-resizer/vendor
./cluster-autoscaler/vendor
./vertical-pod-autoscaler/e2e/vendor
./vertical-pod-autoscaler/vendor
Lots of vendoring. I haven't checked to see how new the client code is in the various vendor packages. But the main issue seems to be the v1beta1 in:
$ git grep policy cluster-autoscaler/core cluster-autoscaler/utils | grep policy.*v1beta1 cluster-autoscaler/core/scaledown/actuation/actuator_test.go: policyv1beta1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/actuation/actuator_test.go: eviction := createAction.GetObject().(*policyv1beta1.Eviction) cluster-autoscaler/core/scaledown/actuation/drain.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/actuation/drain_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/legacy/legacy.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/legacy/wrapper.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/scaledown/scaledown.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/core/static_autoscaler_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/drain/drain.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/drain/drain_test.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/kubernetes/listers.go: policyv1 "k8s.io/api/policy/v1beta1" cluster-autoscaler/utils/kubernetes/listers.go: v1policylister "k8s.io/client-go/listers/policy/v1beta1"
The main change from v1beta1 to v1 involves spec.selector; I dunno if that's relevant to the autoscaler use-case or not.
Do we run autoscaler CI? I was poking around a bit, but did not find a 4.12 periodic excercising the autoscaler that might have turned up this alert and issue.
Description of problem:
[4.11.z] Fix kubevirt-console tests
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4489. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4168. The following is the description of the original issue:
—
Description of problem:
Prometheus continuously restarts due to slow WAL replay
Version-Release number of selected component (if applicable):
openshift - 4.11.13
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-5185. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5165. The following is the description of the original issue:
—
Currently, the Dev Sandbox clusters sends the clusterType "OSD" instead of "DEVSANDBOX" because the configuration annotations of the console config are automatically overridden by some SyncSets.
Open Dev Sandbox and browser console and inspect window.SERVER_FLAGS.telemetry
Description of problem:
release-4.11 of openshift/cloud-provider-openstack is missing some commits that were backported in upstream project into the release-1.24 branch. We should import them in our downstream fork.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-6913. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-186. The following is the description of the original issue:
—
Description of problem:
When resizing the browser window, the PipelineRun task status bar would overlap the status text that says "Succeeded" in the screenshot.
Actual results:
Status text is overlapped by the task status bar
Expected results:
Status text breaks to a newline or gets shortened by "..."
Description of problem:
In a complete disconnected cluster, the dev catalog is taking too much time in loading
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. A complete disconnected cluster
2. In add page go to the All services page
3.
Actual results:
Taking too much time too load
Expected results:
Time taken should be reduced
Additional info:
Attached a gif for reference
This is a clone of issue OCPBUGS-268. The following is the description of the original issue:
—
The linux kernel was updated:
https://lkml.org/lkml/2020/3/20/1030
to include steal
accounting
This would greatly assist in troubleshooting vSphere performance issues
caused by over-provisioned ESXi hosts.
This bug is a backport clone of [Bugzilla Bug 2118318](https://bugzilla.redhat.com/show_bug.cgi?id=2118318). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2117569 +++
Description of problem:
The garbage collector resource quota controller must ignore ALL events; otherwise, if a rogue controller or a workload causes unbound event creation, performance will degrade as it has to process the events.
Fix: https://github.com/kubernetes/kubernetes/pull/110939
This bug is to track fix in master (4.12) and also allow to backport to 4.11.1
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
— Additional comment from Michal Fojtik on 2022-08-11 10:52:28 UTC —
I'm using FastFix here as we need to backport this to 4.11.1 to avoid support churn for busy clusters or clusters doing upgrades.
— Additional comment from ART BZ Bot on 2022-08-11 15:13:32 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.
— Additional comment from zhou ying on 2022-08-12 03:03:34 UTC —
checked the payload commit id , the payload 4.12.0-0.nightly-2022-08-11-191750 has container the fixed pr .
oc adm release info registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-08-11-191750 --commit-urls |grep hyperkube
Warning: the default reading order of registry auth file will be changed from "${HOME}/.docker/config.json" to podman registry config locations in the future version. "${HOME}/.docker/config.json" is deprecated, but can still be used for storing credentials as a fallback. See https://github.com/containers/image/blob/main/docs/containers-auth.json.5.md for the order of podman registry config locations.
hyperkube https://github.com/openshift/kubernetes/commit/da80cd038ee5c3b45ba36d4b48b42eb8a74439a3
commit da80cd038ee5c3b45ba36d4b48b42eb8a74439a3 (HEAD -> master, origin/release-4.13, origin/release-4.12, origin/master, origin/HEAD)
Merge: a9d6306a701 055b96e614a
Author: OpenShift Merge Robot <openshift-merge-robot@users.noreply.github.com>
Date: Thu Aug 11 15:13:05 2022 +0000
Merge pull request #1338 from benluddy/openshift-pick-110888
Bug 2117569: UPSTREAM: 110888: feat: fix a bug thaat not all event be ignored by gc controller
This is a clone of issue OCPBUGS-3265. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3172. The following is the description of the original issue:
—
Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".
We shouldn't block operator installation if permission issues prevent dynamic plugin installation.
This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a paired down permission set called "dedicated-admin".
See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD
This is a clone of issue OCPBUGS-3021. The following is the description of the original issue:
—
Description of problem:
me-west1 is not listed in the survey
Steps to Reproduce:
1. run survey (openshift-install create install-config without an install config file) 2. go through prompts until regions 3.
Actual results:
me-west1 region is missing
Expected results:
me-west1 region is listed (and install succeeds in the region)
Description of problem:
When the user installs a helm chart, the dropdown to select a specific version is always disabled. Also for helm charts that can upgraded or downgraded after installation. For example the nodejs helm chart.
Version-Release number of selected component (if applicable):
At least 4.11, maybe all versions, but a backport to 4.11 is fine
How reproducible:
Always
Steps to Reproduce:
1. Switch to developer perspective 2. Navigate to Add > Helm chart 3. Select the Nodejs helm chart 4. Try to select another version .. When the user installs the not selectable version and edit the helm chart there is another version to selec
Actual results:
The version is not selectable.
Expected results:
The version should be selectable.
Additional info:
This is a clone of issue OCPBUGS-4250. The following is the description of the original issue:
—
Description of problem:
Added a script to collect PodNetworkConnectivityChecks to able to view the overall status of the pod network connectivity. Current must-gather collects the contents of `openshift-network-diagnostics` but does not collect the PodNetworkConnectivityCheck.
Version-Release number of selected component (if applicable):
4.12, 4.11, 4.10
This is a clone of issue OCPBUGS-676. The following is the description of the original issue:
—
the machine approver isn't recognizing hostnames that use capital letters as valid even though DNS is case-insensitive
an example of this is in OHSS-14709:
I0822 19:04:51.587266 1 controller.go:114] Reconciling CSR: csr-vdtpv I0822 19:04:51.600941 1 csr_check.go:156] csr-vdtpv: CSR does not appear to be client csr I0822 19:04:51.603648 1 csr_check.go:542] retrieving serving cert from ip-100-66-119-117.ec2.internal (100.66.119.117:10250) I0822 19:04:51.604003 1 csr_check.go:181] Failed to retrieve current serving cert: dial tcp 100.66.119.117:10250: connect: connection refused I0822 19:04:51.604017 1 csr_check.go:201] Falling back to machine-api authorization for ip-100-66-119-117.ec2.internal E0822 19:04:51.604024 1 csr_check.go:392] csr-vdtpv: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com I0822 19:04:51.604033 1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com I0822 19:04:51.606777 1 controller.go:199] csr-vdtpv: CSR not authorized
This can be worked around by manually approving the CSR
The relevant line in the machine approver appears to be here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L378
Description of problem:
This repository is out of sync with the downstream product builds for this component.
One or more images differ from those being used by ART to create product builds. This
should be addressed to ensure that the component's CI testing is accurately
reflecting what customers will experience.
The information within the following ART component metadata is driving this alignment
request: ose-baremetal-installer.yml.
The vast majority of these PRs are opened because a different Golang version is being
used to build the downstream component. ART compiles most components with the version
of Golang being used by the control plane for a given OpenShift release. Exceptions
to this convention (i.e. you believe your component must be compiled with a Golang
version independent from the control plane) must be granted by the OpenShift
architecture team and communicated to the ART team.
Version-Release number of selected component (if applicable): 4.11
Additional info: This issue is needed for the bot-created PR to merge after the 4.11 GA.
Description of problem:
When creating a incomplete ClusterServiceVersion resource the OLM details page crashes (on 4.11).
apiVersion: operators.coreos.com/v1alpha1 kind: ClusterServiceVersion metadata: name: minimal-csv namespace: christoph spec: apiservicedefinitions: owned: - group: A kind: A name: A version: v1 customresourcedefinitions: owned: - kind: B name: B version: v1 displayName: My minimal CSV install: strategy: ''
Version-Release number of selected component (if applicable):
Crashes on 4.8-4.11, work fine from 4.12 onwards.
How reproducible:
Alway
Steps to Reproduce:
1. Apply the ClusterServiceVersion YAML from above
2. Open the Admin perspective > Installed Operator > Operator detail page
Actual results:
Details page crashes on tab A and B.
Expected results:
Page should not crash
Additional info:
Thi is a follow up on https://bugzilla.redhat.com/show_bug.cgi?id=2084287
This is a clone of issue OCPBUGS-6671. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-3228. The following is the description of the original issue:
—
While starting a Pipelinerun using UI, and in the process of providing the values on "Start Pipeline" , the IBM Power Customer (Deepak Shetty from IBM) has tried creating credentials under "Advanced options" with "Image Registry Credentials" (Authenticaion type). When the IBM Customer verified the credentials from Secrets tab (in Workloads) , the secret was found in broken state. Screenshot of the broken secret is attached.
The issue has been observed on OCP4.8, OCP4.9 and OCP4.10.
libovsdb builds transaction log messages for every transaction and then throws them away if the log level is not 4 or above. This wastes a bunch of CPU at scale and increases pod ready latency.
This is a clone of issue OCPBUGS-1704. The following is the description of the original issue:
—
Description of problem:
According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630
How reproducible:
Always, if the Service Usage API is disabled in the GCP project.
Steps to Reproduce:
1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project. 2. Try IPI installation in the GCP project.
Actual results:
The installation would fail finally, without any worker machines launched.
Expected results:
Installation should succeed, or the OCP doc should be updated.
Additional info:
Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. FYI if enabling the API, and without changing anything else, the installation could succeed.
Description of problem:
Manual backport of * https://github.com/openshift/cluster-dns-operator/pull/336 * https://github.com/openshift/cluster-dns-operator/pull/339
Version-Release number of selected component (if applicable):
4.11
This is a clone of issue OCPBUGS-2508. The following is the description of the original issue:
—
Description of problem:
Installer fails due to Neutron policy error when creating Openstack servers for OCP master nodes. $ oc get machines -A NAMESPACE NAME PHASE TYPE REGION ZONE AGE openshift-machine-api ostest-kwtf8-master-0 Running 23h openshift-machine-api ostest-kwtf8-master-1 Running 23h openshift-machine-api ostest-kwtf8-master-2 Running 23h openshift-machine-api ostest-kwtf8-worker-0-g7nrw Provisioning 23h openshift-machine-api ostest-kwtf8-worker-0-lrkvb Provisioning 23h openshift-machine-api ostest-kwtf8-worker-0-vwrsk Provisioning 23h $ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller [...] E1018 10:51:49.355143 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-10-14-023020
How reproducible:
Always
Steps to Reproduce:
1. Install 4.10 within provider networks (in primary or secondary interface)
Actual results:
Installation failure: 4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out
Expected results:
Successful installation
Additional info:
Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.
Description of problem:
Setting disableNetworkDiagnostics: true does not persist when network-operator pod gets re-created.
Version-Release number of selected component (if applicable):
4.11.0-rc.0
How reproducible:
100%
Steps to Reproduce:
1. $ oc patch network.operator.openshift.io/cluster --patch '{"spec":{"disableNetworkDiagnostics":true}}' --type=merge
network.operator.openshift.io/cluster patched
2. $ oc -n openshift-network-operator delete pods network-operator-9b68954c6-bclx6
pod "network-operator-9b68954c6-bclx6" deleted
3. $ oc get network.operator.openshift.io cluster -o json | jq .spec.disableNetworkDiagnostics
false
Actual results:
disableNetworkDiagnostics set to false, not set to true as configured in step 1
Expected results:
disableNetworkDiagnostics set to true
Additional info:
Attaching must-gather.
This is a clone of issue OCPBUGS-8044. The following is the description of the original issue:
—
rhbz#2089199, backported to 4.11.5, shifted the etcd Grafana dashboard from the monitoring operator to the etcd operator. During the shift, the ConfigMap was renamed from grafana-dashboard-etcd to etcd-dashboard. However, we did not include logic for garbage-collecting the obsolete dasboard, so clusters that update from 4.11.1 and similar into 4.11.>=5 or 4.12+ currently end up with both the obsolete and new ConfigMaps. We should grow code to remove the obsolete ConfigMap.
4.11.>=5 and 4.12+ are currently exposed.
100%
1. Install 4.11.1.
2. Update to a release that defines the etcd-dashboard ConfigMap.
3. Check for etcd dashboards with oc -n openshift-config-managed get configmaps | grep etcd.
Both etcd-dashboard and grafana-dashboard-etcd exist:
$ oc -n openshift-config-managed get configmaps | grep etcd etcd-dashboard 1 196d grafana-dashboard-etcd 1 2y282d
Another example is 4.11.1 to 4.11.5 CI:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1570415394001260544/artifacts/e2e-aws-upgrade/configmaps.json | jq -r '.items[].metadata | select(.namespace == "openshift-config-managed" and (.name | contains("etcd"))) | .name' etcd-dashboard grafana-dashboard-etcd
Only etcd-dashboard still exists.
A new manifest for the outgoing ConfigMap that sets the release.openshift.io/delete: "true" annotation would ask the cluster-version operator to reap the obsolete ConfigMap.
In order to delete the correct GCP cloud resources, the "--credentials-requests-dir" parameter must be passed to "ccoctl gcp delete". This was fixed for 4.12 as part of https://github.com/openshift/cloud-credential-operator/pull/489 but must be backported for previous releases. See https://github.com/openshift/cloud-credential-operator/pull/489#issuecomment-1248733205 for discussion regarding this bug.
To reproduce, create GCP infrastructure with a name parameter that is a subset of another set of GCP infrastructure's name parameter. I will "ccoctl gcp create all" with "name=abutcher-gcp" and "name=abutcher-gcp1".
$ ./ccoctl gcp create-all \ --name=abutcher-gcp \ --region=us-central1 \ --project=openshift-hive-dev \ --credentials-requests-dir=./credrequests $ ./ccoctl gcp create-all \ --name=abutcher-gcp1 \ --region=us-central1 \ --project=openshift-hive-dev \ --credentials-requests-dir=./credrequests
Running "ccoctl gcp delete --name=abutcher-gcp" will result in GCP infrastructure for both "abutcher-gcp" and "abutcher-gcp1" being deleted.
$ ./ccoctl gcp delete --name abutcher-gcp --project openshift-hive-dev 2022/10/24 11:30:06 Credentials loaded from file "/home/abutcher/.gcp/osServiceAccount.json" 2022/10/24 11:30:06 Deleted object .well-known/openid-configuration from bucket abutcher-gcp-oidc 2022/10/24 11:30:07 Deleted object keys.json from bucket abutcher-gcp-oidc 2022/10/24 11:30:07 OIDC bucket abutcher-gcp-oidc deleted 2022/10/24 11:30:09 IAM Service account abutcher-gcp-openshift-image-registry-gcs deleted 2022/10/24 11:30:10 IAM Service account abutcher-gcp-openshift-gcp-ccm deleted 2022/10/24 11:30:11 IAM Service account abutcher-gcp1-openshift-cloud-network-config-controller-gcp deleted 2022/10/24 11:30:12 IAM Service account abutcher-gcp-openshift-machine-api-gcp deleted 2022/10/24 11:30:13 IAM Service account abutcher-gcp-openshift-ingress-gcp deleted 2022/10/24 11:30:15 IAM Service account abutcher-gcp-openshift-gcp-pd-csi-driver-operator deleted 2022/10/24 11:30:16 IAM Service account abutcher-gcp1-openshift-ingress-gcp deleted 2022/10/24 11:30:17 IAM Service account abutcher-gcp1-openshift-image-registry-gcs deleted 2022/10/24 11:30:19 IAM Service account abutcher-gcp-cloud-credential-operator-gcp-ro-creds deleted 2022/10/24 11:30:20 IAM Service account abutcher-gcp1-openshift-gcp-pd-csi-driver-operator deleted 2022/10/24 11:30:21 IAM Service account abutcher-gcp1-openshift-gcp-ccm deleted 2022/10/24 11:30:22 IAM Service account abutcher-gcp1-cloud-credential-operator-gcp-ro-creds deleted 2022/10/24 11:30:24 IAM Service account abutcher-gcp1-openshift-machine-api-gcp deleted 2022/10/24 11:30:25 IAM Service account abutcher-gcp-openshift-cloud-network-config-controller-gcp deleted 2022/10/24 11:30:25 Workload identity pool abutcher-gcp deleted
This is a clone of issue OCPBUGS-683. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
{ 4.12.0-0.nightly-2022-08-21-135326 }How reproducible:
Steps to Reproduce:
{See https://bugzilla.redhat.com/show_bug.cgi?id=2118563#c5, The following messages here are "normal" on startup, but it is very misleading with error statement, suggest suppress them or update them to some more clear context that we can know they are in normal process. E0818 02:18:53.709223 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.715530 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.735885 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.775984 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.790449 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.856911 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:53.950782 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.017583 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.271967 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.338944 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.916988 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue E0818 02:18:54.982211 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue}
Actual results:
Expected results:
Additional info:
Description of problem:
AWS tagging - when applying user defined tags you cannot add more than 10
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
[Updated story request]
Decision is to always display Red Hat OpenShift logo for OCP instead of conditionally. And also update the OCP login, errors, providers templates. https://openshift.github.io/oauth-templates/
Related note in comments.
[Original request]
If the ACM or the ACS dynamic plugin is enabled and there is not a custom branding set, then the default "Red Hat Openshift" branding should be shown.
This was identified as an issue during the Hybrid Console Scrum on 11/15/20201
PRs associated with this change
https://github.com/openshift/console/pull/10940 [merged]
https://github.com/openshift/oauth-templates/pull/20 [merged]
https://github.com/openshift/cluster-authentication-operator/pull/540 [merged]
Description of problem:
The test local-test is failing on openshift/thanos when upgrading golang version to 1.18 on the branch release-4.11. Please refer to this test log for details: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_thanos/82/pull-ci-openshift-thanos-release-4.11-test-local/1541516614497734656
Version-Release number of selected component (if applicable):
4.11
How reproducible:
See local-test job on pull request on the repository Openshift/Thanos
Steps to Reproduce:
Actual results:
local-test fails on the following error: level=error ts=2022-06-27T20:28:12.306Z caller=web.go:99 component=web msg="panic while serving request" client=127.0.0.1:37064 url=/api/v1/metadata err="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 278 [running]:\ngithub.com/prometheus/prometheus/web.withStackTracer.func1.1()\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:98 +0x99\npanic({0x1c34760, 0x308ad40})\n\t/usr/lib/golang/src/runtime/panic.go:838 +0x207\nreflect.mapiternext(0xc000458540?)\n\t/usr/lib/golang/src/runtime/map.go:1378 +0x19\ngithub.com/modern-go/reflect2.(*UnsafeMapIterator).UnsafeNext(0x1bd62e0?)\n\t/go/pkg/mod/github.com/modern-go/reflect2@v1.0.1/unsafe_map.go:136 +0x32\ngithub.com/json-iterator/go.(*sortKeysMapEncoder).Encode(0xc000949d10, 0xc0002966b0, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_map.go:297 +0x31a\ngithub.com/json-iterator/go.(*onePtrEncoder).Encode(0xc0008cb120, 0xc000948fc0, 0xc0001139c0?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:219 +0x82\ngithub.com/json-iterator/go.(*Stream).WriteVal(0xc0006c7740, {0x1c16da0, 0xc000948fc0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:98 +0x158\ngithub.com/json-iterator/go.(*dynamicEncoder).Encode(0xc00094cd58?, 0xfa9a07?, 0xc0006c7758?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_dynamic.go:15 +0x39\ngithub.com/json-iterator/go.(*structFieldEncoder).Encode(0xc000949620, 0x1a4aaba?, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_struct_encoder.go:110 +0x56\ngithub.com/json-iterator/go.(*structEncoder).Encode(0xc000949740, 0x0?, 0xc0006c7740)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_struct_encoder.go:158 +0x652\ngithub.com/json-iterator/go.(*OptionalEncoder).Encode(0xc0001afd60?, 0x0?, 0x0?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect_optional.go:74 +0xa4\ngithub.com/json-iterator/go.(*onePtrEncoder).Encode(0xc0008cad40, 0xc0006c76e0, 0xc000949020?)\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:219 +0x82\ngithub.com/json-iterator/go.(*Stream).WriteVal(0xc0006c7740, {0x1ac56e0, 0xc0006c76e0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/reflect.go:98 +0x158\ngithub.com/json-iterator/go.(*frozenConfig).Marshal(0xc0001afd60, {0x1ac56e0, 0xc0006c76e0})\n\t/go/pkg/mod/github.com/json-iterator/go@v1.1.9/config.go:299 +0xc9\ngithub.com/prometheus/prometheus/web/api/v1.(*API).respond(0xc0002d7a40, {0x229a448, 0xc00022bd60}, {0x1c16da0?, 0xc000948fc0}, {0x0?, 0x7fe5a05a5b20?, 0x20?})\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/api/v1/api.go:1437 +0x162\ngithub.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1({0x229a448, 0xc00022bd60}, 0x7fe5982c5300?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/api/v1/api.go:273 +0x20b\nnet/http.HandlerFunc.ServeHTTP(0x7fe5982c5300?, {0x229a448?, 0xc00022bd60?}, 0xc00072b270?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP({{0x2290780?, 0xc000856288?}}, {0x7fe5982c5300?, 0xc00072b270?}, 0x228fb20?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/util/httputil/compression.go:90 +0x69\ngithub.com/prometheus/prometheus/web.(*Handler).testReady.func1({0x7fe5982c5300?, 0xc00072b270?}, 0x7fe5982c5300?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:499 +0x39\nnet/http.HandlerFunc.ServeHTTP(0x7fe5982c5300?, {0x7fe5982c5300?, 0xc00072b270?}, 0x50?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1({0x7fe5982c5300?, 0xc00072b220?}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:196 +0xa5\nnet/http.HandlerFunc.ServeHTTP(0x228fb80?, {0x7fe5982c5300?, 0xc00072b220?}, 0xc000948ed0?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2({0x7fe5982c5300, 0xc00072b220}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:76 +0xa2\nnet/http.HandlerFunc.ServeHTTP(0x22a4a68?, {0x7fe5982c5300?, 0xc00072b220?}, 0x0?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1({0x22a4a68?, 0xc00072b1d0?}, 0xc000250c00)\n\t/go/pkg/mod/github.com/prometheus/client_golang@v1.6.0/prometheus/promhttp/instrument_server.go:100 +0x94\ngithub.com/prometheus/prometheus/web.setPathWithPrefix.func1.1({0x22a4a68, 0xc00072b1d0}, 0xc000250b00)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:1142 +0x290\ngithub.com/prometheus/common/route.(*Router).handle.func1({0x22a4a68, 0xc00072b1d0}, 0xc000250a00, {0x0, 0x0, 0xc00022c364?})\n\t/go/pkg/mod/github.com/prometheus/common@v0.10.0/route/route.go:83 +0x2ae\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc0001cc780, {0x22a4a68, 0xc00072b1d0}, 0xc000250a00)\n\t/go/pkg/mod/github.com/julienschmidt/httprouter@v1.3.0/router.go:387 +0x82b\ngithub.com/prometheus/common/route.(*Router).ServeHTTP(0x8?, {0x22a4a68?, 0xc00072b1d0?}, 0x203000?)\n\t/go/pkg/mod/github.com/prometheus/common@v0.10.0/route/route.go:121 +0x26\nnet/http.StripPrefix.func1({0x22a4a68, 0xc00072b1d0}, 0xc000250900)\n\t/usr/lib/golang/src/net/http/server.go:2127 +0x330\nnet/http.HandlerFunc.ServeHTTP(0x10?, {0x22a4a68?, 0xc00072b1d0?}, 0x7fe5c8423f18?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\nnet/http.(*ServeMux).ServeHTTP(0x413d87?, {0x22a4a68, 0xc00072b1d0}, 0xc000250900)\n\t/usr/lib/golang/src/net/http/server.go:2462 +0x149\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x22a3808?, 0xc000a282a0}, 0xc000250200)\n\t/go/pkg/mod/github.com/opentracing-contrib/go-stdlib@v0.0.0-20190519235532-cf7a6c988dc9/nethttp/server.go:140 +0x662\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x22a3808?, 0xc000a282a0?}, 0xffffffffffffffff?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\ngithub.com/prometheus/prometheus/web.withStackTracer.func1({0x22a3808?, 0xc000a282a0?}, 0xc0008ca850?)\n\t/go/pkg/mod/github.com/prometheus/prometheus@v1.8.2-0.20200724121523-657ba532e42f/web/web.go:103 +0x97\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x22a3808?, 0xc000a282a0?}, 0xc000100000?)\n\t/usr/lib/golang/src/net/http/server.go:2084 +0x2f\nnet/http.serverHandler.ServeHTTP({0xc000c55380?}, {0x22a3808, 0xc000a282a0}, 0xc000250200)\n\t/usr/lib/golang/src/net/http/server.go:2916 +0x43b\nnet/http.(*conn).serve(0xc0000d1540, {0x22a4e18, 0xc00061a0c0})\n\t/usr/lib/golang/src/net/http/server.go:1966 +0x5d7\ncreated by net/http.(*Server).Serve\n\t/usr/lib/golang/src/net/http/server.go:3071 +0x4db\n" level=error ts=2022-06-27T20:28:12.306Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:37064" msg="runtime error: invalid memory address or nil pointer dereference"
Expected results:
local-test does no fail on the error above.
Additional info:
Tracker issue for bootimage bump in 4.11. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-3362.
This is a clone of issue OCPBUGS-7458. The following is the description of the original issue:
—
Description of problem:
- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted. - We had to scale down the Prometheus operator multiple times so that the upgrade is considered as successful. - This fix is temporary. After some time it appears again and Prometheus operator needs to be scaled down and up again. - The issue is present on all clusters in this customer environment which are upgraded to 4.10.41.
Version-Release number of selected component (if applicable):
How reproducible:
N/A, I wasn't able to reproduce the issue.
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The path used by --rotated-pod-logs to gather the rotated pod logs from /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods but not for static pods.
The main problem is that, while normal pods have their rotated logs at this /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME} because the UID cannot be known at the time that the static pod is born (because static pods are created by kubelet before registering them in the kube-apiserver, and UID is assigned by the kube-apiserver).
The visible results of that are:
4.10
Always if there are static pods.
1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).
error: errors occurred while gathering data: one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd [one or more errors occurred while gathering container data for pod etcd-master-0.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net: the server could not find the requested resource]
No errors like the ones above and rotated pod logs to be gathered, if present.
Despite being marked as experimental, this --rotated-pod-logs is used in must-gather, so this issue can be easily reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.
This is a clone of issue OCPBUGS-1237. The following is the description of the original issue:
—
job=pull-ci-openshift-origin-master-e2e-gcp-builds=all
This test has started permafailing on e2e-gcp-builds:
[sig-builds][Feature:Builds][Slow] s2i build with environment file in sources Building from a template should create a image from "test-env-build.json" template and run it in a pod [apigroup:build.openshift.io][apigroup:image.openshift.io]
The error in the test says
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:21 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Successfully pulled image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" in 1.763914719s Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Created: Created container test Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Started: Started container test Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:24 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" already present on machine Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:25 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Unhealthy: Readiness probe failed: Get "http://10.129.2.63:8080/": dial tcp 10.129.2.63:8080: connect: connection refused Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:26 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} BackOff: Back-off restarting failed container
Description of problem:
Looks like a regression was introduced and the Windows networking tests started to fail accross all 4.11 CI jobs for both Linux-to-Windows and Windows-to-Windows pods communication. https://github.com/openshift/ovn-kubernetes/pull/1415 merged without passing Windows networking tests. https://github.com/openshift/windows-machine-config-operator/pull/1359 shows networking tests failing accross all CI jobs.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Trigger a job retest in any of the above mentioned PRs
Actual results:
Windows networking test fail
Expected results:
Windows networking test pass
Additional info:
Direct link to a release-4.11 aws-e2e-operator CI job showing the failed networking test.
Node healthz server was added in 4.13 with https://github.com/openshift/ovn-kubernetes/commit/c8489e3ff9c321e77f265dc9d484ed2549df4a6b and https://github.com/openshift/ovn-kubernetes/commit/9a836e3a547f3464d433ce8b9eef336624d51858. We need to configure it by default on 0.0.0.0:10256 on CNO for ovnk, just like we do for sdn.
As discussed previously by email, customer support case 03211616 requests a means to use the latest patch version of a given X.Y golang via imagestreams, with the blocking issue being the lack of X.Y tags for the go-toolset containers on RHCC.
The latter has now been fixed with the latest version also getting a :1.17 tag, and the imagestream source has been modified accordingly, which will get picked up in 4.12. We can now fix this in 4.11.z by backporting this to the imagestream files bundled in cluster-samples-operator.
/cc Ian Watson Feny Mehta
We have created a fix in 4.12 that fetches instance type information from Azure API instead of updating the lists. We feel that backporting that fix is too risky, but agreed to update the list in older versions.
Description of problem:
Add the following instance types to azure_instance_types list[1]:
Version-Release number of selected component (if applicable):
OCP 4.8
Steps to Reproduce:
1. Migrate worker/infra nodes to above mentioned (missing) v5 instance types
2. "Failed to set autoscaling from zero annotations, instance type unknown"
Actual results:
Expected results:
The new instance types are available in the azure_instance_types list[1] and no errors/warnings are observed after migrating:
Additional info:
The related v4 instance types are already available[1] - I suspect adding the mentioned v5 instance types is a minor update:
1) azure_instance_types.go
https://github.com/openshift/cluster-api-provider-azure/blob/release-4.8/pkg/cloud/azure/actuators/machineset/azure_instance_types.go
Origin tests for the bond-cni
Backport of https://github.com/openshift/origin/pull/27405
Description of problem:
In a 4.11 cluster with only openshift-samples enabled, the 4.12 introduced optional COs console and insights are installed. While upgrading to 4.12, CVO considers them to be disabled explicitly and skips reconciling them. So these COs are not upgraded to 4.12. Installed COs cannot be disabled, so CVO is supposed to implicitly enable them. $ oc get clusterversion -oyaml { "apiVersion": "config.openshift.io/v1", "kind": "ClusterVersion", "metadata": { "creationTimestamp": "2022-09-30T05:02:31Z", "generation": 3, "name": "version", "resourceVersion": "134808", "uid": "bd95473f-ffda-402d-8fe3-74f852a9d6eb" }, "spec": { "capabilities": { "additionalEnabledCapabilities": [ "openshift-samples" ], "baselineCapabilitySet": "None" }, "channel": "stable-4.11", "clusterID": "8eda5167-a730-4b39-be1d-214a80506d34", "desiredUpdate": { "force": true, "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "version": "" } }, "status": { "availableUpdates": null, "capabilities": { "enabledCapabilities": [ "openshift-samples" ], "knownCapabilities": [ "Console", "Insights", "Storage", "baremetal", "marketplace", "openshift-samples" ] }, "conditions": [ { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.11\" channel", "reason": "VersionNotFound", "status": "False", "type": "RetrievedUpdates" }, { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Capabilities match configured spec", "reason": "AsExpected", "status": "False", "type": "ImplicitlyEnabledCapabilities" }, { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"", "reason": "PayloadLoaded", "status": "True", "type": "ReleaseAccepted" }, { "lastTransitionTime": "2022-09-30T05:23:18Z", "message": "Done applying 4.12.0-0.nightly-2022-09-28-204419", "status": "True", "type": "Available" }, { "lastTransitionTime": "2022-09-30T07:05:42Z", "status": "False", "type": "Failing" }, { "lastTransitionTime": "2022-09-30T07:41:53Z", "message": "Cluster version is 4.12.0-0.nightly-2022-09-28-204419", "status": "False", "type": "Progressing" } ], "desired": { "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "version": "4.12.0-0.nightly-2022-09-28-204419" }, "history": [ { "completionTime": "2022-09-30T07:41:53Z", "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "startedTime": "2022-09-30T06:42:01Z", "state": "Completed", "verified": false, "version": "4.12.0-0.nightly-2022-09-28-204419" }, { "completionTime": "2022-09-30T05:23:18Z", "image": "registry.ci.openshift.org/ocp/release@sha256:5a6f6d1bf5c752c75d7554aa927c06b5ea0880b51909e83387ee4d3bca424631", "startedTime": "2022-09-30T05:02:33Z", "state": "Completed", "verified": false, "version": "4.11.0-0.nightly-2022-09-29-191451" } ], "observedGeneration": 3, "versionHash": "CSCJ2fxM_2o=" } } $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.12.0-0.nightly-2022-09-28-204419 True False False 93m cloud-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h56m cloud-credential 4.12.0-0.nightly-2022-09-28-204419 True False False 3h59m cluster-autoscaler 4.12.0-0.nightly-2022-09-28-204419 True False False 3h53m config-operator 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m console 4.11.0-0.nightly-2022-09-29-191451 True False False 3h45m control-plane-machine-set 4.12.0-0.nightly-2022-09-28-204419 True False False 117m csi-snapshot-controller 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m dns 4.12.0-0.nightly-2022-09-28-204419 True False False 3h53m etcd 4.12.0-0.nightly-2022-09-28-204419 True False False 3h52m image-registry 4.12.0-0.nightly-2022-09-28-204419 True False False 3h46m ingress 4.12.0-0.nightly-2022-09-28-204419 True False False 151m insights 4.11.0-0.nightly-2022-09-29-191451 True False False 3h48m kube-apiserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h50m kube-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h51m kube-scheduler 4.12.0-0.nightly-2022-09-28-204419 True False False 3h51m kube-storage-version-migrator 4.12.0-0.nightly-2022-09-28-204419 True False False 91m machine-api 4.12.0-0.nightly-2022-09-28-204419 True False False 3h50m machine-approver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m machine-config 4.12.0-0.nightly-2022-09-28-204419 True False False 3h52m monitoring 4.12.0-0.nightly-2022-09-28-204419 True False False 3h44m network 4.12.0-0.nightly-2022-09-28-204419 True False False 3h55m node-tuning 4.12.0-0.nightly-2022-09-28-204419 True False False 113m openshift-apiserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h48m openshift-controller-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 113m openshift-samples 4.12.0-0.nightly-2022-09-28-204419 True False False 116m operator-lifecycle-manager 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m operator-lifecycle-manager-catalog 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2022-09-28-204419 True False False 3h48m service-ca 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m storage 4.12.0-0.nightly-2022-09-28-204419 True False False 3h54m
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-28-204419
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.11 cluster with only openshift-samples enabled 2. Upgrade to 4.12 3.
Actual results:
The 4.12 introduced optional CO console and insights are not upgraded to 4.12
Expected results:
All the installed COs get upgraded
Additional info:
This is a clone of issue OCPBUGS-3114. The following is the description of the original issue:
—
Description of problem:
When running a Hosted Cluster on Hypershift the cluster-networking-operator never progressed to Available despite all the components being up and running
Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters hypershift operator is quay.io/hypershift/hypershift-operator:4.11 4.11.9 management cluster
How reproducible:
Happened once
Steps to Reproduce:
1. 2. 3.
Actual results:
oc get co network reports False availability
Expected results:
oc get co network reports True availability
Additional info:
This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335
—
Description of problem:
The issue reported here https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occur (tested on OCP 4.8.11, the CU also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17 and 4.9.11)
How reproducible:
Attach a NIC to a master node will trigger the issue
Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")
Actual results:
~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[
,
,
,
{ "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }]
$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.8.11 True False True 31h
$ oc get co etcd -o json | jq ".status.conditions[0]"
{ "lastTransitionTime": "2022-01-26T15:47:42Z", "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]", "reason": "EtcdCertSignerController_Error", "status": "True", "type": "Degraded" }~~~
Expected results:
To have the certificate valid also for the second IP (the newly created one "10.0.187.247")
Additional info:
Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret n openshift-etcd | grep kubernetes.io/tls | grep ^etcd
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s
$ oc get secret n openshift-etcd | grep kubernetes.io/tls | grep ^etcd | awk '
' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted
$ oc get co etcd -o json | jq ".status.conditions[0]"
{ "lastTransitionTime": "2022-01-26T15:52:21Z", "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found", "reason": "AsExpected", "status": "False", "type": "Degraded" }~~~
Created attachment 1905034 [details]
Plugin page with error
Steps to reproduce:
1. Install a plugin with a page that has a runtime error. (Demo Plugin -> Dynamic Nav 1 currently has an error for me, but you can reproduce by editing any plugin and introducing an error.)
2. Observe the "something went wrong" error message.
3. Navigate to any other page (e.g. Workloads -> Pods)
Expected result:
The pods page is displayed.
Action result:
The error message persists. There is no way to clear except to refresh the browser.
Description of problem:
This bugs purpose is to enable a feature backport of https://issues.redhat.com/browse/MON-1949.
This bug is a backport clone of [Bugzilla Bug 2094362](https://bugzilla.redhat.com/show_bug.cgi?id=2094362). The following is the description of the original bug:
—
Description of problem:
A change [1] was introduced to split the kube-apiserver SLO rules into 2 groups to reduce the load on Prometheus (see bug 2004585).
Version-Release number of selected component (if applicable):
4.9 (because the change was backported to 4.9.z)
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.9
2. Retrieve kube-apiserver-slos*
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos -o yaml
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos-basic -o yaml
Actual results:
The KubeAPIErrorBudgetBurn alert with labels
{long="1h",namespace="openshift-kube-apiserver",severity="critical",short="5m"}exists both in kube-apiserver-slos and kube-apiserver-slos-basic.
The alerting rules is evaluated twice. The same is true for recording rules like "apiserver_request:burnrate1h" and in this case, it can trigger warning logs in the Prometheus pods:
> level=warn component="rule manager" group=kube-apiserver.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=283
Expected results:
I presume that kube-apiserver-slos shouldn't exist since it's been replaced by kube-apiserver-slos-basic and kube-apiserver-slos-extended.
Additional info:
Discovered while investigating bug 2091902
Description of problem:
This issue exists to drive the backport process of https://github.com/openshift/api/pull/1313
According to the Kubernetes documentation, starting from Kubernetes 1.22, the service-account-issuer flag can be specified multiple times. The first value is then used to generate new tokens and other values are accepted. Using this field can prevent cluster disruptions and allows for smoother reconfiguration of this field.
The status field will allow us to keep track of "used" service account issuers and also expire/prune them.
this is a replacement for: #1309
xref: https://issues.redhat.com/browse/AUTH-309
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4072. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-4026. The following is the description of the original issue:
—
Description of problem:
There is an endless re-render loop and a browser feels slow to stuck when opening the add page or the topology.
Saw also endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
Version-Release number of selected component (if applicable):
1. Console UI 4.12-4.13 (master)
2. Service Binding Operator (tested with 1.3.1)
How reproducible:
Always with installed SBO
But the "stuck feeling" depends on the browser (Firefox feels more stuck) and your locale machine power
Steps to Reproduce:
1. Install Service Binding Operator
2. Create or update the BindableKinds resource "bindable-kinds"
apiVersion: binding.operators.coreos.com/v1alpha1 kind: BindableKinds metadata: name: bindable-kinds
3. Open the browser console log
4. Open the console UI and navigate to the add page
Actual results:
1. Saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
2. Browser feels slow and get stuck after some time
3. The page crashs after some time
Expected results:
1. The API call should be called just once
2. The add page should just work without feeling laggy
3. No crash
Additional info:
Get introduced after we watching the bindable-kinds resource with https://github.com/openshift/console/pull/11161
It looks like this happen only if the SBO is installed and the bindable-kinds resource exist, but doesn't contain any status.
The status list all available bindable resource types. I could not reproduce this by installing and uninstalling an operator, but you can manually create or update this resource as mentioned above.
Description of problem:
During restart egress firewall acls will be deleted and re-created from scratch, meaning that egress firewall rules won't be applied for some time during restart
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-2895. The following is the description of the original issue:
—
Description of problem:
Current validation will not accept Resource Groups or DiskEncryptionSets which have upper-case letters.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Attempt to create a cluster/machineset using a DiskEncryptionSet with an RG or Name with upper-case letters
Steps to Reproduce:
1. Create cluster with DiskEncryptionSet with upper-case letters in DES name or in Resource Group name
Actual results:
See error message: encountered error: [controlPlane.platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format, compute[0].platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format]
Expected results:
Create a cluster/machineset using the existing and valid DiskEncryptionSet
Additional info:
I have submitted a PR for this already, but it needs to be reviewed and backported to 4.11: https://github.com/openshift/installer/pull/6513
This is a clone of issue OCPBUGS-4422. The following is the description of the original issue:
—
This bug is a backport clone of [Bugzilla Bug 2050230](https://bugzilla.redhat.com/show_bug.cgi?id=2050230). The following is the description of the original bug:
—
Description of problem:
In a large cluster, sdn daemonset can DoS the kube-apiserver with un-paginated LIST calls on high count resources.
Version-Release number of selected component (if applicable):
How reproducible:
NA
Steps to Reproduce:
NA
Actual results:
Kube API Server and Openshift API Server in one of the cluster keeps restarting, without proper exception. The cluster is not accessible.
Expected results:
Kube API Server and Openshift API Server should be stable.
Additional info:
We will want to establish some basic metrics we can report back to Telemetry.
Let's consider:
Below is some background info from MTC when we added Telemetry support that may help
See: https://github.com/konveyor/metrics-queries/blob/master/README.md
Design/Development info:
OpenShift Monitoring Integration Guide
Monitoring integration with OLM operators
https://www.openshift.com/blog/observability-superpower-correlation
Source Code:
https://github.com/konveyor/mig-controller/blob/master/pkg/controller/migmigration/metrics.go
Description of problem:
Users search a resource (for example, Pod) with Name filter applied and input a text to the filter field then the search results filtered accordingly.
Once the results are shown, when the user clear the value in one-shot (i.e. select whole filter text from the field and clear it using delete or backspace key) from the field,
then the search result doesn't clear accordingly and the previous result stays on the page.
Version-Release number of selected components (if applicable):
4.11.0-0.nightly-2022-08-16-194731 & works fine with OCP 4.12 latest version.
How reproducible:
Always
Steps to Reproduce:
Actual results:
Search result doesn't clear when user clears name filter in one-shot for any resources.
Expected results:
Search results should clear when the user clears name filters in one-shot for any resources.
Additional info:
Reproduced in both chrome[103.0.5060.114 (Official Build) (64-bit)] and firefox[91.11.0esr (64-bit)] browsers.
Attached screen share for the same issue. SearchIssues.mp4
This is a clone of issue OCPBUGS-262. The following is the description of the original issue:
—
github rate limit failures for upi image downloading govc.
This is a clone of issue OCPBUGS-3824. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-2598. The following is the description of the original issue:
—
Description of problem:
Liveness probe of ipsec pods fail with large clusters. Currently the command that is executed in the ipsec container is ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status The problem is with command "ipsec/status". In clusters with high node count this command will return a list with all the node daemons of the cluster. This means that as the node count raises the completion time of the command raises too.
This makes the main command
ovs-appctl -t ovs-monitor-ipsec
To hang until the subcommand is finished.
As the liveness and readiness probe values are hardcoded in the manifest of the ipsec container herehttps//github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml] the liveness timeout of the container probe of 60 seconds start to be insufficient as the node count list is growing. This resulted in a cluster with 170 + nodes to have 15+ ipsec pods in a crashloopbackoff state.
Version-Release number of selected component (if applicable):
Openshift Container Platform 4.10 but i think the same will be visible to other versions too.
How reproducible:
I was not able to reproduce due to an extreamely high amount of resources are needed and i think that there is no point as we have spotted the issue.
Steps to Reproduce:
1. Install an Openshift cluster with IPSEC enabled 2. Scale to 170+ nodes or more 3. Notice that the ipsec pods will start getting in a Crashloopbackoff state with failed Liveness/Readiness probes.
Actual results:
Ip Sec pods are stuck in a Crashloopbackoff state
Expected results:
Ip Sec pods to work normally
Additional info:
We have provided a workaround where CVO and CNO operators are scaled to 0 replicas in order for us to be able to increase the liveness probe limit to a value of 600 that recovered the cluster. As a next step the customer will try to reduce the node count and restore the default liveness timeout value along with bringing the operators back to see if the cluster will stabilize.
Description of problem:
When running node-density (245 pods/node) on a 120 node cluster, we see that there is a huge spike (~22s) in Avg pod-latency. When the spike occurs we see all the ovnkube-master pods go through a restart.
The restart happens because of (ovnkube-master pods)
2022-08-10T04:04:44.494945179Z panic: reflect: call of reflect.Value.Len on ptr Value
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-09-114621
How reproducible:
Steps to Reproduce:
1. Run node-density on a 120 node cluster
Actual results:
Spike observed in pod-latency graph ~22s
Expected results:
Steady pod-latency graph ~4s
Additional info:
This is a clone of issue OCPBUGS-1417. The following is the description of the original issue:
—
Description of problem:
Egress IP is not being assigned to primary interface of node as per hostsubnet definition. The issue being observed at an Openshift cluster hosted on Disconnected AWS environment. Following steps were performed at AWS end: - Disconnected VPC was created and installation of Openshift was done as per documentation. - Elastic IP could not be used as it is a disconnected environment. Customer identified a free IP from same subnet as the node and modified interface of the node to add a secondary IP. It seems cloud.network.openshift.io/egress-ipconfig annotation is need on the node to attach IP to primary interface but its missing. From SDN POD log on the same node I could see its complaining about 'an incomplete annotation "cloud.network.openshift.io/egress-ipconfig"'. Will share more details over comments.
Version-Release number of selected component (if applicable):
Openshift 4.10.28
How reproducible:
Always
Steps to Reproduce:
1. Create a disconnected environment on AWS 2. find a free IP from subnet where a worker node is hosted and add that as secondary IP to NIC of that node. 3. Configure hostsubnet and netnamespace on Openshift cluster
Actual results:
- Eress IP is not being attached to primary interface of node for which hostsubnet has been configured
Expected results:
- Egress IP should get configured without any issue.
Additional info:
This is a clone of issue OCPBUGS-451. The following is the description of the original issue:
—
Description of problem:
Git icon shown in the repository details page should be based on the git provider.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Create a Repository with gitlab repo url
2. Navigate to the detail page.
Actual results:
github icon is displayed for the gitlab url.
Expected results:
gitlab icon should be displayed for the gitlab url.
Additional info:
use `GitLabIcon` and `BitBucketIcon` from patternfly react-icons.
Description of problem: defined in https://bugzilla.redhat.com/show_bug.cgi?id=2051533
When adding remote worker node using ZTP the agent finishes the installation and is marked as done. oc get agent -o wide NAME CLUSTER APPROVED ROLE STAGE HOSTNAME REQUESTED HOSTNAME 0277804e-2a7c-4d95-9d0f-e22a190d582a spoke-0 true worker Done spoke-worker-0-0.spoke-0.qe.lab.redhat.com spoke-worker-0-0 12efa520-5b99-4474-805d-931e46ad43f7 spoke-0 true master Done spoke-master-0-2.spoke-0.qe.lab.redhat.com spoke-master-0-2 3b8eec89-f26f-4896-8f71-8a810894c560 spoke-0 true master Done spoke-master-0-0.spoke-0.qe.lab.redhat.com spoke-master-0-0 3fb3749e-c132-4258-ad1a-08a0445c9022 spoke-0 true worker Done spoke-worker-0-1.spoke-0.qe.lab.redhat.com spoke-worker-0-1 728559e9-5543-41d9-adb0-e58196f765af spoke-0 true master Done spoke-master-0-1.spoke-0.qe.lab.redhat.com spoke-master-0-1 982e1ff6-6e83-4800-b061-8cdfd0b844fb spoke-0 true worker Done spoke-rwn-0-1.spoke-rwn-0.qe.lab.redhat.com spoke-rwn-0-1 a76eaa6a-b351-429f-bfa1-e53a70503573 spoke-0 true worker Done spoke-rwn-0-0.spoke-rwn-0.qe.lab.redhat.com spoke-rwn-0-0 Logging into the spoke cluster the bmh and machine resources are created and the node resource is not: oc get bmh -n openshift-machine-api NAME STATE CONSUMER ONLINE ERROR AGE spoke-master-0-0 unmanaged spoke-0-pxbfh-master-0 true 3h32m spoke-master-0-1 unmanaged spoke-0-pxbfh-master-1 true 3h32m spoke-master-0-2 unmanaged spoke-0-pxbfh-master-2 true 3h32m spoke-rwn-0-0-bmh externally provisioned spoke-0-spoke-rwn-0-0-bmh true provisioned registration error 168m spoke-rwn-0-1-bmh externally provisioned spoke-0-spoke-rwn-0-1-bmh true provisioned registration error 168m spoke-worker-0-0 unmanaged spoke-0-pxbfh-worker-0-65mrb true 3h32m spoke-worker-0-1 unmanaged spoke-0-pxbfh-worker-0-nnmcq true 3h32m oc get machine -n openshift-machine-api NAME PHASE TYPE REGION ZONE AGE spoke-0-pxbfh-master-0 Running 3h33m spoke-0-pxbfh-master-1 Running 3h33m spoke-0-pxbfh-master-2 Running 3h33m spoke-0-pxbfh-worker-0-65mrb Running 3h19m spoke-0-pxbfh-worker-0-nnmcq Running 3h20m spoke-0-spoke-rwn-0-0-bmh Provisioned 169m spoke-0-spoke-rwn-0-1-bmh Provisioned 169m Note: bmh is in error state: Normal ProvisionedRegistrationError 30m metal3-baremetal-controller Host adoption failed: Error while attempting to adopt node 529b3e75-5d04-4486-9296-269081d0ec02: Error validating Redfish virtual media. Some parameters were missing in node's driver_info. Missing are: ['deploy_kernel', 'deploy_ramdisk']. oc get nodes NAME STATUS ROLES AGE VERSION spoke-master-0-0.spoke-0.qe.lab.redhat.com Ready master 72m v1.22.3+2cb6068 spoke-master-0-1.spoke-0.qe.lab.redhat.com Ready master 50m v1.22.3+2cb6068 spoke-master-0-2.spoke-0.qe.lab.redhat.com Ready master 72m v1.22.3+2cb6068 spoke-worker-0-0.spoke-0.qe.lab.redhat.com Ready worker 51m v1.22.3+2cb6068 spoke-worker-0-1.spoke-0.qe.lab.redhat.com Ready worker 51m v1.22.3+2cb6068 node-bootstrapper CSR is created but not auto-approved; periodically another node-strapper csr is created until it is manually approved: oc get csr | grep Pending csr-5ll2g 9m9s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-f8vbl 8m24s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
Version-Release number of selected component (if applicable):
assisted-service master at revision af0bafb3f7f629932f8c3dc31ccddedfe6984926 ocp version: 4.10.0-rc.1
How reproducible:
1. Install remote worker node using ztp 2. Wait for node resource to be created
Steps to Reproduce:
1. Install remote worker node using ztp 2. Wait for node resource to be created
Actual results:
node-bootstrapper and node CSR are not auto-approved and node resource is not created. The bmh resource remains in registration error
Expected results:
node-bootstrapper and node CSR should be auto-approved and node resource created. The bmh resource should not be in registration error
Additional info:
This is a clone of issue OCPBUGS-1717. The following is the description of the original issue:
—
Description of problem:
Image registry pods panic while deploying OCP in me-central-1 AWS region
Version-Release number of selected component (if applicable):
4.11.2
How reproducible:
Deploy OCP in AWS me-central-1 region
Steps to Reproduce:
Deploy OCP in AWS me-central-1 region
Actual results:
panic: Invalid region provided: me-central-1
Expected results:
Image registry pods should come up with no errors
Additional info:
This is a clone of issue OCPBUGS-4486. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-95. The following is the description of the original issue:
—
In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.
The bug is 100% reproducible.
Steps for reproducing the issue are:
1. Install a cluster with OpenShiftSDN network plugin.
2. Configure egressip for a project.
3. Install NMstate operator.
4. Create a NodeNetworkConfigurationPolicy.
5. Identify on which node the egressIP is present.
6. Restart the nmstate-handler pod running on the identified node.
7. Verify that the egressIP is no more present.
Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.
This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.
The expectation is that NMstate operator should not interfere with SDN configuration.
This is a clone of issue OCPBUGS-6766. The following is the description of the original issue:
—
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2083087 (OCPBUGSM-44070) to backport this issue.
Description of problem:
"Delete dependent objects of this resource" is a bit of confusing for some users because when creating the Application in Dev console not only the deployment but also IS, route, svc, secret objects will be created as well. When deleting the Application (in fact it is deployment), there is an option called "Delete dependent objects of this resource" and some users might think this means the IS, route, svc and any other objects which are created alongside with the deployment will be deleted as well
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Always
Steps to Reproduce:
1. Create Application in Dev console
2. Delete the deployment
3. Check "Delete dependent objects of this resource"
Actual results:
Only deployment will be deleted and IS, svc, route will not be deleted
Expected results:
We either change the description of this option, or we really delete IS, svc, route and any other objects created under this Application.
Additional info:
This is a clone of issue OCPBUGS-8491. The following is the description of the original issue:
—
Description of problem:
Image registry pods panic while deploying OCP in ap-southeast-4 AWS region
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Deploy OCP in AWS ap-southeast-4 region
Steps to Reproduce:
Deploy OCP in AWS ap-southeast-4 region
Actual results:
panic: Invalid region provided: ap-southeast-4
Expected results:
Image registry pods should come up with no errors
Additional info:
We're seeing a slight uptick in how long upgrades are taking[1][2]. We are not 100% sure of the cause, but it looks like it started with 4.11 rc.7. There's no obvious culprits in the diff[3].
Looking at some of the jobs, we are seeing the gaps between kube-scheduler being updated and then machine-api appear to take longer. Example job run[4] showing 10+ minutes waiting for it.
TRT had a debugging session, and we have two suggestions:
[1] https://search.ci.openshift.org/graph/metrics?metric=job%3Aduration%3Atotal%3Aseconds&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade
[2] https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%2075.00%20minutes
[3] https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.11.0-rc.7
[4] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade/1556865989923049472
Description of problem:
When adding new nodes to the existing cluster, the newly allocated node-subnet can be overlapped with the existing node.
Version-Release number of selected component (if applicable):
openshift 4.10.30
How reproducible:
It's quite hard to reproduce but there is a possibility it can happen any time.
Steps to Reproduce:
1. Create a OVN dual-stack cluster 2. add nodes to the existing cluster 3. check the allocated node subnet
Actual results:
Some newly added nodes have the same node-subnet and ovn-k8s-mp0 IP as some existing nodes.
Expected results:
Should have duplicated node-subnet and ovn-k8s-mp0 IP
Additional info:
Additional info can be found at the case 03329155 and the must-gather attached(comment #1) % omg logs ovnkube-master-v8crc -n openshift-ovn-kubernetes -c ovnkube-master | grep '2022-09-30T06:42:50.857' 2022-09-30T06:42:50.857031565Z W0930 06:42:50.857020 1 master.go:1422] Did not find any logical switches with other-config 2022-09-30T06:42:50.857112441Z I0930 06:42:50.857099 1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node worker01.ss1.samsung.local 2022-09-30T06:42:50.857122455Z I0930 06:42:50.857105 1 master.go:1003] Allocated Subnets [10.129.4.0/23 fd02:0:0:a::/64] on Node oam04.ss1.samsung.local 2022-09-30T06:42:50.857130289Z I0930 06:42:50.857122 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node worker01.ss1.samsung.local 2022-09-30T06:42:50.857140773Z I0930 06:42:50.857132 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.129.4.0/23","fd02:0:0:a::/64"]}] on node oam04.ss1.samsung.local 2022-09-30T06:42:50.857166726Z I0930 06:42:50.857156 1 master.go:1003] Allocated Subnets [10.128.2.0/23 fd02:0:0:5::/64] on Node oam01.ss1.samsung.local 2022-09-30T06:42:50.857176132Z I0930 06:42:50.857157 1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node rhel01.ss1.samsung.local 2022-09-30T06:42:50.857176132Z I0930 06:42:50.857167 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.2.0/23","fd02:0:0:5::/64"]}] on node oam01.ss1.samsung.local 2022-09-30T06:42:50.857185257Z I0930 06:42:50.857157 1 master.go:1003] Allocated Subnets [10.128.6.0/23 fd02:0:0:d::/64] on Node call03.ss1.samsung.local 2022-09-30T06:42:50.857192996Z I0930 06:42:50.857183 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node rhel01.ss1.samsung.local 2022-09-30T06:42:50.857200017Z I0930 06:42:50.857190 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.6.0/23","fd02:0:0:d::/64"]}] on node call03.ss1.samsung.local 2022-09-30T06:42:50.857282717Z I0930 06:42:50.857258 1 master.go:1003] Allocated Subnets [10.130.2.0/23 fd02:0:0:7::/64] on Node call01.ss1.samsung.local 2022-09-30T06:42:50.857304886Z I0930 06:42:50.857293 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.130.2.0/23","fd02:0:0:7::/64"]}] on node call01.ss1.samsung.local 2022-09-30T06:42:50.857338896Z I0930 06:42:50.857314 1 master.go:1003] Allocated Subnets [10.128.4.0/23 fd02:0:0:9::/64] on Node f501.ss1.samsung.local 2022-09-30T06:42:50.857349485Z I0930 06:42:50.857329 1 master.go:1003] Allocated Subnets [10.131.2.0/23 fd02:0:0:8::/64] on Node call02.ss1.samsung.local 2022-09-30T06:42:50.857371344Z I0930 06:42:50.857354 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.4.0/23","fd02:0:0:9::/64"]}] on node f501.ss1.samsung.local 2022-09-30T06:42:50.857371344Z I0930 06:42:50.857361 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.2.0/23","fd02:0:0:8::/64"]}] on node call02.ss1.samsung.local
Description of problem:
With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with a 730 network policies affecting a pod, and issuing 7 pod updates would result in over 5k transactions.
This is a clone of issue OCPBUGS-5761. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5458. The following is the description of the original issue:
—
reported in https://coreos.slack.com/archives/C027U68LP/p1673010878672479
Description of problem:
Hey guys, I have a openshift cluster that was upgraded to version 4.9.58 from version 4.8. After the upgrade was done, the etcd pod on master1 isn't coming up and is crashlooping. and it gives the following error: {"level":"fatal","ts":"2023-01-06T12:12:58.709Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 13279, fileSize(313430016) - offset(313418480) - padBytes(1) = entryLimit(11535)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/main.go:40\nmain.main\n\t/remote-source/cachito-gomod-with-deps/app/server/main.go:32\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-7474. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-6714. The following is the description of the original issue:
—
Description of problem:
Traffic from egress IPs was interrupted after Cluster patch to Openshift 4.10.46
a customer cluster was patched. It is an Openshift 4.10.46 cluster with SDN.
More description about issue is available in private comment below since it contains customer data.
This is a clone of issue OCPBUGS-1130. The following is the description of the original issue:
—
The test results in sippy look really bad on our less common platforms, but still pretty unacceptable even on core clouds. It's reasonably often the only test that fails. We need to decide what to do here, and we're going to need input from the etcd team.
As of Sep 13th:
Even on some major variant combos, a 4-8% failure rate is too high.
On Sep 13 arch call (no etcd present), Damien mentioned this might be an upstream alert that just isn't well suited for OpenShift's use cases, is this the case and it needs tuning?
Has the problem been getting worse?
I believe this link https://datastudio.google.com/s/urkKwmmzvgo indicates that this may be the case for 4.12, AWS and Azure are both getting worse in ways that I don't see if we change the release to 4.11 where it looks consistent. gcp seems fine on 4.12. We do not have data for vsphere for some reason.
This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
At a glance: LeaseGrant, MemberList, Txn, Status, Range.
Broken out of TRT-401
For linking with sippy:
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]
Description of problem:
Cannot scale up worker node have deploying OCP 4.11.1 cluster via UPI on Azure
5h2m Warning FailedCreate machine/pokus-2knkh-worker-northeurope1-f6kc4 InvalidConfiguration: failed to reconcile machine "pokus-2knkh-worker-northeurope1-f6kc4": failed to create vm pokus-2knkh-worker-northeurope1-f6kc4: failure sending request for machine pokus-2knkh-worker-northeurope1-f6kc4: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=404 - Original Error: Code="NotFound" Message="The Image '/subscriptions/e639e479-2737-4b3d-b338-f1928f6429a1/resourceGroups/mlpipe-2163-azpln-rg/providers/Microsoft.Compute/images/pokus-2knkh-gen2' cannot be found in 'northeurope' region."
Customer would like to have the installer create machineset from the inital installation, therefore Kubernetes manifest files that define the worker machines were not removed during the installation.
Highlights:
Can I please let help verifying if these are the correct steps to have the initial installation created and manage the worker machines?Is there an explanation on how changing the image to -gen2 in [concat(parameters('baseName'),'-gen2')] from the 02_storage.json template can resolve the problem?
Version-Release number of selected component (if applicable):
Environment:
OCP 4.11.1 UPI install on Azure using ARM
VM size:
bootstrap: Standard_D4s_v3
master: Standard_D4s_v3
How reproducible:
Always
Steps to Reproduce:
Following the step described in the document: Installing a cluster on Azure using ARM templates .
In the install-config.yaml, worker replicas was set to 0
compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: {} replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: {} replicas: 3
After creating the manifests described in this step: Creating the Kubernetes manifest and Ignition config files only control plane machines manifests were removed, worker machines manifests remain untouchedAfter three masters and three worker nodes were created by ARM templates, additional worker were added using machine sets via command
oc scale --replicas=1 machineset cluster-g7rzv-worker-francecentral1 -n openshift-machine-api`
Actual results:
No addition node visible from `oc get nodes` and the following error occur:
5h2m Warning FailedCreate machine/pokus-2knkh-worker-northeurope1-f6kc4 InvalidConfiguration: failed to reconcile machine "pokus-2knkh-worker-northeurope1-f6kc4": failed to create vm pokus-2knkh-worker-northeurope1-f6kc4: failure sending request for machine pokus-2knkh-worker-northeurope1-f6kc4: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=404 - Original Error: Code="NotFound" Message="The Image '/subscriptions/e639e479-2737-4b3d-b338-f1928f6429a1/resourceGroups/mlpipe-2163-azpln-rg/providers/Microsoft.Compute/images/pokus-2knkh-gen2' cannot be found in 'northeurope' region."
The customer found out that this can be resolved if changing the -image to -gen2 in [concat(parameters('baseName'),'-gen2')] from the 02_storage.json template
Expected results:
The installer should be able to create and manage machineset
Additional info:
SFDC case #03304526
Slack discussion, might due to MAO not able to support UPI in Azure Thread1, Thread2