Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Currently the "Get started with on-premise host inventory" quickstart is delivered in the Core console. If we are going to keep it there, we need to add the MCE or ACM operator as a prerequisite; otherwise it's very confusing.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Customers typically run more than one cluster and/or applications deployed across different regions. In such a hybrid cloud environment, aggregating metrics is a key requirement to avoid admins and application owners having to drop into individual clusters to troubleshoot specific problems. And since Red Hat does not offer a standalone metrics aggregation service, customers have started to use existing, home-grown technologies based on, for example, InfluxDB or Kafka to achieve that.
In summary:
Expose Prometheus remote-write configuration via our OpenShift Monitoring (Cluster and User Workload) ConfigMap to allow customers to push time-series data to a remote location.
Please note that we do not plan to support certain third party “receivers” with this solution. Customers will be responsible for ensuring that an appropriate receiving component implementing the “remote-write” API is up and running. Here is a list of possible “receiver” plugins.
User configures one of the available ConfigMaps to allow node_cpu_seconds_total to be written into a remote Thanos system.
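As a rough sketch, the use case above could look like the following cluster monitoring ConfigMap (the receiver URL and the exact relabel rule are illustrative assumptions, not a committed API):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        # Hypothetical Thanos receive endpoint; the customer supplies their own receiver.
        - url: "https://thanos-receive.example.com/api/v1/receive"
          writeRelabelConfigs:
            # Only ship node_cpu_seconds_total to the remote system.
            - sourceLabels: [__name__]
              regex: node_cpu_seconds_total
              action: keep

An equivalent entry in the user workload monitoring ConfigMap would cover the UWM case.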
Remote write allows replicating time-series data to a remote location. This is important for several scenarios, for example when you want to use "remote-write enabled" systems (e.g. InfluxDB) for long-term storage and historical analysis, as well as for aggregating metrics across multiple clusters.
Currently, remote-write is in an experimental stage in Prometheus[1], but the chances are high that it will become stable some time this year. Furthermore, we already use remote-write fairly extensively for Telemetry, and will for ACM in the near future. With that in mind, we think we are in a perfect spot to move what we already have[2] from dev preview to at least tech preview.
[1] https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec - "If specified, the remote_write spec. This is an experimental feature, it may change in any upcoming release in a breaking way." The experimental flag was removed.
We'll want to give users the option to add remote_write configs to both cluster monitoring and user workload monitoring (UWM).
AC:
As a cluster administrator,
I want OpenShift to include a recent CoreDNS version,
so that I have the latest available performance and security fixes.
We should strive to follow upstream CoreDNS releases by bumping openshift/coredns with every OpenShift 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgently needed change necessitates bumping CoreDNS to the latest upstream release. This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.
For OpenShift 4.9, this means bumping from CoreDNS 1.8.1 to 1.8.3, or possibly a later release should one ship before we do the bump.
Note that CoreDNS upstream does not maintain release branches (that is, once a new CoreDNS version is released, there will be no further 1.8.z releases), so we may be better off updating to 1.9 as soon as it is released, rather than staying on the 1.8 series, which would then be unmaintained.
We may consider bumping CoreDNS again during the OpenShift 4.9 release cycle if upstream ships additional releases during the 4.9 development cycle. However, we will need to weigh the risks and available remaining soak time in the release schedule before doing so, should that contingency arise.
As an OpenShift administrator, I would like a solution that allows me to upgrade from one EUS version to another with very few steps and minimal disruption to application workloads, while still allowing new application services to be deployed.
Functional requirements break down into the following prioritized list:
Non-Functional Requirements
Requirement | Notes | isMvp? |
---|---|---|
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Documentation | This is a requirement for ALL end user facing features | YES |
Questions to be addressed:
EUS to EUS Focus Area Discussion: https://docs.google.com/document/d/17I1Wd7-R1wRxmboyv1jUFHFkqQcBTorJccdGi1ZqjQE/edit?usp=sharing
EUS Feature: https://issues.redhat.com/browse/OCPPLAN-5484
Use case
As an Admin, one of my operators says it can't be upgraded. An action is required, as I will be unable to upgrade to a .y minor release until I fix the problem.
Possible Design Solution
Create a message saying you can upgrade to .z patch releases even when one of your cluster operators says it's not upgradeable.
Ideally, the message string on the condition explains what the admin needs to resolve, and until they resolve the issue they can only update within their current z-stream.
Questions
Need to do a little R&D to find out when this happens and what happens when you're in this state.
Designs (WIP)
Doc: https://docs.google.com/document/d/1iUZlHbv5nTYtb7Cq4rn_bYPqD4Jtie59xIogxN-2Eyc/edit#heading=h.5eoflxvaj1m4
Configure audit logging to capture login, logout and login failure details
TODO(PM): update this
A customer needs login, logout and login failure details inside OpenShift Container Platform.
I have checked for this on my test cluster, but the audit logs do not contain any user name specifying login or logout details. For successful logins or logouts, on the CLI and in the OpenShift console we only see 'Login successful' or 'Invalid credentials'.
Expected results: Login, logout and login failures should be captured in audit logging.
The apiserver pods today have `/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver-specific, but since OAuth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make the oauth-server create /var/log/oauth-server/audit.log files on the master nodes using that format.
When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.
Right now there's no way to generate audit logs from this.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
This will allow cipher customization work to be completed.
OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Early customer feedback is that they see SNO as a great solution covering smaller-footprint deployments, but they are wondering what evolution story OpenShift will provide when more capacity or high availability is needed in the future.
While migration tooling (moving workload/config to a new cluster) could be a mid-term solution, customers do not want extra hardware to be involved in this process.
Telecommunications Providers at the Far Edge intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features are delivered to the Far Edge, and the providers want their Far Edge deployments to be able to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO, where there is a robust cluster and workloads use HPA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This is a ticket meant to track all the OCP PRs that are involved in the implementation of the SNO + workers enhancement.
In the console-operator repo we need to add the `capability.openshift.io/console` annotation to all the manifests that the operator either contains or creates on the fly.
Manifests are currently present in /bindata and /manifest directories.
Here is an example of the insights-operator change.
Here is the overall enhancement doc.
We need to continue to maintain specific areas within storage; this epic captures that effort and tracks it across releases.
Goals
Requirements
Requirement | Notes | isMvp? |
---|---|---|
Telemetry | No | |
Certification | No | |
API metrics | No | |
Out of Scope
n/a
Background, and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets/BZs low
Assumptions
Customer Considerations
Documentation Considerations
Notes
In progress:
High prio:
Unsorted
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). We are trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all CSI sidecars to the latest upstream release.
This includes updating the VolumeSnapshot CRDs in https://github.com/openshift/cluster-csi-snapshot-controller-operator/tree/master/assets
OpenShift console supports new features and an elevated experience for Operator Lifecycle Manager (OLM) Operators and Cluster Operators.
The OCP Console improves the controls and visibility for managing vendor-provided software in customers' infrastructure and makes these solutions available to customers' internal users.
To achieve this, we want to make sure OLM's and Cluster Operators' new features are exposed in the console so that admin console users can benefit from them.
Requirement | Notes | isMvp? |
---|---|---|
OCP console supports the latest OLM APIs and features | This is a requirement for ALL features. | YES |
OCP console improves visibility to Cluster Operators related resources and features. | This is a requirement for ALL features. | YES |
(Optional) Use Cases
* Main success scenarios - high-level user stories
* Alternate flow/scenarios - high-level user stories
* ...
Questions to answer...
How will the user interact with this feature?
Which users will use this and when will they use it?
Is this feature used as part of the current user interface?
Out of Scope
# List of non-requirements or things not included in this feature
# ...
Background, and strategic fit
What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Assumptions
* Are there assumptions being made regarding prerequisites and dependencies?
* Are there assumptions about hardware, software or people resources?
* ...
Customer Considerations
* Are there specific customer environments that need to be considered (such as working with existing h/w and software)?
...
Documentation Considerations
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
OLM is adding a property to the CSV to signal that the operator should clean up the operand on operator uninstall. See https://github.com/operator-framework/enhancements/pull/46
Console will need to add a checkbox to the UI asking the user whether the operand should be cleaned up (with strong warnings about what this means). On delete, console should set the `spec.cleanup` property on the CSV to indicate whether cleanup should happen.
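A minimal sketch of what the console would set on delete, assuming the property lands as a simple flag named spec.cleanup as described in the OLM enhancement (the exact field shape is owned by OLM and may differ):

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0   # hypothetical CSV
  namespace: example-namespace
spec:
  # Set by the console when the user ticks the "clean up operands" checkbox.
  cleanup: true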
Additionally, console needs to be able to show proper status for CSVs that are terminating in the UI so it's clear the operator is being deleted and cleanup is in progress. If there are errors with cleanup, those should be surfaced back through the UI.
Depends on OLM-1733
As a user of OperatorHub, I'd like an improved "status display" for Operators being installed, so I can better understand whether those Operators were actually installed successfully or require additional actions from me to complete the installation.
Improve visibility of Operator installation status on OperatorHub page
OperatorHub page currently shows an Operator as Installed as long as a Subscription object exists for that operator in the current namespace.
This can be misleading because the installation could be stalled or require additional interactions from the user (e.g. "manual upgrade approval") in order to complete the installation.
The console could potentially have some indication of an "in-between" or "requires attention" state for Operators that are in these states, plus links to the actual "Installed Operators" page for more details.
1. BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1899359
2. RFE: https://issues.redhat.com/browse/RFE-1691
Current version is 2.5.1, and we are still on 1.x. Updating the package is required to support the additional validation keywords in CONSOLE-2807.
https://github.com/rjsf-team/react-jsonschema-form/releases
Breaking changes are listed in the v2 notes:
https://github.com/rjsf-team/react-jsonschema-form/releases/tag/v2.0.0
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
Epic Goal
Why is this important?
Acceptance Criteria
Previous Work (Optional)
Done Checklist
References
As an admin, I want to be able to:
so that I can achieve
The agent-based installation for Zero Touch Provisioning has a Custom Resource defined to configure the static networking of the nodes that will be provisioned, e.g.:
apiVersion: agent-install.openshift.io/v1beta1
kind: NMStateConfig
metadata:
  name: mgmt-spoke1
  namespace: mgmt-spoke1
  labels:
    cluster-name: mgmt-spoke1
spec:
  config:
    interfaces:
      - name: bond0
        type: bond
        link-aggregation:
          mode: active-backup
          options:
            miimon: "140"
          slaves:
            - eth0
            - eth1
        state: up
        ipv4:
          enabled: true
          address:
            - ip: 192.168.123.151
              prefix-length: 24
          dhcp: false
        ipv6:
          enabled: false
    dns-resolver:
      config:
        server:
          - 192.168.1.1
    routes:
      config:
        - destination: 0.0.0.0/0
          next-hop-address: 192.168.1.1
          next-hop-interface: bond0
          table-id: 254
  interfaces:
    - name: "eth0"
      macAddress: "00:00:00:00:00:00"
    - name: "eth1"
      macAddress: "00:00:00:00:00:11"
The NMState team is currently working on a Rust library that includes the gc command that assisted-service uses to generate all the configs and then load the one that matches the interfaces. We should reach out to Nick Carboni to check on assisted-service's progress in integrating the new library, and leverage the same code to make sure our ISO can use the same network configuration mechanism.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
We currently support static IPs on Node 0, and this is required in order to get the common IP for the other nodes. We also need to support configuration of static IPs on all of the nodes even though they could also use DHCP for their addresses.
The infraenv controller fetches the NMStateConfigs from the kube-api. Since we don't have the kube-api, we need to read them from the manifests and incorporate them into the InfraEnvCreateParams to create the InfraEnv.
Acceptance criteria:
As an OpenShift infrastructure owner, I need to add host-specific configurations at install time, so that they are applied when the cluster installation is completed.
Especially, but not restricted to, on-prem deployments: hosts need specific configurations beyond the individual host network configuration. Customers automating installs want to avoid day-2 configurations and node reboots, so applying configurations during the installation is a requirement for them. Examples of this are multipath and SCTP on bare metal nodes, where it's not always straightforward to do it on day 2 and reboots are required.
If it is not generated from AgentConfig, we should at least generate a skeleton
There is no harm in supplying the “rd.multipath=default” argument on any host. The effect of this argument is to generate a default /etc/multipath.conf file and to enable the multipathd service. The assisted-service now adds these to its discovery ISOs, and we will do the same with the agent ISO.
Necessary for SCTP
Manifests are placed in <install-config-dir>/openshift and copied to the ISO. (Previously we assumed this would be <install-config-dir>/manifests, but Andrea suggested that openshift would be more consistent.)
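For illustration, an extra manifest enabling SCTP could be a MachineConfig like the sketch below, dropped into <install-config-dir>/openshift (the file name and role are assumptions; this mirrors the documented day-2 approach, applied at install time instead):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-load-sctp-module
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        # Load the sctp kernel module at boot so no day-2 change or extra reboot is needed.
        - path: /etc/modules-load.d/sctp-load.conf
          mode: 420
          overwrite: true
          contents:
            source: data:,sctp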
A client in the ISO submits the manifests through assisted-service API.
REST
Get the ZTP extra manifests into the image and use the REST API below:
/v2/clusters/{cluster_id}/manifests
Acceptance criteria:
CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
Using code from the installer (not code from fleeting), populate the Ignition asset with the data built in to the installer binary.
Currently we use a separate embed.FS (inherited from fleeting) to load the data files that go into the Ignition. We should get rid of this and use the same method as the rest of the installer. We should also use the installer's code to, for example, do templating and convert to Ignition format, and throw away the fleeting code.
Currently it's possible to specify the release version to be installed via the ClusterImageSet manifests.
Since we're working from within the openshift installer, the accepted version should be the one hard-coded in the installer binary (or overridden by the env var).
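For reference, a ClusterImageSet manifest looks like the sketch below (the name and release image are placeholders); with this change, the release image accepted is the one matching the version pinned in the installer binary or the env var override:

apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: openshift-4.11          # placeholder name
spec:
  # Must correspond to the release version hard-coded in the installer binary.
  releaseImage: quay.io/openshift-release-dev/ocp-release:4.11.0-x86_64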
Create installer Assets corresponding to each ZTP manifest, and move the code for reading them from disk into the respective assets.
Create an asset for AgentClusterInstall. Parent assets are install-config.yaml and agent-config.yaml.
From the initial install-config.yaml + agent-config.yaml, generate all the ZTP manifests file required by the create image command.
Dependency: install-config
*Note*: we could evaluate further splitting this task into distinct manifest assets
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a first step for the assets integration, the create image command will need to fetch the required ZTP manifest files from the cluster-manifests folder.
This will allow us to:
1) Get the manifest files from the right location
2) Seamlessly integrate the create image command with the create cluster-manifests one while the tasks related to assets generation are still in progress
3) Keep the create image command fully working until the assets generation is completed (users will still be able to manually create/edit the assets in the cluster-manifests folder)
Using git-filter-repo, rewrite the commits in fleeting to place files in their correct locations in the installer. The resulting commits can then be merged into the agent branch of the installer with a pull request.
Data files should be moved to e.g. data/data/agent, appending the suffix .template to any that are templated.
Code files that are needed by the installer should be moved to appropriate directories that have the agent team in the OWNERS.
Keep the git-filter-repo script so that development can continue in parallel on fleeting until we are ready to switch CI over to the installer implementation.
Add a subcommand to create the ephemeral ISO.
Create Agent ISO and Agent Ignition assets in the installer, and use them to generate a customized ISO.
This story is just for implementing the mechanics, filling in the ignition will be left to another story.
A cli subcommand that:
Check that the cluster is ready for installation and send the appropriate REST API call to trigger the installation.
The start-cluster-installation service fails on ConditionPathExists even though the path is created.
[core@master-0 ~]$ sudo systemctl status start-cluster-installation.service
● start-cluster-installation.service - Service that starts cluster installation
   Loaded: loaded (/etc/systemd/system/start-cluster-installation.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
Condition: start condition failed at Wed 2022-05-11 04:40:43 UTC; 32s ago
           └─ ConditionPathExists=/etc/assisted-service/node0 was not met
Also, when the ConditionPath error is fixed, the service later fails with:
start-cluster-installation.sh[2533]: jq: error (at <stdin>:0): Cannot index number with string "status"
Currently we allow the assisted-service to generate the InfraEnv ID automatically when the InfraEnv is created. The agents then have to fetch the list of InfraEnvs from the service to get the ID. This is suboptimal in a number of ways and won't be possible at all once we have authentication enabled on the assisted-service API.
Instead, modify assisted-service to accept an environment variable that contains a fixed InfraEnv ID. Any new InfraEnv created will use this ID (this has the desirable side effect that there can be only one InfraEnv).
Pre-generate a random ID in the command-line tool and store it in the configuration of both the agent and the assisted-service in the ISO.
Using podman kube play from a systemd service isn't ideal in terms of process monitoring, and makes it hard to do stuff like attach volumes. Split the containers out into separate containers (which can all be in the same pod still) that are started by their own systemd services. This will mean decomposing the ConfigMap that passes settings.
Fix the unwanted API call to set API_VIP in case of SNO cluster in start-cluster-installation.service.
{"code":"400","href":"","id":400,"kind":"Error","reason":"API VIP cannot be set with User Managed Networking"}
Create a complete Golang implementation of AGENT-37 and place the code in the assisted-service repo. A new binary should be created in the assisted-service image. The binary will be used in the create-cluster-and-infra-env service.
As a deployer, I want to be able to:
so that I can achieve
Currently the Assisted Service generates the credentials by running the ignition generation step of the openshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.
In the BILLI usage, which takes down the assisted-service before the installation is complete, there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Instead of fmt.Errorf, use a logging library to log the errors and debug information.
A cli subcommand that waits for the cluster to come up. This should be able to reuse the code from the regular openshift-install wait-for install-complete command largely unchanged, but if the k8s API is not available it may be because we're still running the assisted part of the installation, so it probably needs to fall back to checking for that. I'm not sure what assumptions the existing installer command makes about when it is safe to run. Ideally we would keep behaviour relatively consistent.
Ability to perform disconnected first cluster installation in the automated flow
When installing in a disconnected environment where the registries.conf and ca-bundle files have been loaded, these files should be provided to assisted-service as a mount of the mirror/ dir. Assisted-service will update its ignition config from these mounted files.
In order to configure the registry for disconnected installs, the following assets should be created:
RegistriesConfig (read from mirror/registries.conf)
CABundleCertificates (read from mirror/ca-bundle.crt)
We won't be shipping the assisted-ui container. At this point it is blocking the disconnected work, since we don't have an OpenShift container for it in the payload, so it's time to remove it.
The Core OS ISO can be extracted from the release payload using a command like:
oc image extract --file=/coreos/coreos-x86_64.iso quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1dc3c2a644f62049ea4a03fddb9305bc2b929405bf979b7f5e720cfadf327b54
Where the SHA points to the machine-os-images container in the release payload (which can be obtained using oc adm release info --image-for=machine-os-images). Both of these commands require the pull secret for the cluster to be available in your podman config.
We'll need to use equivalent code (hopefully imported from oc or the same library it uses) to fetch the base ISO using the supplied pull secret in the ZTP manifests and store it as an Asset.
Podman creates a pause container on the hosts for the service pod as follows:
$ sudo podman ps
87a02f9ace39 registry.access.redhat.com/ubi8/pause:latest 58 minutes ago Up 58 minutes ago 0.0.0.0:8080->8080/tcp, 0.0.0.0:8090->8090/tcp, 0.0.0.0:8888->8888/tcp 27f9183bfbd9-infra
We should check if this image needs to be mirrored, and figure out if we need to change dev-scripts or add an entry to registries.conf.
Currently assisted service chooses one of the nodes that reach out to it to be the bootstrap node. We need to understand the choice mechanism and to make it reliably choose the node that we want node0 to be.
The bootstrap node already waits for the other nodes before rebooting, we need to make sure that this wait is sufficient for assisted-service as well. Prevent the assisted-service from rebooting the node it is running on until the following conditions are true:
We can try having it reboot into bootstrap while making sure that assisted-service runs after the reboot, but ideally we'd want the node to start bootstrapping without needing the reboot (as per customer/PM demands to minimize reboots).
In the context of METAL-10 there was a proposal to add a file that the agent would check for, such that the presence of this file would inhibit a reboot. We could possibly use the same mechanism here to avoid the need for large-scale changes to how assisted-service itself works (assisted-service would still need to delete the file at the appropriate time, but that is a less-invasive change). However, there are timeouts that have to be considered, so changes to the state machine may be required.
Note that we do want to continue to install to disk on the assisted-service host in parallel with the others, since this is on the critical path slowing down all deployments. Only the reboot should be delayed.
Single-node deployments are an exception to this.
Set the ClusterDeployment CRD to deploy OpenShift in FIPS mode and make sure that after deployment the cluster is set in that mode
In order to install FIPS-compliant clusters, we need to make sure that installconfig + agentconfig based deployments take into account the FIPS config in installconfig.
This task is about passing the config to agentclusterinstall so it makes it into the ISO. Once there, AGENT-374 will give it to assisted-service.
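A minimal sketch of the install-config.yaml input that must flow through to the generated AgentClusterInstall (the cluster name and domain are placeholders):

apiVersion: v1
baseDomain: example.com
metadata:
  name: fips-cluster            # placeholder
# The flag below is what needs to be carried into agentclusterinstall and the ISO.
fips: true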
As an OpenShift infrastructure owner, I want to deploy a cluster zero with RHACM or MCE and have the required components installed when the installation is completed
BILLI makes it easier to deploy a cluster zero. BILLI users know at installation time what the purpose of their cluster is when they plan the installation. Day-2 steps are necessary to install operators, and users, especially when automating installations, want to finish the installation flow with their required components already installed.
As a customer, I want to be able to:
so that I can achieve
Description of criteria:
We are only allowing the user to provide extra manifests to install MCE at this time. We are not adding an option to "install mce" on the command line (or UI)
This requires/does not require a design proposal.
This requires/does not require a feature gate.
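For illustration, the kind of extra manifest mentioned in the criteria above could be an OLM Subscription like the sketch below (the namespace, channel and catalog source are assumptions, and matching Namespace and OperatorGroup manifests would also be needed):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: multicluster-engine
  namespace: multicluster-engine      # assumed target namespace
spec:
  name: multicluster-engine           # MCE package name in the catalog
  channel: stable-2.1                 # assumed channel
  source: redhat-operators
  sourceNamespace: openshift-marketplace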
As a customer, I want to be able to:
so that I can achieve
Description of criteria:
We are only allowing the user to provide extra manifests to install MCE at this time. We are not adding an option to "install mce" on the command line (or UI)
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Support user input consisting of just InstallConfig and AgentConfig
If the node0 IP is specified in agentConfig, it takes precedence over the selection from NMStateConfigs; otherwise, we keep the same heuristic we have now to choose.
If we make the ZTP manifest assets depend on the install-config asset, the install config will effectively be required (and the installer will launch into the interactive CLI questionnaire if it is not present).
We want to use the install-config if it is present, and just use the ZTP manifests if those are present instead. (Note: this appears to conflict with what AGENT-135 says, so one of these stories might be wrong.)
The installer team has more details and can probably suggest a design.
Given an install-config, generate the mirroring config assets (registries.conf and ca-bundle.crt) from the data in it.
Modify the agent-config to accept NMState config for each host.
This could be directly inline, or referenced from a file (either explicitly or by implicitly inferring the filename). This is TBD. We decided to go with `AgentConfig embeds install time node-specific configuration` option https://docs.google.com/document/d/1vCy0LikVPhbGIHF494NHTYsfu85fOiOicR3oB1vlEWI/edit#
Using the NMState data provided, generate the equivalent NMStateConfig manifests in cluster-manifests.
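A sketch of the chosen option, with node-specific network configuration embedded directly in agent-config.yaml (field names follow the direction in the design doc and may still change; host names, MACs and addresses are placeholders):

apiVersion: v1alpha1                  # assumed API version for the agent config
kind: AgentConfig
metadata:
  name: example-cluster
rendezvousIP: 192.168.111.80
hosts:
  - hostname: master-0
    interfaces:
      # MAC-to-name mapping, as in NMStateConfig's spec.interfaces
      - name: eno1
        macAddress: "00:ef:44:21:e6:a5"
    networkConfig:
      # Plain NMState document for this host
      interfaces:
        - name: eno1
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: false
            address:
              - ip: 192.168.111.80
                prefix-length: 24

From this input, the installer would generate one NMStateConfig manifest per host in cluster-manifests.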
Validate the initial config files for the agent installer, ensuring that all the required fields are present and well defined
Given an install-config, convert it to the ZTP manifests that are used to directly populate the Ignition.
This document contains a list of fields and how they match up: https://docs.google.com/document/d/1S4OluK1c-CIma9hmEylPay9ugcqKrD64S7DgiYpufqE/edit
As an OpenShift infrastructure owner, I want to deploy OpenShift clusters with dual-stack IPv4/IPv6
As an OpenShift infrastructure owner, I want to deploy OpenShift clusters with single-stack IPv6
IPv6 and dual-stack clusters are requested often by customers, especially Telco customers. Working with dual-stack clusters is a requirement for many, but it is also a transition path to single-stack IPv6 clusters, which for some of our users is the final destination.
Karim's work proving how agent-based installation can deploy IPv6: IPv6 deploy with agent based installer
For dual-stack installations the agent-cluster-install.yaml must have both an IPv4 and an IPv6 subnet in networking.MachineNetwork or assisted-service will throw an error. This field is in InstallConfig but it must be added to agent-cluster-install in its Generate().
For IPv4 and IPv6 installs, setting up the MachineNetwork is not needed, but it also does not cause problems if it's set, so it should be fine to set it at all times.
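A sketch of the dual-stack machineNetwork entries that Generate() needs to add to agent-cluster-install.yaml (the subnets shown are placeholders):

apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: example-cluster
spec:
  networking:
    machineNetwork:
      # Both an IPv4 and an IPv6 subnet must be present for dual-stack.
      - cidr: 192.168.111.0/24
      - cidr: 2001:db8:0:1::/64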
As a user I would like to see all the events that the autoscaler creates, even duplicates. Having the CAO set this flag will allow me to continue to see these events.
We have carried a patch for the autoscaler that would enable the duplication of events. This patch can now be dropped because the upstream added a flag for this behavior in https://github.com/kubernetes/autoscaler/pull/4921
Add GA support for deploying OpenShift to IBM Public Cloud
Complete the existing gaps to make OpenShift on IBM Cloud VPC (Next Gen2) Generally Available
Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.
Requirements | Notes | IS MVP |
---|---|---|
Discover new offerings in Home Dashboard | | Y |
Access details outlining value of offerings | | Y |
Access step-by-step guide to install offering | | N |
Allow developers to easily find and use newly installed offerings | | Y |
Support air-gapped clusters | | Y |
Discovering solutions that are not available for installation on cluster
No known dependencies
Background, and strategic fit
None
Quick Starts
Developers using Dev Console need to be made aware of the RH developer tooling available to them.
Provide awareness to developers using Dev Console of the RH developer tooling that is available to them, including:
Consider enhancing the +Add page and/or the Guided tour
Provide a Quick Start for installing the Cryostat Operator
To increase usage of our RH portfolio
This issue is to handle the PR comment - https://github.com/openshift/console-operator/pull/770#pullrequestreview-1501727662 for the issue https://issues.redhat.com/browse/ODC-7292
In testing dual-stack on vSphere we discovered that kubelet will not allow us to specify two IPs on any platform except bare metal. We have a couple of options to deal with that:
As a developer
I want OpenShift builds to support cgroups v2
So that I can run OpenShift builds on clusters that have cgroups v2 enabled
None - this is an implementation detail which should not impact end-users directly.
Originally filed in https://bugzilla.redhat.com/show_bug.cgi?id=1949438
Add support for custom security groups to be attached to control plane and compute nodes at installation time.
Allow the user to provide existing security groups to be attached to the control plane and compute node instances at installation time.
The user will be able to provide a list of existing security groups to the install config manifest that will be used as additional custom security groups to be attached to the control plane and compute node instances at installation time.
The installer won't be responsible for creating any custom security groups; these must be created by the user before the installation starts.
We do have users/customers with specific requirements to add additional network rules to every instance created in AWS. For OpenShift, these additional rules currently need to be added manually on day 2, as the Installer doesn't provide the ability to attach custom security groups to any instance at install time.
MachineSets already support adding a list of existing custom security groups, so this could already be automated at install time by manually editing each MachineSet manifest before starting the installation, but even for these cases the Installer doesn't allow the user to provide this information so that the list of security groups is added to the MachineSet manifests.
Documentation will be required to explain how this information needs to be provided to the install config manifest as any other supported field.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
additionalSecurityGroupIDs:
  description: AdditionalSecurityGroupIDs contains IDs of
    additional security groups for machines, where each ID
    is presented in the format sg-xxxx.
  items:
    type: string
  type: array
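In install-config terms, attaching the pre-existing groups could look like the sketch below (the group IDs are placeholders; the installer only attaches groups, it does not create them):

controlPlane:
  name: master
  platform:
    aws:
      additionalSecurityGroupIDs:
        - sg-0123456789abcdef0
compute:
  - name: worker
    platform:
      aws:
        additionalSecurityGroupIDs:
          - sg-0123456789abcdef1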
This requires/does not require a design proposal.
Feature
As an Infrastructure Administrator, I want to deploy OpenShift on vSphere with supervisor (aka master) and worker nodes (from a MachineSet) across multiple vSphere data centers and multiple vSphere clusters, using full-stack automation (IPI) and user-provided infrastructure (UPI).
MVP
Install OpenShift on vSphere using IPI / UPI in multiple vSphere data centers (regions) and multiple vSphere clusters in 1 vCenter, all in the same IPv4 subnet (in the same physical location).
Out of scope
Scenarios for consideration:
Acceptance criteria:
References:
As an OpenShift engineer, make changes to various OpenShift components so that vSphere zonal installation is considered GA.
As an OpenShift engineer, create an additional UPI Terraform configuration for zonal so that it can be tested in CI.
As an OpenShift engineer, deprecate existing vSphere platform spec parameters so that they can eventually be removed in favor of zonal.
As an OpenShift engineer, implement a new job for UPI zonal so that this method of installation is tested.
As an OpenShift engineer, I need to follow the process to move the API from Tech Preview to GA so it can be used by clusters not installed with TechPreviewNoUpgrade.
more to follow...
Create a GCP cloud-specific spec.resourceTags entry in the Infrastructure CRD. This should create and update tags (labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the Infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing in-tree/out-of-tree split of the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
This epic covers the work to apply user defined labels GCP resources created for openshift cluster available as tech preview.
The user should be able to define GCP labels to be applied to the resources created during cluster creation by the installer and by other operators which manage the specific resources. The user will be able to define the required labels in install-config.yaml while preparing the user inputs for cluster creation; these will then be made available in the status sub-resource of the Infrastructure custom resource, which cannot be edited but is available for user reference and will be used by the in-cluster operators for labeling when the resources are created.
Updating/deleting of labels added during cluster creation or adding new labels as Day-2 operation is out of scope of this epic.
List any affected packages or components.
Reference - https://issues.redhat.com/browse/RFE-2017
The enhancement proposed for GCP labels and tags support in OCP requires making use of the latest APIs made available in the Terraform provider for Google, and requires an update to use them.
Acceptance Criteria
The installer generates the Infrastructure CR in the manifests-creation step of the cluster creation process, based on the user-provided input recorded in install-config.yaml. While generating the Infrastructure CR, platformStatus.gcp.resourceLabels should be populated with the user-provided labels (installconfig.platform.gcp.userLabels).
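A sketch of the mapping (label keys and values are examples; the exact shape of resourceLabels is defined by the enhancement, and a map form mirroring userLabels is shown here only for illustration):

# install-config.yaml excerpt provided by the user
platform:
  gcp:
    userLabels:
      team: payments
      cost-center: "1234"

# Infrastructure CR excerpt generated by the installer (status is read-only for users)
status:
  platformStatus:
    type: GCP
    gcp:
      resourceLabels:
        team: payments
        cost-center: "1234"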
Acceptance Criteria
The enhancement proposed for GCP labels support in OCP requires the install-config CRD to be updated to include gcp userLabels for the user to configure; this will be used by the installer to apply the list of labels to each resource it creates, and will also be made available in the Infrastructure CR it generates.
Below is the snippet of the change required in the CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: installconfigs.install.openshift.io
spec:
  versions:
    - name: v1
      schema:
        openAPIV3Schema:
          properties:
            platform:
              properties:
                gcp:
                  properties:
                    userLabels:
                      additionalProperties:
                        type: string
                      description: UserLabels additional keys and values that the
                        installer will add as labels to all resources that it creates.
                        Resources created by the cluster itself may not include these
                        labels.
                      type: object
This change is required for testing the changes of the feature, and should ideally get merged first.
Acceptance Criteria
The installer creates the below list of GCP resources during the create cluster phase, and these resources should have the user-defined labels applied along with the default OCP label kubernetes-io-cluster-<cluster_id>:owned
Resources List
Resource | Terraform API |
---|---|
VM Instance | google_compute_instance |
Image | google_compute_image |
Address | google_compute_address(beta) |
ForwardingRule | google_compute_forwarding_rule(beta) |
Zones | google_dns_managed_zone |
Storage Bucket | google_storage_bucket |
Acceptance Criteria:
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This feature tracks the epics required to get that work done.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
The CVO README is currently aimed at CVO devs. But there are way more CVO consumers than there are CVO devs. We should aim the README at "what does the CVO do for my clusters?", and push the dev docs down under docs/dev/.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
<!--
Please make sure to fill all story details here with enough information so
that it can be properly sized and is immediately actionable. Our Definition
of Ready for user stories is detailed in the link below:
https://docs.google.com/document/d/1Ps9hWl6ymuLOAhX_-usLmZIP4pQ8PWO15tMksh0Lb_A/
As much as possible, make sure this story represents a small chunk of work
that could be delivered within a sprint. If not, consider the possibility
of splitting it or turning it into an epic with smaller related stories.
Before submitting it, please make sure to remove all comments like this one.
-->
USER STORY:
<!--
One sentence describing this story from an end-user perspective.
-->
As a [type of user], I want [an action] so that [a benefit/a value].
DESCRIPTION:
<!--
Provide as many details as possible, so that any team member can pick it up
and start to work on it immediately without having to reach out to you.
-->
Required:
...
Nice to have:
...
ACCEPTANCE CRITERIA:
<!--
Describe the goals that need to be achieved so that this story can be
considered complete. Note this will also help QE to write their acceptance
tests.
-->
ENGINEERING DETAILS:
<!--
Any additional information that might be useful for engineers: related
repositories or pull requests, related email threads, GitHub issues or
other online discussions, how to set up any required accounts and/or
environments if applicable, and so on.
-->
Goal
Productize agent-installer-utils container from https://github.com/openshift/agent-installer-utils
Feature Description
In order to ship the network reconfiguration it would be useful to move the agent-tui to its own image instead of sharing the agent-installer-node-agent one.
Goal
Productize agent-installer-utils container from https://github.com/openshift/agent-installer-utils
Feature Description
In order to ship the network reconfiguration it would be useful to move the agent-tui to its own image instead of sharing the agent-installer-node-agent one.
Currently the `agent create image` command takes care to extract the agent-tui binary (and required libs) from the `assisted-installer-agent` image (shipped in the release as `agent-installer-node-agent`).
Once the agent-tui is available from the `agent-installer-utils` image instead, the installer code will need to be updated accordingly (see https://github.com/openshift/installer/blob/56e85bee78490c18aaf33994e073cbc16181f66d/pkg/asset/agent/image/agentimage.go#L81)
Allow users to interactively adjust the network configuration for a host after booting the agent ISO.
Configure network after host boots
The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.
When the UI is active in the console, event messages that are generated will distort the interface and make it difficult for the user to view the configuration and select options. An example is shown in the attached screenshot.
The openshift-install agent create image command will need to fetch the agent-tui executable so that it can be embedded within the agent ISO. For this reason the agent-tui must be available in the release payload, so that it can be retrieved even when the command is invoked in a disconnected environment.
The node zero ip is currently hard-coded inside set-node-zero.sh.template and in the ServiceBaseURL template string.
ServiceBaseURL is also hard-coded inside:
We need to remove this hard-coding and allow a user to set the node zero IP through the TUI, and have it be reflected by the agent services and scripts.
As a user, I need information about common misconfigurations that may be preventing the automated installation from proceeding.
If we are unable to access the release image from the registry, provide sufficient debugging information to the user to pinpoint the problem. Check for:
Create a systemd service that runs at startup prior to the login prompt and takes over the console. This should start after the network-online target, and block the login prompt appearing until it exits.
This should also block, at least temporarily, any services that require pulling an image from the registry (i.e. agent + assisted-service).
In the console service from AGENT-453, check whether we are able to pull the release image, and display this information to the user before prompting to run nmtui.
If we can access the image, then exit the service if there is no user input after some timeout, to allow the installation to proceed in the automation flow.
Enhance the openshift-install agent create image command so that the agent-nmtui executable will be embedded in the agent ISO
After having created the agent ISO, the agent-nmtui must be added to the ISO using the following approach:
1. Unpack the agent ISO in a temporary folder
2. Unpack the /images/ignition.img compressed cpio archive in a temporary folder
3. Create a new ignition.img compressed cpio archive by appending the agent-nmtui
4. Create a new agent ISO with the updated ignition.img
Implementation note
Portions of code from a PoC located at https://github.com/andfasano/gasoline could be re-used
When running the openshift-install agent create image command, it first needs to extract the agent-tui executable from the release payload into a temporary folder
As our customers create more and more clusters, it will become vital for us to help them support their fleet of clusters. Currently, our users have to use a different interface (the ACM UI) in order to manage their fleet of clusters. Our goal is to provide our users with a single interface that spans managing a fleet of clusters to deep-diving into a single cluster. This means going to a single URL – your Hub – to interact with your OCP fleet.
The goal of this tech preview update is to improve the experience from the last round of tech preview. The following items will be improved:
Key Objective
Providing our customers with a single simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing the fleet to deep-diving into a single cluster.
Why do customers want this?
Why do we want this?
Phase 2 Goal: Productization of the united Console
In order for hub cluster console OLM screens to behave as expected in a multicluster environment, we need to gather "copiedCSVsDisabled" flags from managed clusters so that the console backend/frontend can consume this information.
AC:
Description of problem:
There is a possible race condition in the console operator where the managed cluster config gets updated after the console deployment and doesn't trigger a rollout.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Rarely
Steps to Reproduce:
1. Enable multicluster tech preview by adding the TechPreviewNoUpgrade featureSet to the FeatureGate config. (NOTE: THIS ACTION IS IRREVERSIBLE AND WILL MAKE THE CLUSTER UNUPGRADEABLE AND UNSUPPORTED)
2. Install ACM 2.5+
3. Import a managed cluster using either the ACM console or the CLI
4. Once that managed cluster is showing in the cluster dropdown, import a second managed cluster
Actual results:
Sometimes the second managed cluster will never show up in the cluster dropdown
Expected results:
The second managed cluster eventually shows up in the cluster dropdown after a page refresh
Additional info:
Migrated from bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2055415
Allow configuring compute and control plane nodes across multiple subnets for on-premise IPI deployments. With nodes separated into subnets, also allow using an external load balancer, instead of the built-in one (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.
I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and compute nodes across multiple subnets.
I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that come with the on-prem platforms.
Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.
Customers want the benefits of IPI and automated installations (and avoid UPI) and at the same time when they expect high traffic in their workloads they will design their clusters with external load balancers that will have the VIPs of the OpenShift clusters.
Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.
While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.
Epic | Control Plane with Multiple Subnets | Compute with Multiple Subnets | Doesn't need external LB | Built-in LB |
---|---|---|---|---|
NE-1069 (all-platforms) | ✓ | ✓ | ✓ | ✓ |
NE-905 (all-platforms) | ✓ | ✓ | ✓ | ✕ |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ||
✓ | ✓ | ✓ | ✕ | |
NE-905 (all platforms) | ✓ | ✓ | ✓ | ✕ |
✓ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ |
Workers on separate subnets with IPI documentation
We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow
This procedure works on vSphere too, albeit with no QE CI and no documentation.
External load balancer with IPI documentation
As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers
Customers want to use their own load balancers, while IPI comes with built-in LBs based on keepalived and haproxy.
vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409
As an OpenShift installation admin I want to use the Assisted Installer, ZTP and IPI installation workflows to deploy a cluster that has remote worker nodes in subnets different from the local subnet, while keeping my VIPs on the built-in load balancing services (haproxy/keepalived).
While this request is most common with OpenShift on bare metal, any platform using the ingress operator will benefit from this enhancement.
Customers using platform none run external load balancers and won't need this; this is specific to platforms deployed via AI, ZTP and IPI.
Customers and partners want to install remote worker nodes on day 1. Due to the built-in network services we provide with Assisted Installer, ZTP and IPI that manage the VIP for ingress, we need to ensure that the VIPs remain in the local subnet where they are configured.
The bare metal IPI team added a workflow that allows placing the VIPs on the masters. While this isn't an ideal solution, it is the only option documented:
Configuring network components to run on the control plane
Goal:
As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.
Description:
We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release. This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.
For OpenShift 4.13, this means bumping to 2.6.
As a cluster administrator,
I want OpenShift to include a recent HAProxy version,
so that I have the latest available performance and security fixes.
We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release. This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.
For OpenShift 4.14, this means bumping to 2.6.
Bump the HAProxy version in dist-git so that OCP 4.13 ships HAProxy 2.6.13, with this patch added on top: https://git.haproxy.org/?p=haproxy-2.6.git;a=commit;h=2b0aafdc92f691bc4b987300c9001a7cc3fb8d08. The patch fixes the segfault that was being tracked as OCPBUGS-13232.
This patch is in HAProxy 2.6.14, so we can stop carrying the patch once we bump to HAProxy 2.6.14 or newer in a subsequent OCP release.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). We are trying no-feature-freeze in 4.12: we will do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e., copy all snapshot CRDs from upstream into the operator assets and run `go get -u github.com/kubernetes-csi/external-snapshotter/client/v6` in the operator repo.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
Goal
Allow the OpenShift installer to point to an existing OVA image stored in vSphere, replacing the current method that uploads the OVA template every time an OpenShift cluster is installed.
Why is this important?
This improvement makes the installation more efficient by not having to upload an OVA from where openshift-install is running every time a cluster is installed, saving time and bandwidth. For example, if an administrator is installing over a VPN, the OVA is currently uploaded through it for every OpenShift cluster installation. With a centralised OVA ready to use, new clusters can be installed without uploading it from where the installer is run.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.
To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).
OCPBU-5: Phase 1
OCPBU-510: Phase 2
OCPBU-329: Phase.Next
Phase 1
Phase 2
Phase 3
As a user I want to use the openshift installer to create clusters of platform type External so that I can use openshift more effectively on a partner provider platform.
To fully support the External platform type for partners and users, the installer needs to recognize the external platform type in the install-config.yaml and properly populate the resulting infrastructure config object with the external platform type and platform name.
As defined in https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L241 , the external platform type allows the user to specify a name for the platform. This card is about updating the installer so that a user can provide both the external type and a platform name that will be expressed in the infrastructure manifest.
Aside from this information, the installer should continue with a normal platform "None" installation.
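Purely as an illustration of the intended flow (the exact schema is decided by the installer team), such an install-config stanza might look roughly like the sketch below, with `platformName` being the user-provided name that ends up in the infrastructure manifest:

```yaml
# Hypothetical install-config.yaml fragment using the External platform type.
# Field names follow the external platform spec in openshift/api; values are examples only.
apiVersion: v1
baseDomain: example.com
metadata:
  name: partner-cluster
platform:
  external:
    platformName: acme-cloud   # user-provided platform name surfaced in the Infrastructure object
pullSecret: '...'
sshKey: '...'
```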
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1.https://issues.redhat.com/browse/ARMOCP-346
Open questions::
1. …
Done Checklist
As an OCP administrator, I would like to deploy OCP on arm64 BM with the agent installer
Acceptance Criteria
Dev:
Jira Admin
QE
Docs:
Agent Installer
Support OpenShift installation in AWS Shared VPC [1] scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.
As a user I need to use a Shared VPC [1] when installing OpenShift on AWS into an existing VPC. This will at least require the use of a preexisting Route53 hosted zone, because as the "participant" user of the shared VPC I am not allowed to create Route53 private zones automatically.
The Installer is able to successfully deploy OpenShift on AWS with a Shared VPC [1], and the cluster is able to successfully pass osde2e testing. This will include at least the scenario where the private hosted zone belongs to a different account (Account A) than the cluster resources (Account B).
[1] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
I want
so that I can
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This top level feature is going to be used as a placeholder for the IBM team who is working on new features for this integration in an effort to keep in sync their existing internal backlog with the corresponding Features/Epics in Red Hat's Jira.
With this BYON support:
`networkResourceGroupName` NOT specified ==> non-BYON install scenario
`networkResourceGroupName` specified ==> BYON install scenario (required for BYON scenario)
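For illustration, a BYON install-config fragment carrying the pre-existing network could look like the sketch below; the resource group, network and subnet names are placeholders:

```yaml
# Sketch of the BYON scenario: networkResourceGroupName plus the existing network pieces.
platform:
  azure:
    region: eastus
    baseDomainResourceGroupName: example-dns-rg
    networkResourceGroupName: example-network-rg   # presence of this field selects the BYON path
    virtualNetwork: example-vnet
    controlPlaneSubnet: example-master-subnet
    computeSubnet: example-worker-subnet
```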
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description/Acceptance Criteria:
Pre-Work Objectives
Since some of our requirements from the ACM team will not be available for the 4.12 timeframe, the team should work on anything we can get done in the scope of the console repo so that when the required items are available in 4.13, we can be more nimble in delivering GA content for the Unified Console Epic.
Overall GA Key Objective
Providing our customers with a single, simplified user experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing the fleet as well as deep diving into a single cluster.
Why do customers want this?
Why do we want this?
Phase 2 Goal: Productization of the unified Console
As a developer I would like to disable clusters like *KS that we can't support for multi-cluster (for instance because we can't authenticate). The ManagedCluster resource has a vendor label that we can use to know if the cluster is supported.
cc Ali Mobrem Sho Weimer Jakub Hadvig
UPDATE: 9/20/22 : we want an allow-list with OpenShift, ROSA, ARO, ROKS, and OpenShiftDedicated
Acceptance criteria:
Key Objective
Providing our customers with a single, simplified user experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing the fleet as well as deep diving into a single cluster.
Why do customers want this?
Why do we want this?
Phase 1 Goal: Get something to market (OCP 4.8, ACM 2.3)
Phase 1 —> OCP deploys ACM Hub Operator —> ACM Perspective becomes available —> User can switch between ACM multi-cluster view and local OCP Console —> With SSO the user does not have to log in twice
Phase 2 Goal: Productization of the unified Console (OCP 4.9, ACM 2.4)
Phase 2 Use Cases:
We need to coordinate with the ACM team so that the masthead looks the same when switching between contexts. This might require us to consume a common masthead component in OCP console.
The ACM team will need to honor our custom branding configuration so that the logo does not change when switching contexts.
Known differences:
Open questions:
Key Objective
Providing our customers with a single, simplified user experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing the fleet as well as deep diving into a single cluster.
Why do customers want this?
Why do we want this?
Phase 2 Goal: Productization of the unified Console
We need a way to show metrics for workloads running on spoke clusters. This depends on ACM-876, which lets the console discover the monitoring endpoints.
Open Issues:
We will depend on ACM to create a route on each spoke cluster for the prometheus tenancy service, which is required for metrics for normal users.
Openshift console backend should proxy managed cluster monitoring requests through the MCE cluster proxy addon to prometheus services on the managed cluster. This depends on https://issues.redhat.com/browse/ACM-1188
The console operator should build up a set of the cluster nodes' OS types and supply it to the console, so that the console renders only operators that could be installed on the cluster.
This will be needed once we support different OS types on the cluster.
We need to scan through the compute nodes and build a set of supported OSes from those. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.
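For reference, the per-node labels the console operator would inspect look like this excerpt (values are typical examples):

```yaml
# Excerpt of a Node object; the kubelet sets these well-known OS/architecture labels.
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  labels:
    kubernetes.io/os: linux
    kubernetes.io/arch: amd64
```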
AC:
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can be shared across namespaces | | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces, to prevent cluster admins from having to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.
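A minimal sketch of how such sharing could look with the Shared Resource CSI driver is shown below; the secret name and namespace here are assumptions for illustration:

```yaml
# Hypothetical SharedSecret exposing the entitlement secret to other namespaces
# via the Shared Resource CSI driver.
apiVersion: sharedresource.openshift.io/v1alpha1
kind: SharedSecret
metadata:
  name: etc-pki-entitlement
spec:
  secretRef:
    name: etc-pki-entitlement
    namespace: openshift-config-managed
```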
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Many enterprises have strict security policies where all software must be pulled from a trusted or private source. For these scenarios, the RHCOS image used to bootstrap the cluster usually comes from shared public locations that some companies don't accept as a trusted source.
Questions to be addressed:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Description of problem:
ARO needs to copy RHCOS image blobs to their own Azure Marketplace offering since, as a first-party Azure service, they must not request anything from outside of Azure and must consume RHCOS VM images from a trusted source (the marketplace). To meet these requirements the ARO team currently does the following as part of the release process:
1. Mirror container images from quay.io to Azure Container Registry to avoid leaving Azure boundaries.
2. Copy the VM image from the blob in someone else's Azure subscription into a blob in the subscription the ARO team manages, and then publish a VM image on Azure Marketplace (publisher: azureopenshift, offer: aro4; see `az vm image list --publisher azureopenshift --all`). We do not bill for these images.
The usage of Marketplace images in the installer was already implemented as part of CORS-1823. This single line [1] needs to be refactored to enable ARO from the installer code perspective: on ARO we don't need to set the type to AzureImageTypeMarketplaceWithPlan. However, in OCPPLAN-7556 and the related CORS-1823 it was mentioned that using Marketplace images is out of scope for nodes other than compute. For ARO we need to be able to use Marketplace images for all nodes.
[1] https://github.com/openshift/installer/blob/f912534f12491721e3874e2bf64f7fa8d44aa7f5/pkg/asset/machines/azure/machines.go#L107
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Set the RHCOS image from Azure Marketplace in the install-config
2. Deploy a cluster
Actual results:
Only compute nodes use the Marketplace image.
Expected results:
All nodes created by the Installer use RHCOS image coming from Azure Marketplace.
Additional info:
A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.
https://issues.redhat.com/browse/CORS-1103
As a user, I want to be able to:
so that I can achieve
A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.
Some background on the Licenses field:
https://github.com/openshift/installer/pull/3808#issuecomment-663153787
https://github.com/openshift/installer/pull/4696
So we do not want to allow licenses to be specified when pre-built images are specified (current behaviour); it's up to customers to create a custom image with the licenses embedded and supply that to the Installer. Since we don't need to specify licenses for RHCOS images anymore, the Licenses field is useless and should be deprecated.
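For comparison, the existing AWS mechanism referenced above looks roughly like this in install-config.yaml (the AMI ID is a placeholder):

```yaml
# Existing AWS pattern for overriding the RHCOS image per machine pool.
controlPlane:
  name: master
  platform:
    aws:
      amiID: ami-0123456789abcdef0   # placeholder custom RHCOS AMI
compute:
- name: worker
  platform:
    aws:
      amiID: ami-0123456789abcdef0   # placeholder custom RHCOS AMI
```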
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Background
Issue
The default instance type the installer currently chooses for Single Node OpenShift clusters doesn't follow our documented minimum requirements.
Solution
When the number of replicas of the ControlPlane pool is 1, the installer will now choose `2xlarge` instead of `xlarge`.
Caveat
`2xlarge` has 32GiB of RAM, which is twice as much as we need, but it's the best we can do to meet the minimum single-node requirements, because AWS doesn't offer a 16GiB RAM instance type with 8 cores.
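For context, the single-node shape this applies to is an install-config with one control-plane replica and no workers, roughly:

```yaml
# Single-node topology: one control-plane replica and zero workers.
# This is the shape for which the installer now defaults to the larger AWS instance type.
controlPlane:
  name: master
  replicas: 1
compute:
- name: worker
  replicas: 0
```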
Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.
To enable full support for control plane machine sets on GCP
Any other cloud platforms
Feature created from split of overarching Control Plane Machine Set feature into single release based effort
n/a
Nothing outside documentation that shows the GCP platform is supported as part of Control Plane Machine Sets
n/a
Goal:
Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem:
There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important:
Lifecycle Information:
Previous Work:
Dependencies:
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
As a developer, I want to be able to:
so that I can achieve
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
More details at ARO managed identity scope and impact.
This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
As a cluster admin I want to be able to:
so that I can
Description of criteria:
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Implement the CVO portions of the spike: add a field in the ClusterVersion spec to request the target architecture.
Create an Azure cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in Azure) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing in-tree/out-of-tree split on the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
Remove the code references, listed below, that mark Azure Tags as TechPreview.
Goal:
Support migration from dual-stack IPv6 to single-stack IPv6.
Why is this important?
We have customers who want to deploy a dual stack cluster and then (eventually) migrate to single stack ipv6 once all of their ipv4 dependencies are eliminated. Currently this isn't possible because we only support ipv4-primary dual stack deployments. However, with the implementation of OPNET-1 we addressed many of the limitations that prevented ipv6-primary, so we need to figure out what remains to make this supported.
At the very least we need to remove the validations in the installer that require ipv4 to be the primary address. There will also be changes needed in dev-scripts to allow testing (an option to make the v6 subnets and addresses primary, for example).
We have customers who want to deploy a dual stack cluster and then migrate to single stack ipv6 once all of their ipv4 dependencies are eliminated. Currently this isn't possible because we only support ipv4-primary dual stack deployments. However, with the implementation of OPNET-1 we addressed many of the limitations that prevented ipv6-primary, so we need to figure out what remains to make this supported. At the very least we need to remove the validations in the installer that require ipv4 to be the primary address. There will also be changes needed in dev-scripts to allow testing (an option to make the v6 subnets and addresses primary, for example).
The installer currently enforces ipv4-primary for dual-stack deployments. We will need to remove/modify those validations to allow an ipv6-primary configuration.
Runtimecfg assumes ipv4-primary in some places today, and we need to make it aware of whether a cluster is v4- or v6-primary.
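A rough sketch of what an IPv6-primary dual-stack networking stanza could look like once those validations are relaxed (ordering of the entries determines the primary family; CIDRs are examples):

```yaml
# IPv6 entries listed first would make the cluster IPv6-primary (currently rejected by the installer).
networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: fd01::/48
    hostPrefix: 64
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - fd02::/112
  - 172.30.0.0/16
  machineNetwork:
  - cidr: fd2e:6f44:5dd8:c956::/120
  - cidr: 192.168.111.0/24
```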
Create an Azure cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in Azure) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing in-tree/out-of-tree split on the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
This epic covers the work to apply user-defined tags to Azure resources created for the OpenShift cluster, available as Tech Preview.
The user should be able to define the Azure tags to be applied to the resources created during cluster creation by the installer and by the other operators that manage the specific resources. The user defines the required tags in install-config.yaml while preparing the inputs for cluster creation; these tags are then made available in the status sub-resource of the Infrastructure custom resource, which cannot be edited but is available for user reference and is used by the in-cluster operators for tagging when the resources are created.
Updating/deleting of tags added during cluster creation or adding new tags as Day-2 operation is out of scope of this epic.
List any affected packages or components.
Reference - https://issues.redhat.com/browse/RFE-2017
The installer creates the list of resources below during the create-cluster phase; these resources should be tagged with the user-defined tags and the default OCP tag kubernetes.io/cluster/<cluster_name>:owned.
Resources List
Resource | Terraform API |
---|---|
Resource group | azurerm_resource_group |
Image | azurerm_image |
Load Balancer | azurerm_lb |
Network Security Group | azurerm_network_security_group |
Storage Account | azurerm_storage_account |
Managed Identity | azurerm_user_assigned_identity |
Virtual network | azurerm_virtual_network |
Virtual machine | azurerm_linux_virtual_machine |
Network Interface | azurerm_network_interface |
Private DNS Zone | azurerm_private_dns_zone |
DNS Record | azurerm_dns_cname_record |
Acceptance Criteria:
The installer generates the Infrastructure CR in the manifests-creation step of the cluster creation process, based on the user-provided input recorded in install-config.yaml. While generating the Infrastructure CR, platformStatus.azure.resourceTags should be populated with the user-provided tags (installconfig.platform.azure.userTags).
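As a rough illustration of that mapping (tag keys and values are made up):

```yaml
# install-config.yaml input: user-provided tags
platform:
  azure:
    userTags:
      department: finance
      cost-center: "1234"
# Corresponding fragment of the generated Infrastructure CR (config.openshift.io/v1, name "cluster")
# status:
#   platformStatus:
#     type: Azure
#     azure:
#       resourceTags:
#       - key: department
#         value: finance
#       - key: cost-center
#         value: "1234"
```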
Acceptance Criteria
Issues found by the QE team during pre-merge tests are reported in the QE Tracker and should be fixed.
Acceptance criteria:
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). We are trying no-feature-freeze in 4.12: we will do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e., copy all snapshot CRDs from upstream into the operator assets and run `go get -u github.com/kubernetes-csi/external-snapshotter/client/v6` in the operator repo.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.
To enable full support for control plane machine sets on Azure
Any other cloud platforms
Feature created from split of overarching Control Plane Machine Set feature into single release based effort
n/a
Nothing outside documentation that shows the Azure platform is supported as part of Control Plane Machine Sets
n/a
Goal:
Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem:
There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important:
Lifecycle Information:
Previous Work:
Dependencies:
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
As a developer, I want to be able to:
so that I can achieve
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Today we don't provide any samples for developers of serverless functions.
Provide Serverless Function samples in the sample catalog. These would be utilizing the Builder Image capabilities.
As an operator author, I want to provide additional samples that are tied to an operator version, not an OpenShift release. For that, I want to create a resource to add new samples to the web console.
As Arm adoption grows, OpenShift on Arm is a key strategic initiative for Red Hat. Key to its success is support from all key cloud providers adopting this technology. Google has announced support for Arm in their GCP offering, and we need to support OpenShift in this configuration.
The ability to have OCP on Arm running in a GCP instance
OCP on Arm running in a GCP instance
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Description:
Update 4.14 documentation to reflect new GCP support on ARM machines.
Updates:
Acceptance criteria:
Description:
In order to add instance types to the OCP documentation, there needs to be a .md file in the OpenShift installer repo that contains the 64-bit ARM machine types that have been tested and are supported on GCP.
Create a PR in the OpenShift installer repo that creates a new .md file that shows the supported instance types
Acceptance criteria:
Extend the Workload Partitioning feature to support multi-node clusters.
Customers running RAN workloads on C-RAN Hubs (i.e. multi-node clusters) that want to maximize the cores available to the workloads (DU) should be able to utilize WP to isolate CP processes to reserved cores.
Requirements
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
< How will the user interact with this feature? >
< Which users will use this and when will they use it? >
< Is this feature used as part of current user interface? >
< What does the person writing code, testing, documenting need to know? >
< Are there assumptions being made regarding prerequisites and dependencies?>
< Are there assumptions about hardware, software or people resources?>
< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>
< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>
<Does the Feature introduce data that could be gathered and used for Insights purposes?>
< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >
< What does success look like?>
< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact>
< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>
< Which other products and versions in our portfolio does this feature impact?>
< What interoperability test scenarios should be factored by the layered product(s)?>
Question | Outcome |
The check on the TechPreview FeatureSet is no longer needed in the installer; remove the check from the code.
Since this feature must be turned on ONLY at install time and cannot be turned off, the best place we've found to set the Infrastructure.Status option is through the OpenShift installer. This has a few benefits, the primary one being simplifying how this feature gets used by upstream teams such as Assisted Installer and ZTP. If we expose this option as an install-config field, it becomes trivial for those consumers to support turning on this feature at install time.
We'll need to update the openshift installer configuration option to support a flag for CPU Partitioning at install time.
We'll need to add a new flag to the InstallConfig
cpuPartitioningMode: None | AllNode
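A minimal install-config sketch of that flag, following the spelling in this card (the field placement and final enum values may differ in the shipped API):

```yaml
# Hypothetical install-config.yaml fragment enabling CPU partitioning at install time.
apiVersion: v1
baseDomain: example.com
metadata:
  name: partitioned-cluster
cpuPartitioningMode: AllNode   # per this card; the default would be None
```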
Add support to Installer to bootstrap cluster with the configurations for CPU Partitioning based off of the infrastructure flag and NTO generated configurations.
We need to call NTO bootstrap render during the bootstrap cycle. This will follow the same pattern that MCO follows and other components that render during bootstrap.
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
For users who are using OpenShift but have not yet begun to explore multicluster and what we offer them.
I'm investigating where Learning paths are today and what is required.
As a user I'd like to have learning path for how to get started with Multicluster.
Install MCE
Create multiple clusters
Use HyperShift
Provide access to cluster creation to devs via templates
Scale up to ACM/ACS (OPP?)
Status
https://github.com/patternfly/patternfly-quickstarts/issues/37#issuecomment-1199840223
Goal: Resources provided via the Dynamic Resource Allocation Kubernetes mechanism can be consumed by VMs.
Details: Dynamic Resource Allocation
Come up with a design of how resources provided by Dynamic Resource Allocation can be consumed by KubeVirt VMs.
The Dynamic Resource Allocation (DRA) feature is an alpha API in Kubernetes 1.26, which is the base for OpenShift 4.13.
This feature provides the ability to create ResourceClaim and ResourceClass objects to request access to resources. This is similar to the dynamic provisioning of a PersistentVolume via PersistentVolumeClaim and StorageClass.
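Roughly, the alpha API in Kubernetes 1.26 looks like the sketch below (the driver and class names are placeholders, not part of this feature):

```yaml
# Rough sketch of the DRA alpha objects: a ResourceClass provided by a driver,
# and a ResourceClaim that a pod (or VM pod) would reference.
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: example-gpu-class
driverName: gpu.example.com   # placeholder DRA driver name
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: example-gpu-claim
spec:
  resourceClassName: example-gpu-class
```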
NVIDIA has been a lead contributor to the KEP and already has an initial implementation of a DRA driver and plugin, with a nice demo recording. NVIDIA expects to have this DRA driver available in CY23 Q3 or Q4, so likely in NVIDIA GPU Operator v23.9, around OpenShift 4.14.
When asked about the availability of MIG-backed vGPU for Kubernetes, NVIDIA said that the timeframe is not decided yet, because it will likely use DRA for the MIG device creation and their registration with the vGPU host driver. The MIG-based vGPU feature for OpenShift Virtualization will then likely require support of DRA to request vGPU resources for the VMs.
Not having MIG-backed vGPU is a risk for OpenShift Virtualization adoption in GPU use cases, such as virtual workstations for rendering with Windows-only software. Customers who want to have a mix of passthrough, time-based vGPU and MIG-backed vGPU will prefer competitors who offer the full range of options. And the certification of NVIDIA solutions like NVIDIA Omniverse will be blocked, despite a great potential to increase OpenShift consumption, as it uses RTX/A40 GPUs for virtual workstations (not certified by NVIDIA on OpenShift Virtualization yet) and A100/H100 for physics simulation, both use cases probably leveraging vGPUs [7]. There are many necessary conditions for that to happen, and MIG-backed vGPU support is one of them.
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
Part of making https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation available for early adoption.
Story: As an OpenShift admin I want the internal registry of the cluster use storage from Azure Stack Hub so that I can run a fully supported OpenShift environment on that infrastructure provider.
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.
The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:
Requirement | Notes | isMvp? |
---|---|---|
UI to enable and disable plugins | YES | |
Dynamic Plugin Framework in place | YES | |
Testing Infra up and running | YES | |
Docs and read me for creating and testing Plugins | YES | |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Documentation Considerations
Questions to be addressed:
The dynamic plugins enhancement describes a `disable-plugins` query parameter for disabling specific console plugins.
This has no effect on static plugins, which are built into the Console application.
We need a UI for enabling and disabling dynamic plugins. The plugins will be discovered either through a custom resource or an annotation on the operator CSV. The enabled plugins will be persisted through the operator config (consoles.operator.openshift.io).
This story tracks enabling and disabling the plugin during operator install through Cluster Settings. This is needed in the future if a plugin is installed outside of an OLM operator.
UX design: https://github.com/openshift/openshift-origin-design/pull/536
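For reference, enabled plugins are persisted as a list in the console operator config; a minimal sketch (the plugin name is a placeholder):

```yaml
# Console operator config (consoles.operator.openshift.io "cluster") listing enabled dynamic plugins.
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  name: cluster
spec:
  plugins:
  - my-dynamic-plugin   # placeholder plugin name
```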
We need to support localization of dynamic plugins. The current proposal is to have one i18n namespace per dynamic plugin with a fixed name: `${plugin-name}-plugin`. Since console will know the list of plugins on startup, it can add these namespaces to the i18next config.
The console backend will need to implement an endpoint at the i18next load path. The endpoint will see if the namespace matches the known plugin namespaces. If so, it will proxy to the plugin. Otherwise it will serve the static file from the local filesystem.
Requirement | Notes | isMvp? |
---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
During master node upgrades, when nodes are getting drained, there is currently no protection against two or more operands going down at the same time. If your component is required to be available during upgrades or other voluntary disruptions, please consider deploying a PDB to protect your operands (a sketch follows below).
The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.
Example:
Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster
Note: We should consider backporting this to 4.10
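A minimal sketch of the kind of PDB the console-operator could create for the console pods (namespace and labels reflect typical console deployments and are assumptions here):

```yaml
# Hypothetical PodDisruptionBudget keeping at least one console pod available during node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: console
  namespace: openshift-console
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: console
      component: ui
```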
When OCP is performing a cluster upgrade, the user should be notified about it.
There are two possibilities for how to surface the cluster upgrade to users:
AC:
Note: We need to decide if we want to distinguish this particular notification by a different color? ccing Ali Mobrem
Created from: https://issues.redhat.com/browse/RFE-3024
As an admin, I want to be able to access the node logs from the node's details page in order to troubleshoot what is going on with the node.
We should support getting node logs for different units for node journal logs and evaluate the other CLI flags.
We currently have a gap with the CLI:
We need to investigate whether the k8s API supports WebSockets for streaming node logs.
Currently we are showing system projects within the list view of the Projects page. As stated in https://issues.redhat.com/browse/RFE-185, there are many projects that are considered system projects and are not important to the user. The value should be remembered across sessions, but it is something we should be able to toggle directly from the list.
In OpenShift, reserved namespaces are `default`, `openshift`, and those that start with `openshift-`, `kubernetes-`, or `kube-`.
Goal
By default the Cluster Utilization card should not include metrics from `master` nodes in its queries for CPU, Memory, Filesystem, Network, and Pod count.
A new filter option should allow users to toggle back to the combined view that the Cluster Utilization card shows today, which is mostly useful on small clusters where masters are schedulable for user workloads.
Assets
Background
As discussed in this thread, the `kube_node_role` metric available since 4.3 should allow us to filter the card's PromQL queries to not include master node metrics.
This filtered view would likely make the card's data more useful for users who aren't running their workloads on masters, like OpenShift Dedicated users.
As noted by some folks during design discussions, this filter isn't perfect, and wouldn't filter out the data from "Infra" nodes that users may have set up using labels/taints. Until we determine a good way to provide more advanced filtering, this basic "Include masters" checkbox is still more flexible than what the card offers today.
Requirements
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.
This feature will be used to track all the CAPI preparation work that is common for all the supported providers
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
PoC & design for running CAPI control plane using binaries.
Epic Goal*
Provide a long term solution to SELinux context labeling in OCP.
Why is this important? (mandatory)
As of today, when SELinux is enabled, the PV's files are relabeled when attaching the PV to the pod. This can cause timeouts when the PV contains a lot of files, and it can also overload the storage backend.
https://access.redhat.com/solutions/6221251 provides a few workarounds until the proper fix is implemented. Unfortunately these workarounds are not perfect, and we need a long-term, seamless, optimised solution.
This feature tracks the long-term solution where the PV filesystem will be mounted with the right SELinux context, thus avoiding relabeling every file.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
As we are relying on the mount context, there should not be any relabeling (chcon), because all files/folders will inherit the context from the mount context.
More on design & scenarios in the KEP and related epic STOR-1173
Dependencies (internal and external) (mandatory)
None for the core feature
However the driver will have to set SELinuxMountSupported to true in the CSIDriverSpec to enable this feature.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
This Epic is to track upstream work in the Storage SIG community
This Epic is to track the SELinux specific work required. fsGroup work is not included here.
Goal:
Continue contributing to and help move along the upstream efforts to enable recursive permissions functionality.
Finish current SELinuxMountReadWriteOncePod feature upstream:
The feature is probably going to stay alpha upstream.
Problem:
Recursive permission change takes very long for fsGroup and SELinux. For volumes with many small files Kubernetes currently does a chown for every file on the volume (due to fsGroup). Similarly for container runtimes (such as CRI-O) a chcon of every file on the volume is performed due to SCC's SELinux context. Data on the volume may already have the correct GID/SELinux context so Kubernetes needs way to detect this automatically to avoid the long delay.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Customers:
Open questions:
Notes:
As an OCP developer (and as an OCP user in the future), I want all CSI drivers shipped as part of OCP to support mounting with -o context=XYZ, so I can test with CSIDriver.SELinuxMount: true (or so my pods run without CRI-O recursively relabeling my volume).
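In practice this is driven by the CSIDriver object; a minimal sketch of a driver opting in, assuming the upstream field spelling `seLinuxMount` and a placeholder driver name:

```yaml
# Hypothetical CSIDriver advertising that its volumes can be mounted with -o context=...
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.example.com
spec:
  seLinuxMount: true
```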
In detail:
Exit criteria:
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration it allows any device to get onto the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI customers would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration it allows any device to get onto the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI customers would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift
prerequisite work Goals
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
Phases 1 & 2 cover implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
As an OpenShift engineer I want to be able to install the new manifest generation tool as a standalone tool in my CAPI Infra Provider repo to generate the CAPI Provider transport ConfigMap(s)
Renaming of the CAPI Asset/Manifest generator from assets (generator) to manifest-gen, as it won't need to generate go embeddable assets anymore, but only manifests that will be referenced and applied by CVO
As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps
Once the new CAPI manifests generator tool is ready, we want to make use of that directly from the CAPI Providers repositories so we can avoid storing the generated configuration centrally and independently apply that based on the running platform.
The Agent Based installer is a clean and simple way to install new instances of OpenShift in disconnected environments, guiding the user through the questions and information needed to successfully install an OpenShift cluster. We need to bring this highly useful feature to the IBM Power and IBM zSystem architectures
Agent based installer on Power and zSystems should reflect what is available for x86 today.
Able to use the agent based installer to create OpenShift clusters on Power and zSystem architectures in disconnected environments
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
As the multi-arch engineer, I would like to build an environment and deploy using Agent Based installer, so that I can confirm if the feature works per spec.
Acceptance Criteria
Enable openshift-install to create an agent-based install ISO for Power.
As a managed application services developer, I want to install addons, use syncsets, scale nodes and query ingresses, so that I offer Red Hat OpenShift Streams on Azure.
Integration Testing:
Beta:
GA:
GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential risk to stakeholders.
Links to Gdocs, github, and any other relevant information about this epic.
As an ARO customer, I want to be able to:
so that I can
Description of criteria:
The installer will not accept a separate service principal to pass to the cluster as described in HIVE-1794. Instead Hive will write the separate cred into the manifests.
Due to low customer interest in using OpenShift on Alibaba Cloud, we have decided to deprecate and then remove IPI support for Alibaba Cloud.
4.14
Announcement
4.15
Archive code
Add a deprecation warning in the installer code for anyone trying to install Alibaba via IPI
USER STORY:
As a user of the installer binary, I want to be warned that Alibaba support will be deprecated in 4.15, so that I'm prevented from creating clusters that will soon be unsupported.
DESCRIPTION:
Alibaba support will be decommissioned from both IPI and UPI starting in 4.15. We want to warn users of the 4.14 installer binary who pick 'alibabacloud' in the list of providers.
ACCEPTANCE CRITERIA:
A warning message is displayed after choosing 'alibabacloud'.
ENGINEERING DETAILS:
The deprecation of support for the Alibaba Cloud platform is being postponed by one release, so we need to revert SPLAT-1094.
The storage operators need to be automatically restarted after the certificates are renewed.
From OCP doc "The service CA certificate, which issues the service certificates, is valid for 26 months and is automatically rotated when there is less than 13 months validity left."
Since OCP is now offering an 18 months lifecycle per release, the storage operator pods need to be automatically restarted after the certificates are renewed.
The storage operators will be transparently restarted. The customer benefit should be transparent, it avoids manually restart of the storage operators.
The administrator should not need to restart the storage operator when certificates are renew.
This should apply to all relevant operators with a consistent experience.
As an administrator I want the storage operators to be automatically restarted when certificates are renewed.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
This feature request is triggered by the new extended OCP lifecycle. We are moving from 12 to 18 months support per release.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
No doc is required
This feature only covers storage, but the same behavior should be applied to every relevant component.
The pod `openstack-manila-csi-controllerplugin` mounts the secret:
```
$ cat assets/controller.yaml
...
containers:
- name: provisioner-kube-rbac-proxy
  volumeMounts:
  - mountPath: /etc/tls/private
    name: metrics-serving-cert
volumes:
- name: metrics-serving-cert
  secret:
    secretName: manila-csi-driver-controller-metrics-serving-cert
```
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted
As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers
Customers want to use their own load balancers, while IPI comes with built-in LBs based on keepalived and haproxy.
vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409
As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers
Customers want to use their own load balancers, while IPI comes with built-in LBs based on keepalived and haproxy.