DMTN-079: Investigations for Consolidating System Management and Deployment

  • Joel Plutchak

Latest Revision: 2018-05-03

Note

This technote is not yet published.

Notes and recommendations from an investigation into third-party tools for consolidating system deployment and management across LSST enclaves and physical sites.

1   Hosting, Managing, and Consuming Yum Repos: Pakrat, Katello, and More

Managed Yum repositories are very important for the sake of reproducibility and control.

  • The nature of the LSST project does not always allow us to rebuild nodes (e.g., from xCAT) in order to update them, so we must be able to apply Yum updates from a controlled source.
  • We need to be able to (re)build and patch each node up to a state that is consistent with other nodes, so locking repos into “snapshots” is important.
  • We may need to roll back to a previous set of patches in order to recover from an issue, so retaining previous repo “snapshots” is important.
  • We need to be able to “branch” our repos so that dev and test machines see newer “snapshots” while production machines see slightly older ones, so having multiple active “snapshots” is important.

1.1   Solutions for Managing and Hosting Yum Repos

1.1.1   Pakrat + createrepo + web server

1.1.1.1   Overview

Pakrat <https://github.com/ryanuber/pakrat> is a Python-based tool for mirroring and versioning Yum repositories. In our investigation/setup we run it via a wrapper script from cron (originally weekly, now daily). Each time the wrapper script runs, Pakrat syncs several repos and then uses createrepo to rebuild the Yum metadata for each repo. Apache serves out the repos.
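
For orientation, a minimal sketch of the kind of invocation the wrapper performs. The flag names (--name, --baseurl, --repoversion) are taken from Pakrat’s upstream README, and the mirror URLs and destination directory are placeholders; treat all of it as an assumption to check against our actual pakrat.config:

  # Illustrative only -- not the production wrapper.
  # Assumption: Pakrat writes each repo beneath the current working directory,
  # creating a snapshot subdirectory named after --repoversion. A date-based
  # snapshot name is shown here (as in the diagram below); the current wrapper
  # actually uses a Unix epoch timestamp.
  cd /repos/centos/7/x86_64

  pakrat \
    --repoversion "$(date +%Y-%m-%d)" \
    --name base    --baseurl http://mirror.centos.org/centos/7/os/x86_64/ \
    --name updates --baseurl http://mirror.centos.org/centos/7/updates/x86_64/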

Each repo synced by Pakrat consists of:

  • a top-level ‘Packages’ directory - stores RPMs
  • sub-folders for each versioned snapshot - looks like a Yum repo and contains metadata for a given point in time
  • a ‘latest’ symlink pointing to the most recent snapshot (we’re not using this symlink)

Each repo snapshot consists of a symlink to the top-level ‘Packages’ directory and a unique ‘repodata’ metadata sub-folder. The ‘repodata’ is created immediately after syncing on a given date and it only refers to RPMs that were available in ‘Packages’ at that time. As long as ‘repodata’ is not recreated in a given snapshot folder, machines using that snapshot will not see additional RPMs added to ‘Packages’ in the intervening time.

Here’s a diagram of the structure of a repo:

  • repobase/
    • 2017-07-10/
      • Packages -> ../Packages
      • repodata/
    • 2017-07-17/
      • Packages -> ../Packages
      • repodata/
    • ...
    • 2017-07-31/
      • Packages -> ../Packages
      • repodata/
    • latest -> 2017-07-31
    • Packages/

In our investigation/setup we point Yum clients to a specific snapshot so that they are in a consistent, repeatable state. We have the ability to point test clients to a newer snapshot. We have Pakrat set to sync monthly.
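
For illustration, a client .repo file pinned to one of the snapshots above might look like the following. The server hostname and path prefix are made up (in practice the file is generated by Puppet); only the dated snapshot directory follows the layout shown above:

  # /etc/yum.repos.d/base.repo -- illustrative only; hostname and path prefix are placeholders
  [base]
  name=CentOS 7 - Base (Pakrat snapshot 2017-07-31)
  baseurl=http://yum-server.example.org/repos/centos/7/x86_64/base/2017-07-31/
  enabled=1
  gpgcheck=1
  gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7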

1.1.1.2   Storage baselining (~7/3/2017 - 9/20/2017; includes CentOS 7.3 & 7.4)

(The “raw x N” columns give the comparative raw size, i.e., the single-day raw size multiplied by the number of syncs.)

repo                        | raw size, single day | synced daily (~117 days) | raw x 117 days | synced weekly (17 weeks) | raw x 17 weeks | synced monthly (4 months) | raw x 4 months
CentOS: base                | 7.4G | 14G  | ~865.8G    | 12G  | ~125.8G    | 7.4G | ~29.6G
CentOS: centosplus          | 979M | 2.2G | ~111.9G    | 1.6G | ~16.3G     | 1.1G | ~3.8G
CentOS: extras              | 1.4G | 1.7G | ~163.8G    | 1.6G | ~23.8G     | 1.4G | ~5.6G
CentOS: updates             | 6.5G | 11G  | ~760.5G    | 9.4G | ~110.5G    | 7.2G | ~26G
EPEL: epel                  | 13G  | 19G  | ~1.49T     | 17G  | ~221G      | 16G  | ~52G
Puppet Labs: puppetlabs-pc1 | 2.3G | 2.7G | ~269.1G    | 2.5G | ~39.1G     | 2.4G | ~9.2G
TOTAL                       | 31G  | 50G  | 3.54-3.61T | 43G  | 527-536.5G | 36G  | 124-126.2G

1.1.1.3   Storage baselining (~7/3/2017 - 7/8/2017)

repo                        | raw size, single day | synced daily (~35 days) | raw x 35 days | synced weekly (6 weeks) | raw x 6 weeks | synced monthly (2 months) | raw x 2 months
CentOS: base                | 7.4G | 8.3G | ~269G           | 7.5G | ~44.4G     | 7.4G | ~14.8G
CentOS: centosplus          | 979M | 1.4G | ~33.5G          | 1.1G | ~5.7G      | 1.1G | ~1.9G
CentOS: extras              | 1.4G | 1.5G | ~49G            | 1.4G | ~8.4G      | 1.4G | ~2.8G
CentOS: updates             | 6.5G | 7.9G | ~227.5G         | 7.3G | ~39G       | 7.2G | ~13G
EPEL: epel                  | 13G  | 15G  | ~455G           | 14G  | ~78G       | 14G  | ~26G
Puppet Labs: puppetlabs-pc1 | 2.3G | 2.4G | ~80.5G          | 2.3G | ~13.8G     | 2.3G | ~4.6G
TOTAL                       | 31G  | 36G  | 1,085-1,114.5G  | 34G  | 186-189.3G | 33G  | 62-63.1G

1.1.1.4   Puppet Implementation

  • modules:
    • ‘apache’, from Puppet Forge
    • ‘apache_config’, includes default config, firewall, and vhost
    • ‘pakrat’, includes base installation, wrapper, cron, and storage config
  • profiles:
    • ‘pakrat’, includes the pakrat module
    • ‘yum_server’, includes elements of apache_config
  • roles:
    • ‘pakrat_yum_server’, uses profile::pakrat and profile::yum_server (see the sketch below)
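
A minimal sketch of how that layering might look in Puppet code; only the module, profile, and role names come from the list above, and the class bodies are illustrative assumptions rather than the actual manifests:

  # Illustrative sketch of the role/profile/module layering described above.
  class profile::pakrat {
    include ::pakrat         # base installation, wrapper script, cron, storage config
  }

  class profile::yum_server {
    include ::apache_config  # default Apache config, firewall, vhost serving the repos
  }

  class role::pakrat_yum_server {
    include ::profile::pakrat
    include ::profile::yum_server
  }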

1.1.1.5   Daily Ops

  • Note: This should be fleshed out a little more in the near-term, as necessary. If we elect to stick with Pakrat long-term then we can expand it even more.
  • When/how to run the Pakrat repo sync?
    • The Pakrat repo sync wrapper script is installed at /root/cron/pakrat.sh.
      • It depends on a pakrat.config file in the same directory.
    • The wrapper script is run daily by cron at 4:25pm.
    • The wrapper script can also be run manually.
    • Resiliency/details (see the sketch after this list):
      • Repo snapshots are given a pathname that ends with the Unix epoch timestamp, so there should be no problem with running the script more than once per day.
      • The wrapper script exits if it detects that another instance is already running (in case issues with Pakrat/Yum under the hood would make simultaneous runs problematic).
  • How to add additional repos for Pakrat to sync?
    • Recommended procedures:
      • Establish the client configuration for the repository on the Pakrat-Yum server.
      • XXXXXXXXX
    • NOTE: If/when we start dealing more with GPG keys we will need to update this procedure slightly. See also LSST-1031 <https://jira.ncsa.illinois.edu/browse/LSST-1031>.
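
A rough sketch of the wrapper behavior described above. This is not the actual /root/cron/pakrat.sh; the lock path, the config-file contents, and the pakrat invocation are assumptions:

  #!/bin/bash
  # Sketch only -- mirrors the behavior described above, not the production script.
  CONFIG=/root/cron/pakrat.config   # repo definitions used by the wrapper (format assumed)
  LOCKDIR=/var/run/pakrat-sync.lock
  SNAPSHOT=$(date +%s)              # Unix epoch timestamp: unique snapshot path even for multiple runs per day

  # Exit if another sync appears to be running already.
  if ! mkdir "$LOCKDIR" 2>/dev/null; then
      echo "Pakrat sync already running; exiting." >&2
      exit 1
  fi
  trap 'rmdir "$LOCKDIR"' EXIT

  # Load the repo definitions and run one sync into the new snapshot.
  # PAKRAT_REPO_ARGS is a hypothetical variable supplied by the config file;
  # --repoversion is taken from Pakrat's README.
  . "$CONFIG"
  pakrat --repoversion "$SNAPSHOT" $PAKRAT_REPO_ARGS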

1.1.1.6   Improvements - High Priority

  • GPFS

    • overall:
      • size: Dan suggests ~50TB but look at baselining data from object-data06
        • synced daily for ~117 days leads to 50G of storage
      • location: Andy says just inside GPFS root for now; mkdir -p pakrat/production (just in case)
      • refactor Puppet code (apache_config) and Pakrat scripts to look for this location
      • implement GPFS code in Puppet to make sure it is mounted
    • add error checking into Pakrat script to handle case where GPFS is not available
    • after further consideration, probably best to back up to GPFS but still store on disk (what happens if GPFS is broken and our goal is to push out a patch...?)
  • create more verbose timestamp via wrapper so that we can run Pakrat multiple times a day if necessary

    • ran it twice in one day once (into the same snapshot) and encountered the errors described below for the elasticsearch-1.7 and influxdb repos
      • initially we thought they were related to running Pakrat twice into the same output repo path, but they persist on the regular weekly runs and after adding the Unix epoch timestamp to the repo paths
  • fix the following issue: packages with unexpected filenames do not appear in local Pakrat-generated metadata:

    • the particular metadata issue we are concerned about is as follows and (so far) only affects the elasticsearch-1.7 and influxdb repos:

      • results in errors in Pakrat output such as this:
        • Cannot read file: /repos/centos/7/x86_64/influxdb/2017-08-14/Packages/chronograf-1.3.0-1.x86_64.rpm
      • these errors correspond to the following scenario:

        • as listed in the *primary.xml metadata from the SOURCE repository

        • the version/release info in the ‘href’ attribute of the ‘location’ element does not match the version/release shown in the ‘rpm:sourcerpm’ and ‘rpm:provides’ entries:

          • rpm:sourcerpm (hard to imagine this is relevant)
          • rpm:provides - rpm:entry (e.g., rel=)
        • more specifically, the RPM filename does NOT have a release segment in it

        • e.g., ‘elasticsearch-1.7.0.noarch.rpm’ is the RPM and it does not have a release in its name (cf. *1.7.0-1.noarch.rpm), but the SOURCE metadata indicates it is release 1:

          • <rpm:sourcerpm>elasticsearch-1.7.0-1.src.rpm</rpm:sourcerpm>
            <rpm:header-range start="880" end="19168"/>
            <rpm:provides>
            <rpm:entry name="elasticsearch" flags="EQ" epoch="0" ver="1.7.0" rel="1"/>
            <rpm:entry name="config(elasticsearch)" flags="EQ" epoch="0" ver="1.7.0" rel="1"/>
            </rpm:provides>
        • Pakrat downloads the RPMs but does not include them in its local metadata (e.g., the only elasticsearch RPM that appears in Pakrat’s metadata is 1.7.4-1, because that is the only RPM that has a properly-formatted name, including the release)

          • thus they would be unknown to Yum clients going through Pakrat
    • possible fixes:

      • work with the vendor to release properly named RPMs
      • improve Pakrat to address this scenario (i.e., use the source metadata to fix its local metadata)
        • or is this an issue for the createrepo command?
      • see if Katello has the same issue or not
      • mv or cp (or make symlinks for) the badly named RPMs after Pakrat downloads them; this may ensure that Pakrat includes them in its metadata
        • we could probably script this fix: when a Pakrat sync uncovers one of these errors, look for the RPM without a release in its name and copy it to the filename the metadata expects, so that the next run can include it in its metadata (perhaps even schedule another run of the repo at the end); see the sketch after this list
        • if we start cleaning out old “snapshots” and RPMs that are no longer used, then we may also have to build a workaround into that process
          • although it’s possible that the worst that would happen is that after a clean out, several badly named RPMs are redownloaded during the next Pakrat sync
          • using symlinks may help us here:
            • register the targets of all symlinks ahead of the cleanup
            • only remove a target if you are also going to remove the symlink
  • find and implement additional repos

    • search /etc/yum.repos.d using xdsh
    • search for the following terms in Puppet:
      • yum
        • adm::puppetdb
        • base::puppet
      • rpm
      • package
      • tar
      • wget
      • curl
      • .com
      • .edu
      • git
    • sync all repos in Pakrat
    • redo Puppet implementation for Yum clients
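
One way the copy/symlink workaround proposed above might be scripted. This is purely illustrative: the log location, the error-message format, and the filename rewrite are assumptions that would need to be checked against real Pakrat output.

  #!/bin/bash
  # Illustrative sketch: for each "Cannot read file" error in a Pakrat sync log,
  # look for the same RPM without a release segment in its name and symlink it
  # to the filename the metadata expects. Log path and error format are assumed.
  LOG=/var/log/pakrat-sync.log

  grep -o 'Cannot read file: /repos/.*\.rpm' "$LOG" | sed 's/^Cannot read file: //' |
  while read -r expected; do
      dir=$(dirname "$expected")
      # e.g. chronograf-1.3.0-1.x86_64.rpm -> chronograf-1.3.0.x86_64.rpm
      actual="$dir/$(basename "$expected" | sed -E 's/-[0-9]+(\.[a-z0-9_]+\.rpm)$/\1/')"
      if [[ -f "$actual" && ! -e "$expected" ]]; then
          ln -s "$(basename "$actual")" "$expected"
          echo "linked $expected -> $(basename "$actual")"
      fi
  done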

1.1.1.7   Improvements - Low Priority (e.g., only if we adopt Pakrat as a permanent solution)

  • Apache:
    • move vhost stuff into Hiera
    • move firewall networks into Hiera
    • should I eliminate apache_config module? move all Hiera references and ‘apache’ module references into profile?
  • Pakrat:
    • move config (.config file, cron stuff) into Hiera
    • is my approach for installing OK?
      • how to handle the dependency that fails to install initially?
    • improve verification/notification/fix when Pakrat sync is broken
      • fix postfix for cron (this is a larger issue)
      • are we sure that cron scheduling via crontab (as opposed to file-based /etc/cron.d scheduling) will result in emails for any output? yes
    • how to know which RPM versions are included in each snapshot?
      • look at *-primary.xml.gz / *-other.xml.gz; zcat piped into some XML parser? (see the one-liner after this list)
    • document troubleshooting/monitoring for Pakrat
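
As a rough starting point for the snapshot-contents question above, something like the following would list the package files referenced by one snapshot’s metadata. The path follows the earlier example layout and is illustrative only:

  # List the package filenames referenced by one snapshot's metadata (sketch;
  # a real XML parser, e.g. Python's xml.etree, would be more robust).
  zcat /repos/centos/7/x86_64/base/2017-07-31/repodata/*-primary.xml.gz \
    | grep -o '<location href="[^"]*"' \
    | sed -e 's/.*href="//' -e 's/"$//' \
    | sort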

1.1.2   Katello

1.1.2.1   Overview

Katello <https://theforeman.org/plugins/katello/> is a plug-in for Foreman that is used to manage content, specifically local Yum and Puppet repositories. Katello is an integrated control interface and UI for Pulp (content management) and Candlepin (Red Hat subscription management). These products are all components of the Red Hat Satellite platform.

1.1.2.2   Decision to Not Use Katello (October 2017)

Areas where it possibly offers benefits or at least different features as compared to the alternative (Puppet w/ Git and Pakrat, then Foreman or xCAT):

  1. Integrated change control for Yum and Puppet.
  2. Ability to schedule releases of content.
  3. GUI for managing Yum repo syncing and management.
  4. Flexibility in managing which RPMs are offered in Yum repos.
  5. Ability to discard old Yum RPMs.
  6. Manages RHEL subscriptions.
  7. Handles syncing from Foreman/Katello ‘master’ to Katello ‘capsule’ (a Foreman Smart Proxy with Katello content services):

Reasons we have elected not to investigate Katello further at this time:

  • Install and design seems overly complicated.
    • You must install Katello before installing Foreman, then run the foreman-installer with a special flag in order to install Foreman for use with Katello (link <https://theforeman.org/plugins/katello/nightly/installation/index.html>).
    • Creates the need to consult both Katello’s documentation and Foreman’s documentation for some considerations.
  • The above features don’t seem to offer anything critical that we need and which we haven’t already solved with Pakrat and our current Puppet/Git change control process.
      1. We already have integrated change control, via Git, for Yum and Puppet. In fact, it’s not clear whether or not Katello’s state can be captured by Git.
      2. We don’t really need to schedule the release of content. Our focus is more likely to be on scheduling patching or allowing an NHC process to do rolling patching.
      3. A GUI is probably not necessary. Our Git/Puppet work is already done on the command line. We will likely investigate the Hammer CLI for Foreman as well.
      4. This is a little tricky with Pakrat, although presumably we could set certain RPMs to the side and recreate/edit metadata.
      5. We can generate a manual process for discarding old Yum RPMs from Pakrat, although it might not be worth it. Space is cheap.
      6. We do not currently use RHEL.
      7. We could set up a Yum-Pakrat ‘master’ and have each Smart Proxy/Yum-Pakrat slave sync from it.

In summary, it doesn’t appear that the benefits of Katello outweigh the extra complications it seems to present.

1.1.3   Other Considerations

If we ever decide that Pakrat seems lacking in some area we should consider Pulp <http://docs.pulpproject.org/> (which is used by Katello) and also survey the landscape to see if anything else is available besides Katello.

1.2   Yum Client Config and Puppet Best Practices

1.2.1   Overview

  • All of our nodes must be configured to look at our managed Yum repos:
    • during or immediately after deployment (by xCAT, Foreman, etc.)
    • before any attempts by Puppet or other actors to go out and get an RPM by running Yum
  • We need to implement other things in Puppet in such a way that they only use Yum to get RPMs.
    • Anything that is not an RPM should either be built into an RPM and hosted locally, stashed in Git, or hosted and versioned in some other way.
  • All needed Yum repos should be managed (ideally Puppet would disable or uninstall unmanaged repos).

1.2.2   Current Practice

  • the EPEL repo hostname is configured by a resource from the ‘epel’ module from Puppet Forge using Hiera
    • but where is the ‘epel’ module declared for each node? only in other modules that happen to cover all nodes?
  • extra::yum was created to manage other repos (CentOS and Puppet Labs) using the ‘file’ resource
    • also turns off delta RPMs
  • profile::yum_client was created to utilize the extra::yum manifest
    • all roles reference this profile
  • various other modules install repos using the ‘yumrepo’ resource type or by installing RPMs that install repos

1.2.3   Improvements - High Priority (these are needed whether we use Pakrat or Katello)

  • Yum:
    • stop managing yumrepo files and use one or both of the following (see the sketch after this list):
      • ‘yum’ module (3rd-party Yum module)
        • this might only be needed to manage other aspects of Yum configuration (e.g., turn off delta RPMs, throw out old kernels, etc.), beyond which repos are present, enabled, etc.
      • ‘yumrepo’ resource type
    • put all repo URLs and other data in Hiera
    • manage all repos that are needed, pulling updates from Pakrat/Katello
    • will we need to install/manage GPG keys? which repos use them (EPEL does but this is handled)? how about Puppet Labs, etc.? how do we manage them?
      • GPG keys are often installed by the RPMs that also install the .repo files, no (e.g., ZFS <https://github.com/zfsonlinux/zfs/wiki/RHEL-%26-CentOS>)?
      • files are placed in /etc/pki/rpm-gpg (could be hosted in/installed by Puppet) and then installed using a command like “rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux”
      • can the ‘yumrepo’ Puppet resource help with this? does the ‘yum’ Puppet module handle it better?
    • disable any unmanaged repos (or even uninstall files for unmanaged repos? which is better / easier)
      • can remove the xCAT provisioning repos after deployment:
        • xCAT-centos7-path0
        • xcat-otherpkgs0
      • the following repos can be removed from adm01:
        • centosplus-source/7
        • dell-system-update_independent
        • gitlab_gitlab-ce-source
    • document daily procedures for pointing Yum clients at specific snapshots (this is *probably* needed for Katello as well, but possibly not)
    • consider explicitly including the epel module in profile::yum_client
  • Other Puppet refactoring/updates:
    • anything that requires a pkg MUST also require the appropriate Yum resources / EPEL module, etc. so that any managed repo is configured first; update and document
  • xCAT (or Foreman)
    • install basic Yum config (CentOS, Puppet Labs, EPEL at a minimum); kind of a belt and suspenders thing, just in case some Puppet thing would otherwise sneak in an external RPM
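
As a concrete sketch of the ‘yumrepo’-plus-Hiera direction above: the resource attributes are standard parameters of Puppet’s built-in yumrepo type, but the Hiera keys, server hostname, and snapshot value are placeholders.

  # Illustrative only: one repo managed via the yumrepo type with data from Hiera.
  # The lookup keys and hostname below are made up for the example.
  $yum_server = lookup('profile::yum_client::server')    # e.g. 'yum-server.example.org'
  $snapshot   = lookup('profile::yum_client::snapshot')  # e.g. '2017-07-31'

  yumrepo { 'base':
    descr    => "CentOS 7 base (Pakrat snapshot ${snapshot})",
    baseurl  => "http://${yum_server}/repos/centos/7/x86_64/base/${snapshot}/",
    enabled  => 1,
    gpgcheck => 1,
    gpgkey   => 'file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7',
  }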

2   Foreman

Purpose and Background

ITS is already using this (for non-LSST resources) for Puppet ENC and reporting.

Security is using this for their machines (largely VMs).

Investigation on LSST Test Cluster

Foreman is being installed on lsst-test-adm01. More info:

  • Foreman Feature Matrix and Evaluation (Section 2.1 below)
  • Foreman on test cluster (internal wiki page)

Resources

project website: theforeman.org <https://theforeman.org/>

slideshare: Host Orchestration with Foreman, Puppet and Gitlab <https://www.slideshare.net/tullis/linux-host-orchestration-with-foreman-with-puppet-and-gitlab>

2.1   Foreman Feature Matrix and Evaluation


2.2   Overview

The purpose of this page is to help us enumerate the features of a Foreman-based solution vs. an xCAT-based solution to deployment and management of nodes. It may pay to consider a hybrid solution, namely a Foreman-based solution that also uses pieces of xCAT (or Confluent).

NOTE: We also need to indicate which of the listed features are requirements. Some may not be.

2.3   Feature Matrix

 Priority key:

3) requirement - must have this or we cannot deliver for the project and/or common/critical admin tasks would be hopelessly inefficient

2) very helpful to have - not a requirement but would increase admin efficiency considerably around a common task, decrease risk, or harden security further

1) somewhat helpful to have - not a requirement but would increase admin efficiency in a minor fashion

0) not needed - not necessary and of little usefulness, to the point that it is not worth the time

?) unknown

Feature Priority xCAT-oriented Foreman-oriented
Deployment    
DHCP for mgmt networks 3 Yes - tested Yes - tested
PXE & TFTP 3 Yes - tested both Dell and Lenovo

Preliminary yes

  • tested Lenovo (believe we had to change one or more BIOS settings to get machine to boot after install and/or to PXE boot)
  • test Dell
Anaconda installs for CentOS: kickstart, partition, etc. 3 Yes - meeting our needs so far

Preliminary yes

  • may need to test more customization

Support for other distros or OSes

  • we may need to support a handful of Windows machines (e.g., AD), likely VMs
???

Other NCSA clusters are using RHEL w/ xCAT.

Should support others, including (apparently) Windows.

  • anything to investigate / test? not now

Should support others, including (apparently) Windows (via vSphere templates)

  • anything to investigate/ test? not now

Deploys ESXi on bare metal

  • should be infrequent and only involve a relatively small number of machines
1

Yes, appears to install ESXi on bare metal (xCAT wiki)

  • investigate further/test

Yes, appears to install ESXi on bare metal (Foreman wiki)

  • investigate further/test

Local DNS for location-specific mgmt and svc networks

  • do we need this? or could we / must we rely on external DNS + /etc/hosts?
???

Yes, although we haven’t been using

  • investigate/
    test?
Yes - tested

Manage DNS hosted on external system (e.g., make local DNS authoritative or have mgmt system interact with external DNS via an API)

  • do we need this? probably not but might be nice for internal networks
1

Probably not.

  • investigate?
  • test?

Possibly...but needs investigation

  • investigate/ test?
Bare-metal deployment 3 Yes - tested Yes - tested

OS deployment to VMs

  • i.e., we have a VM that is manually provisioned or was provisioned using xCAT or Foreman, now we need to install an OS on it (e.g., via PXE + kickstart as w/ bare metal)
2

Yes, but not yet tested

https://sourceforge.net/p/xcat/wiki/XCAT_Virtualization_with_VMWare/

  • investigate PXE booting pre-provisioned VMs
  • investigate other options?

Yes, but not yet tested

https://theforeman.org/manuals/1.15/#5.2.9VMwareNotes

  • investigate PXE booting pre-provisioned VMs
  • investigate other options?
Provisioning of VMs within VMware 1

Yes, but not yet tested

https://sourceforge.net/p/xcat/wiki/XCAT_Virtualization_with_VMWare/

  • investigate integration with VMware to provision VMs
    • has access to what it needs and only what it needs?
    • other security concerns?

Yes, but not yet tested

https://theforeman.org/manuals/1.15/#5.2.9VMwareNotes

  • investigate integration with VMware to provision VMs
    • has access to what it needs and only what it needs?
    • other security concerns?
    • remote (other sites/datacenters) provisioning via VMware? i.e., how does the Foreman master provision resources in a remote location? does it talk to a local vSphere which then handles the provisioning?
Provisioning of cloud resources (e.g., AWS EC2, GCE, etc.) ???

Not really; the xCAT documentation recommends using Chef to interact with these resources.


Some support (manual provisioning with image-based deployment of the OS).

Diskless install / stateless nodes

  • do we need this?

    • 2017-12-18 Meeting Notes: Batch Production Services

    • LDM-144:

      need input into what stateless nodes, etc will look like

??? Yes, using in various NCSA clusters

Unsure... It seems possible (just PXE-boot from the desired boot image rather than an Anaconda-based install image), but there don’t seem to be any specific how-tos or tutorials on this, and no sign that anyone who has asked about it has gotten detailed help.

  • investigate/ test?
  • we could build an image with xCAT and boot nodes from it with Foreman

Node discovery (w/o interacting with switches)

  • we don’t install nodes all that often; it is possible to discover mgmt MACs via PXE log entries then configure BMCs from OS (on Dell via dtk, possibly also Lenovo)
  • on the other hand it’s not clear how efficient collaboration w/ local “boots on the ground” will be for deployments in Chile
2

Yes, but haven’t pursued enough to get it to work

  • investigate/ test further?

Offers this feature (Discovery Plugin <https://theforeman.org/plugins/foreman_discovery/9.1/index.html>), but not tested

  • investigate/ test?

Switch-based discovery (i.e., SNMP query of switches)

  • we don’t install nodes all that often; it is possible to discover mgmt MACs via PXE log entries then configure BMCs from OS (on Dell via dtk, possibly also Lenovo)
  • on the other hand it’s not clear how efficient collaboration w/ local “boots on the ground” will be for deployments in Chile
1

Yes

  • investigate/ test further?

No?

  • investigate/ test further?

Configure Ethernet switch ports

  • not even sure NetEng would allow us to do this
0.5

Yes?

  • xCAT docs: switch management
  • investigate/ confirm?

No?

  • investigate/ confirm?
BMC/firmware management   A strong focus of xCAT.

Need to investigate what the BMC Smart Proxy offers us.

Also investigate how we can use IBM/Lenovo Confluent (the next generation of xCAT) with Foreman.

Remote power 3 Yes - rpower
  • investigate SmartProxy BMC feature
  • investigate Confluent
Remote console and console capture 3 Yes - xCAT’s rcons and conserver
  • investigate SmartProxy BMC feature or other Foreman options
  • investigate Confluent
Manage BIOS settings out-of-band (ideally w/o reboot) and programmatically 3

Yes - Lenovo: xCAT’s pasu, but sometimes requires a reboot

Yes - Dell: must use racadm, probably with a wrapper

  • investigate SmartProxy BMC feature or other Foreman options
  • investigate Confluent

Install firmware outside of OS

  • on Lenovo we have not yet found a way to do this outside of the OS, we have to PXE boot the node
    • then again, what is the difference between installing from the Genesis kernel and installing from the booted OS?
  • could be useful in general since it allows us to install firmware even if there are local disk problems or w/o modifying an install on local disks
3

Lenovo: supported via xCAT Genesis boot + Lenovo onecli

  • Dell?
  • investigate SmartProxy BMC feature or other Foreman options
  • investigate Confluent
  • adapt xCAT’s Genesis boot approach and/or Industry’s firmware approach
Integration w/Puppet 2

Not integrated...

  • xCAT installs Puppet
  • BYO ENC
  • Puppet module for xCAT (out-of-date)

...However, the main thing missing right now is better Puppet reporting, although in theory this is already available in NPCF via centralized logging and is being looked at via our monitoring stack.

High level of integration with Puppet; provides:

  • Foreman is installed via/alongside Puppet

  • ENC - tested

  • Puppet logging

    • look closer at this
  • further investigate management of distributed Puppet infrastructure

    • Puppet Master

    • Puppet CA

      • cert signing and revocation
    • other high availability considerations?

Yum repo hosting/ management 3

Pakrat:

    • we have a number of minor issues to investigate
    • implement syncing from master to remote servers
  • investigate Pulp?

Pakrat (or perhaps Pulp/Katello)

    • we have a number of minor issues to investigate
    • implement syncing from master to remote servers
  • investigate Pulp?

  • investigate Katello?

    • integrated with Foreman and likely handles syncing
    • Jake feels it’s not worth looking at right now

Distributed architecture and scalability

  • Will our nodes all have “public” interfaces or at least be able to “NAT out” to reach remote management resources?

Allows for distributed management via Service Nodes:

https://xcat-docs.readthedocs.io/en/2.13.8/advanced/hierarchy/index.html

  • this seems like a somewhat nonstandard configuration (we don’t seem to be using it at NCSA anyway)

  • handles subnetting for management networks via the “setupforward” setting

  • but definitely does NOT seem set up for distribution across WAN

  • in other words, we’d need to have a full xCAT master for each datacenter (specifically for each management network)

Allows for distributed management via Foreman Smart Proxies:

https://theforeman.org/manuals/1.15/#1.Foreman1.15Manual

  • this is front and center with Foreman (it is described in the very first part of the Foreman manual)

Foreman Master controls deployments (DHCP, local DNS, TFTP)

  • verify usability with remote nodes that have no public address and no NAT capability (may be an artificial constraint; nodes should probably be able to connect outside for SSL CRL, etc.)
Central execution of remote deployments / central updating of node settings on remote deployment infrastructure (i.e., configure deployment settings on a master deployment server at NCSA to affect how a node deploys in Chile, handle things like DHCP, PXE, kickstart, etc.) 1

No, does not seem to support this out of the box (doesn’t support remote infrastructure at all)

  • investigate custom syncing / updating of xCAT configuration across WAN?

  • to be clear, w/ xCAT we’d need to log into a different xCAT master for each datacenter unless we do something custom

Yes, definitely handles updating node settings (stored in Foreman Master)

  • verify remote node deployment across sites: DNS, DHCP, PXE, kickstart w/ remote Foreman Smart Proxy
  • investigate client enrollment to local Puppet Master / Puppet CA during initial deployment (without node connectivity to Foreman Master)
    • local Puppet Master

Central management of remote deployment infrastructure (across WAN) (i.e., how do we keep remote deployment servers up-to-date)

  • we are likely to at least have local DHCP/kickstart servers
2

No, does not seem to support this directly

  • investigate method of syncing content/ settings for deployment servers across WAN
  • investigate remote syncing of Puppet repos

A little bit...?

  • there are nice Puppet modules for managing Foreman Master / Foreman Smart Proxy that could help for updating server settings at least
  • but it does seem like at a fundamental level each Smart Proxy is installed and configured independently
  • investigate method of syncing content (e.g., images/source repos) for deployment servers across WAN
  • investigate remote syncing of Puppet repos

Initiate IPMI/firmware/hardware management commands on remote machines from a central location (e.g., set to PXE, reboot, install firmware, configure BMC, etc.)

  • alternative is to log into a remote management server and execute there
2

No, does not support this out of the box

  • investigate custom setup for executing remote IPMI/BMC commands

Maybe...

  • investigate IPMI to initiate PXE (BMC Smart Proxy, etc.) across sites
  • investigate other remote IPMI/BMC commands (BMC Smart Proxy, etc.)

Distributed Puppet architecture

  • We only strictly need this if it’s determined to be necessary from a security perspective or if nodes have no “public” interface and cannot NAT out.

  • Puppet repos need to be pulled from same Git or synced from authoritative repo.

  • Can we have a centralized Puppet CA or do we need it to be local?

3 or 1

xCAT-based solution offers no assistance here but it should all be possible.

  • Local ENC or sync ENC between Puppet Masters.

A Foreman-based solution may make some of this easier:

  • If Foreman-based Puppet ENC works even when Foreman Master is unavailable, then that is a plus.

  • Foreman installer might make setup of Puppet CA vs Puppet Master somewhat easier (or at least offer a template).

  • We could further investigate what (if anything) Katello has to offer in this area, e.g., w/ Puppet repository/ module management.

Distributed environments can operate during a WAN cut

  • Previously deployed machines can continue to operate.

    • This has little to do with which deployment solution we pick. The main consideration is: can machines continue to work even if Puppet cannot contact its master? As such it has more to do with whether or not we need a distributed Puppet architecture.

    • Other considerations, such as whether we need local DNS, NTP, SSL CRL, LDAP, etc., are not strictly about deployment and are more or less independent of which deployment system we choose.

  • Does not mean that we can initiate new deployments (how could we conceivably?), although it’d be nice if one that was in progress would continue (hence 3 or 2).

3

Yes, but investigate Puppet (esp. ENC and CA).

  • With our current ENC it probably makes sense to have it located on each local Puppet Master (but w/ syncing from a central source) so that the local Puppet Master can continue functioning.

  • What about Puppet CA? If we have a single Puppet CA, does a local Puppet Client-Puppet Master session work w/o being able to contact the remote Puppet CA?

Yes, but investigate Puppet (esp. ENC and CA).

  • Does Foreman’s Puppet ENC continue to operate during WAN cut?

    • If not, verify that we can instead use our own ENC rather than Foreman’s
  • What about Puppet CA? If we have a single Puppet CA, does a local Puppet Client-Puppet Master session work w/o being able to contact the remote Puppet CA?

PXE over WAN

  • Not super useful as it still requires local DHCP. It would just save us needing to have local installation repo/image.
  • Does NOT include kickstart communication itself (next topic).
1 No, xCAT does not seem to support PXE over WAN.
  • Investigate further?

Local kickstart server or encryption of kickstart communication

  • Kickstart files often contain sensitive information so kickstart communication should be encrypted or remain local.
  • Encryption of kickstart communication may be possible (w/ RHEL, maybe CentOS) but it would be nonstandard w/ respect to both xCAT and Foreman.
3 Yes, each xCAT master would be local. Yes, Foreman has a “Templates” Smart Proxy feature that supports distributed kickstart sources.
Other security considerations (encryption of other command data across WAN; authentication/authorization; etc.) 3
  • Security of any custom remote IPMI solution we create, if applicable.
  • Security of any custom content (Puppet, Yum, images, etc.) syncing system we create, if applicable.
  • Overall Security vetting of whatever distributed setup we create.
  • Security of remote IPMI solution / BMC Smart Proxy.
  • Security of any custom content (Puppet, Yum, images, etc.) syncing system we create, if applicable.
  • Overall security vetting of Foreman.
Scalability 3

Yes, an xCAT-based solution should be able to scale to meet our needs.

  • To reiterate, we’d be using separate xCAT masters for each datacenter (or more) and then setting up distributed Puppet apart from / in addition to that.
    • xCAT is further scalable within a datacenter via the use of service nodes.
  • We’ve heard that the console server may not scale (e.g., didn’t seem to work for iForge). Multiple xCAT masters could take care of that, however.

Yes, a Foreman-based solution should be able to scale to meet our needs.

  • NOTE: Sounds like Foreman, Puppet CA + Smart Proxy, Puppet ENC, & Reports on one machine w/ 1,000 nodes could be pushing it a bit. Moving to high availability at or before that point is advised. (“HA case study”)
  • How does Foreman’s BMC Smart Proxy feature scale?
  • How large is Security’s fleet and what kind of load do they put on Foreman in terms of deploying machines?
Reliability   Yes, seems solid overall as evidenced by previous use at NCSA, including LSST.

Probably...

  • Ask Security.
  • Ask ITS.
Ability to backup and restore 3
  • See XCAT Installation Guide Backup and Restore section

    • Somewhat unclear if this encompasses everything needed for DHCP, DNS, etc. (maybe just run makedhcp, makedns, makehosts, etc. after recovery). Definitely does not include the /install directory.
    • See also “information on xCAT high availability” for other backup and storage considerations.
  • Test.

  • See Foreman Manual Backup, Recovery, and Migration section
  • Test.

High availability - is this necessary?

  • Production nodes should not depend on deployment and management infrastructure (see discussion about Puppet, above).

  • But if we are running local DNS servers that are tied to our management nodes then it matters.

3 or 1 Possible roadmap: information on xCAT high availability <http://xcat-docs.readthedocs.io/en/stable/advanced/hamn/index.html>

Possible roadmap: HA case study <https://theforeman.org/2015/12/journey_to_high_availability.html>

  • A bit complicated though (involves memcache servers)?
Interface / workflow / ease of use      

Reporting/central logging

  • Note our monitoring/ logging stack should take care of this to a certain degree, but having it more integrated with the management/ deployment system could be nice.
1 Yes. Adequate logging including console logs. Yes. Also includes centralized reporting console for Puppet.

Support for change control: Git integration, rollback, and auditing procedures

  • Not sure to what degree the project will require this.
3 or 2

No Git-integration by default, but we could easily customize.

  • Use tabdump to import/export tables to text and integrate w/ a Git workflow. Possibly build wrapper commands to execute changes to tables.

No built-in undo.

Auditing may be less than desired since we tend to do everything as root in xCAT.

No Git-integration by default. Custom functionality may be harder to implement and enforce.

  • Export configs from DB to text periodically and import into Git?

No built-in undo.

Has decent auditing of actions performed via the Foreman master (likely includes CLI), and may display executing user effectively (esp. in web UI; not sure about CLI, etc.)

  • Further evaluate auditing via various interfaces?
Overall ease of use / efficiency 2
  • CLI is very responsive.
  • Table layout takes some time to understand.
  • Evaluate CLI, APIs, etc.
  • GUI has seemed rather slow so far but perhaps it can be made more responsive.

Specifically: ease of (re)deploying the OS on a node

(incl. Puppet ENC, NICs, disk partitioning)

2
  • We have this worked out pretty well for our nodes at NPCF.

  • xCAT-gen’ed kickstart files suit us fairly well for the most part. Disk provisioning is fairly smart.

  • Seems that it would require more direct manipulation of kickstart files, especially up front. It doesn’t appear that Foreman gives you as much “for free.”

Specifically: ease of configuring new hardware (i.e., modifying BIOS settings, other firmware, possibly “discovery” process)

  • We don’t install new hardware all that often.
1
  • Evaluate node discovery.
  • Can we install Dell firmware via Genesis boot as well?
  • Evaluate node discovery.
  • Can we install firmware in the discovery environment or via some PXE image?

Command-line interface (and other scriptable APIs)

  • Automation and integration depends on a CLI or some kind of API.
3 Extensive and fairly well developed CLI.
  • Investigate and evaluate CLI.
  • Evaluate other API(s).
GUI admin console 1

No...

  • Well, maybe Confluent.

Yes.

  • Has LDAP integration.
    • Can we secure with two-factor or just require SSH with X11 forwarding from a bastion node?

Granular permissions (levels of access, buckets of resources)

  • Not sure about project requirements around this.
  • Regardless, it might be nice to allow certain non-admins the ability to view the high-level configuration (e.g., Puppet role/site/datacenter/cluster) of some/all nodes.
  • Our monitoring stack might be able to provide some visibility into high-level configuration as well.
3 or 1

Not built in.

  • Create custom script to view Puppet ENC? Or build view of high-level config into Puppet monitoring stack?

Yes, but need to evaluate further if this is important.

  • Evaluate further?

Specifically: Allow developers to reprovision specific groups of machines

  • Jim Parsons is interested in this; not sure it’s an actual requirement nor how often it would be used.
  • Do we simply provide a pool of development machines w/ separate deployment management infrastructure to which (certain) developers have more/full access?
3 or 1

Not built in.

  • Create limited but privileged rebuild scripts for specific groups and/or targeted sudo config?

Seems to be built in.

  • Evaluate further?

Notifications

  • Monitoring stack should take care of this.
  • What kind(s) would we want?
1 No, does not seem to be built in.

Yes, seems to be built in.

  • Evaluate further?
  • Bill uses Foreman for ITS and remarks that Puppet report notification emails are only sent to the “owner” of the machine.
Documentation and support 2

xCAT documentation is decent (both comprehensive and specific, although there seem to be quite a few new features that are not yet documented).

xcat-user list on SourceForge has been reasonably useful.

Current vendor relationship with NETSource/Lenovo allows us somewhat privileged access to the xCAT team.

NCSA is already using xCAT for Systems (Industry and ICC in addition to LSST) and has a few team members with extensive experience with xCAT.

Foreman documentation is decent (although it is a really big product and the documentation sometimes lacks specificity and/or concrete examples).

The foreman-users Google group had about 2.5 times more messages than the xcat-user list in a representative time frame (the Google group is now defunct).

Using Red Hat Satellite (Foreman + Katello & more) might get us support, but it would almost certainly require using RHEL and incur additional cost.

NCSA is already using Foreman for ITS (basic UI/Puppet reports & ENC only, so far) and Security (more extensive use, including Katello). Security’s person with the most experience recently left.

2.4   Summary Evaluation

Both products—xCAT and Foreman—or a combination of the two would seem to meet our needs at a fundamental level. In any case we’d be using the product(s) for IPMI functions, (possibly) bare-metal discovery / VMware provisioning, and PXE-boot OS installs with as minimal a configuration as possible, with Puppet handling as much of the configuration as possible.

Foreman is a newer tool but seems to have broader functionality and appears to have a larger user community. It also appears to be a more complex tool, which could lead to greater management overhead.

Foreman also appears to offer better out-of-the-box support for a distributed architecture with centralized control and secure communication between the deployment servers. On the other hand, pursuing a more centralized point of control would likely push us more strongly towards high availability of the central/master resources, which could introduce even more complexity/management overhead.

The actual design and implementation of our solution, or future shifts in our design/implementation, may be influenced by a few outstanding questions about project requirements and architecture (e.g., will we need to support stateless nodes? will we need to manage DNS with our solution? will we need to offer role-based access to admins or the capability for non-admins to view/update configuration? will we need to support cloud resources?).

2.5   Addendum 1: Possible end states

(1) Use current NPCF model (xCAT for deployment and IPMI functions, Puppet for configuration management, Pakrat for Yum repo management, new monitoring stack, possibly Confluent for IPMI functions)

(2) Same + use Foreman for Puppet integration (ENC, reporting, certificates) alongside xCAT, etc.

  • It may not be possible for xCAT and a Foreman Master to live on the same server. By default a Foreman Master includes a TFTP server, as does xCAT, and their settings according to /etc/xinetd.d/tftp seem to conflict. We could ask online to see if it is possible to install a Foreman Master without TFTP. Also see the Foreman Manual for customization of TFTP <https://www.theforeman.org/manuals/1.16/index.html#4.3.9TFTP>.
  • If pursuing (2) it might make sense to have general admin/xCAT/IPMI/bastion functions on one node and Foreman/Puppet (CA, Master, ENC, reporting)/GitLab on another node.
  • Our GitLab on lsst-adm01 uses PostgreSQL as does Foreman (by default). Handle Foreman + GitLab with care.

(3) Same + use Foreman for node deployment (DHCP/PXE, kickstart, possibly DNS) instead of xCAT (still use xCAT/Confluent for IPMI functions, Pakrat for Yum repo management).

(4) Same + use Foreman BMC Smart Proxy for IPMI functions (still use Pakrat for Yum repo management)

NOTE: (2), (3), and (4) also offer the possibility of using Katello for Git/Puppet branch management and/or Yum repo management.

  • We could also look at using Katello components (esp. Pulp) directly w/ (1), (2), (3), or (4).

2.6   Addendum 2: Other considerations for making a decision

  • We would save some time up front by going with (1) because we’re basically already there with NPCF.
    • There are quite a few improvements we should make, however.
    • And we should rebuild our current xCAT/Puppet master/management node (lsst-adm01) at some point. Do we want to rebuild more-or-less as-is or rebuild with Foreman, whether (2), (3), or (4)?
  • By sticking with (1), merging NCSA 3003 into a shared environment can be a stronger focus more immediately (and there are many benefits to getting this done sooner rather than later).
    • Standing up another xCAT master for NCSA 3003 would take very little time and would offer a good opportunity for refining our backup/rebuild procedures for our xCAT master at NPCF.
  • (2), (3), and (4) could be pursued later on (with more awareness of both project requirements and of Foreman) and also pursued incrementally, e.g., (1)->(2)->(3)->(4)->....