Hilary Oliver, Oliver Sanders, Dave Matthews
3rd ENES Workshop on Workflows, 13 September 2018
Cylc development is part-funded by ESiWACE
Exascale Workflow?
Cylc (Re-)Introduction
Exascale Challenges
Meeting the Challenges of Exascale
Introduction to Rose
Exascale Workflow
What do we mean by this???
Hilary Oliver NIWA (NZ)
Exascale Computing
early-mid 2020s?
Aurora - Argonne Lab USA, Cray, 2021?
1 exaflop = a quintillion (a billion billion) flops
approx. processing power of the human brain?
challenges?
hardware (and power 40MW = $US 40M/year)
software scaling
workflow?...
Exascale Workflow?
(we've got this covered)
Even if the primary purpose of these machines is massive exascale models, much of their time will be spent running many smaller models (e.g. ensembles in our business) and processing the vast amounts of data they generate.
TERMINOLOGY: "suite" == "workflow"
workflow automation must be at the heart of utilizing these massive resources
(Credit Keir Bovis: Met Office exascale programme scope)
Cylc (Re-)Introduction
The Cylc Workflow Engine
Hilary Oliver NIWA (NZ)
cycling workflows
what Cylc is like
Cycling workflows:
what are they
why do we need them?
Then a quick overview of what Cylc is like.
domain of applicability
workflow definition
architecture
user interface
repeat cycles...
...there's inter-cycle dependence between some tasks...
...which is technically no different to intra-cycle
dependence...
...no need to label cycle points (boxes) as if they have global
relevance...
... so you can see this is actually an infinite single workflow
that happens to be composed of repeating tasks
The key to an animation of how Cylc manages such an infinite workflow: Cylc's dynamic cycling mode.
What's cycling needed for?
clock-limited real time forecasting
short chunks of a long simulation
steps in some iterative process (e.g....)
processing datasets as concurrently as possible
pipelines
dynamic cycling is not strictly needed for small, short
workflows.
historically achieved (NWP) with sequential whole cycles.
Catch-up from delays much faster if inter-cycle dependence is
explicitly managed.
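As a minimal sketch of dynamic datetime cycling (Cylc 7 suite.rc syntax; the task names and cycle interval are illustrative), a 6-hourly forecast cycle with explicit inter-cycle dependence can be written as:

[scheduling]
    initial cycle point = 20180913T00
    [[dependencies]]
        [[[PT6H]]]
            graph = """
                get_obs => model => products
                model[-PT6H] => model  # explicit inter-cycle dependence
            """

Because only the model-to-model dependence spans cycles, obs and product tasks from later cycles can run as soon as their own dependencies are met, which is what makes catch-up from delays fast.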
aside: pipelines
(Animation: the frames of this diagram successively highlight A, then B, then C, showing consecutive datasets flowing through the pipeline concurrently; one representative frame is kept below.)
%% Pipeline of three processes in series (mermaid markup)
graph LR
A(process A)
B(process B)
C(process C)
A-->B
B-->C
classDef A1 fill:#d7d7d7,stroke:slategray,stroke-width:5px;
classDef B2 fill:#d7d7d7,stroke:slategray,stroke-width:5px;
classDef C3 fill:#d7d7d7,stroke:slategray,stroke-width:5px;
class A A1
class B B2
class C C3
[scheduling]
    [[dependencies]]
        [[[P1]]]
            # Pipeline with explicit inter-cycle dependence: each process
            # also waits on its own previous-cycle instance.
            graph = """
                A => B => C
                A[-P1] => A
                B[-P1] => B
                C[-P1] => C
            """

[scheduling]
    [[dependencies]]
        [[[P1]]]
            # Pure pipeline: without inter-cycle dependence, successive
            # cycles are free to overlap.
            graph = "A => B => C"
Not currently well suited to very large numbers of small
quick-running jobs (later...)
A workflow is primarily a configuration of the workflow engine,
and a config file is easier for most users and most use
cases than programming to a Python API. However, ...!
Three cycles of a small deterministic regional NWP suite: obs processing tasks in yellow; the atmospheric model in red, plus DA and other pre- and post-processing tasks. A few tasks ... generate thousands of products from a few large model output files.
... ~45 tasks (3 cycles)
... as a 10-member ensemble, ~450 tasks (3 cycles)
... as a 30-member ensemble, ~1300 tasks (3 cycles)
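Scaling a suite up to an ensemble like this is typically just a matter of parameterizing the relevant tasks. A minimal sketch (Cylc 7 parameterized tasks; names are illustrative):

[cylc]
    [[parameters]]
        member = 1..30
[scheduling]
    [[dependencies]]
        [[[PT6H]]]
            # <member> expands each task into 30 parallel instances
            graph = "get_obs => model<member> => products<member>"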
Cylc currently scales to 10ks of tasks (with caveats)
Tested sans GUI at 50k tasks (need lots of RAM)
Fine for "normal" large ensemble suites
BUT a problem has emerged: incremental forecast-time
ensemble verification and product-generation:
multi-dimensional (nested) cycling
many small tasks ... with non-trivial dependencies
automatic reconfiguration of huge suites
(caveats: amount of config per task; amount of runahead; server
resources; GUI, esp. graph view)
Met Office IMPROVER: 100k jobs per cycle.
Handling this kind of suite better leads to applicability in other
domains: massive numbers of small tasks PLUS complex workflow
and production capabilities.
1 member, 2 files/fc-hour for 10 fc-hours, 1 task/file: 1 x 2 x 10 x 1 = 20 jobs per cycle ...
30 members, 7 files/fc-hour for 10 fc-days, 5 tasks/file: 30 x 7 x (24 x 10) x 5 ≈ 250,000 jobs per cycle!
5 tasks per file: i.e. each of the (expanded) pink nodes is a sub-workflow
Problem is the parameterized "nested cycle" over forecast hour.
To handle this NOW, we need:
a lot of bunching of small jobs, or
sub-suites: super-efficient dynamic cycling, but:
30+ suites (per cycle) instead of 1 suite!
log directory housekeeping?
start-up, kill, restart, failure recovery etc.?
either way: too many log files - 5+ per job
(FS and resource manager also bottlenecks now)
NOTE: current HPC infrastructure can't handle this yet either
these tasks totally dominate the workflow
Cylc can have multiple "top-level" dynamic cycling sequences in one suite, but nested cycles still have to be parameterized
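A sketch of what that parameterization looks like (Cylc 7 syntax; the parameter names, ranges, and task names are illustrative):

[cylc]
    [[parameters]]
        member = 1..30
        fchr = 0..240   # the "nested cycle" over forecast hour
[scheduling]
    [[dependencies]]
        [[[PT6H]]]
            graph = "model<member> => proc<member,fchr> => collate<member>"

Unlike a true dynamic cycle, every <fchr> instance is a distinct task in the suite definition, which is what blows up the task count.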
BUT actually it's worse than this!
Real ensemble post-processing systems may:
aggregate output from multiple model suites
have complex verification + product-generation workflows that are
not bunching-friendly
require automatic (or even dynamic) structural re-configuration
to swap different product modules (e.g.) in and out
not bunching-friendly: complex dependencies within the bunch
Plans to address these issues:
general efficiency gains
event-driven "spawn-on-demand" scheduling
a scheduler kernel for lightweight suite-in-a-job execution, without all the job log files of current sub-suites
Sketch: dependencies expressed as task inputs and outputs:
foo:
    out: a
bar:
    in: a
    out: b
baz:
    in: a, b
    out: c
Compute resource dependency?
Others?
Scaling With Dependencies
Cylc can currently scale to tens of thousands of tasks and
dependencies
But there are limitations, for example:
Many-to-many triggers result in N x M dependencies.
Cylc should be able to represent this as a single dependency.
The scheduling algorithm currently iterates over a "pool" of tasks.
We plan to re-write the scheduler using an event-driven approach.
This should make Cylc more efficient and flexible, solving problems like this.
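For illustration (Cylc 7 family triggers; names are illustrative), a many-to-one trigger can already be written compactly, although internally it still expands into one dependency per family member:

[scheduling]
    [[dependencies]]
        graph = "ENS:succeed-all => verify"
[runtime]
    [[ENS]]
    [[m1, m2, m3]]
        inherit = ENS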
Kernel - Shell Architecture
Working towards a leaner Cylc, we plan to separate the codebase into a Kernel-Shell model.
(Diagram: components such as user commands, suite configuration, the scheduler, and job submission, divided between a Shell and a Kernel.)
Combining multiple jobs to run in a single job submission.
Arbitrary Batching
A lightweight Cylc kernel could be used to execute a workflow within
a job submission.
The same Cylc scheduling algorithm
No need for job submission
Different approach to log / output files
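A purely hypothetical sketch of what this might look like from a suite author's point of view (the "cylc kernel" command below does not exist; it is invented here for illustration):

[runtime]
    [[post_proc]]
        # Hypothetical: run a lightweight scheduler kernel inside this
        # single batch job, with no per-task job submission and no
        # per-task log files.
        script = cylc kernel run sub-workflow.rc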
Future Challenges
Container technology
Computing in the cloud
File system usage
Introduction to Rose
Storage and configuration of complex Cylc workflows
David Matthews Met Office (UK)
History
Development started in late 2011, around the time we chose Cylc as our workflow tool
Motivation
GUI for application configuration
Suite version control & identification
Enable support of multiple workflow tools
Cylc may not have worked out?
Collaborators might require other tools?
Provide essential features which were outside of the Cylc vision
Tools which have moved / will move into Cylc
Log file retrieval
Email notifications
rose bush - log file viewer
rose suite-run - suite installation
rose host-select -> cluster support
rose bunch - run multiple small tasks in one job
Housekeeping
Tools which will remain in Rose
rose edit / task-run - Application configuration
rosie - suite storage and discovery
(plus a few smaller, more specialised utilities)
Application configuration: The problem
Tasks may be complex to define, requiring
setting environment variables
input files (containing namelists or otherwise)
scripts
etc.
Rose applications provide a convenient way to encapsulate all of this
configuration, storing it all in one place to make it easier to handle
and maintain.
In suite.rc:
[runtime]
    [[hello]]
        script = rose task-run
This will run an app config named "hello" (by default)
An app config is a directory containing a file named rose-app.conf
Simple ini style format (similar to Cylc suite.rc)
Describes:
command to run
environment variables
input files containing namelists
files to install (copy, link, export from Subversion)
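For example, a minimal rose-app.conf might look like this (all names and values are illustrative):

[command]
# The command that "rose task-run" executes for this app
default=run_model.exe

[env]
# Environment variables to set for the command
NPROC=4

[namelist:run_config]
timestep=600

[file:input.nml]
# Install the namelist above as an input file for the command
source=namelist:run_config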