System sets and parallel executor v2 #1144

Merged 78 commits into bevyengine:master from executor-v2 on Feb 9, 2021

Conversation

@Ratysz (Contributor) commented Dec 24, 2020

It's a wall of text, I know - you don't have to read all of it; its purpose is to clarify the details should questions arise.

History

Prior discussion: this comment and on.

This PR builds on top of the incomplete SystemSet implementation branch that didn't make it into 0.4; it was split from the schedule-v2 branch before the merge.

That branch introduced system sets: SystemStage is now made up of one or more SystemSets; each has its own run criterion and contains the systems whose execution that criterion governs.

It also moved ShouldRun to the schedule level of the file hierarchy, and implemented run criteria and related methods as a reusable struct.

The rebase required to make this branch mergeable was messy beyond belief, so I opted to merge in master instead. The history is still rather messy, but at least the commits are coherent.

Glossary

  • exclusive system - a system that requires exclusive (mutable) access to the entirety of the world and/or resources. The new implementation automatically executes these sequentially, either at the start or the end (not fully implemented) of the stage.
  • parallel/parallelizable system - any system that doesn't require exclusive access. These are executed in the middle of the stage, as many as possible at once.
  • thread-local system - a parallelizable system that accesses one or several thread-local resources (the ones in !Send storage).
  • thread-agnostic system - a parallelizable system that doesn't access any thread-local resources.
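
To make the taxonomy concrete, here is a minimal standalone sketch (plain Rust, not Bevy's actual API; `SystemMeta` and all field names are illustrative) of how a system falls into one of these categories:

```rust
// Hypothetical classification of a system based on the access it declares.
#[derive(Debug, PartialEq)]
enum SystemKind {
    Exclusive,      // needs &mut World / &mut Resources; runs sequentially
    ThreadLocal,    // parallelizable, but must run on the main thread
    ThreadAgnostic, // parallelizable on any thread
}

struct SystemMeta {
    wants_exclusive_access: bool,
    touches_non_send_resources: bool,
}

fn classify(meta: &SystemMeta) -> SystemKind {
    if meta.wants_exclusive_access {
        SystemKind::Exclusive
    } else if meta.touches_non_send_resources {
        SystemKind::ThreadLocal
    } else {
        SystemKind::ThreadAgnostic
    }
}

fn main() {
    let sys = SystemMeta {
        wants_exclusive_access: false,
        touches_non_send_resources: true, // uses a !Send resource
    };
    assert_eq!(classify(&sys), SystemKind::ThreadLocal);
}
```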

Collateral changes

2021-01-23: this is not exhaustive, and outdated - the list does not cover commits made since the original post.

Necessary (and not so much) changes that aren't directly related to the implementation:

  • Renamed access::Access to ArchetypeAccess, made it private. No particular reason.
  • Removed ThreadLocalExecution, System::thread_local_execution() - this information is now encoded in ::archetype_component_access(), ::resource_access(), and new ::is_thread_local().
  • Extended TypeAccess with support for "reads/writes everything" to facilitate the above.
  • Added CondensedTypeAccess (sketched after this list).
  • Moved behavior of thread_local_func (command buffer merging) and similar fields from parallelizable System implementors into their run_exclusive() implementation.
  • Renamed ThreadLocalSystemFn and the module into_thread_local to ExclusiveSystemFn and into_exclusive respectively, to differentiate systems with this access pattern from actually thread-local systems.
  • Implemented IntoSystem for FnMut(&mut World) and FnMut(&mut Resources), both result in ExclusiveSystemFn. No particular reason.
  • Implemented ThreadLocal system parameter.
  • Implemented system insertion methods with labels and dependencies (with string labels for now) for SystemSet and SystemStage.
  • Added ShouldRun::NoAndLoop.
  • Renamed System::update() to System::update_access(), for clarity.
  • Changed system sets to store systems in NonNull rather than Box. Requires auditing and a sanity check.
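
The "condensing" behind CondensedTypeAccess can be illustrated with a standalone sketch (made-up names, with a u64 standing in for a real growable bitset): every distinct type accessed anywhere in the stage gets one bit, and each system's access set becomes a mask over that shared vector of types.

```rust
use std::any::TypeId;
use std::collections::HashSet;

// Build a bitmask over the stage-wide vector of accessed types; applying the
// mask to `all_types` reproduces this system's access set.
fn condense(all_types: &[TypeId], accessed: &HashSet<TypeId>) -> u64 {
    let mut mask = 0u64;
    for (bit, ty) in all_types.iter().enumerate() {
        if accessed.contains(ty) {
            mask |= 1 << bit;
        }
    }
    mask
}

fn main() {
    struct A;
    struct B;
    struct C;
    // All types accessed by any parallel system in the stage.
    let all = vec![TypeId::of::<A>(), TypeId::of::<B>(), TypeId::of::<C>()];
    // One system's access set: it touches A and C.
    let mut accessed = HashSet::new();
    accessed.insert(TypeId::of::<A>());
    accessed.insert(TypeId::of::<C>());
    assert_eq!(condense(&all, &accessed), 0b101);
}
```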

Algorithm

2021-01-23: this is slightly outdated now, due to changes to exclusive systems and availability of local-to-thread tasks.

Reading is not required. It's here because people were curious and picking apart someone else's code is unpleasant, annotated or not.

The idea is similar to those I used in yaks and this prototype. I also wrote an entire long thing about the topic.

Abridged:

  1. Evaluate run criteria of system sets.
  2. If any of the sets have changed, rebuild the cached scheduling data.
  3. Run exclusive systems that want to be at the start of stage.
  4. If the world's archetypes were changed since the last time parallel systems were run, update their access and the relevant part of scheduling data.
  5. Enter outer compute pool scope.
  6. Prepare parallel systems - spawn tasks, reset counters, queue systems with no dependencies, etc.
  7. Try running a queued thread-local system. Queue its dependants if this was the last system they were waiting on (see the counter sketch after this list).
  8. Enter inner compute pool scope.
  9. Try starting some queued thread-agnostic systems.
  10. See if any thread-agnostic systems have finished, queue their dependants if those were the last systems they were waiting on.
  11. Exit inner scope.
  12. If there are any queued or running systems, continue from step 7.
  13. Exit outer scope.
  14. Merge in command buffers.
  15. Run exclusive systems that want to be at the end of stage.
  16. Re-evaluate run criteria, continue from step 3 if needed.
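
The counter bookkeeping in steps 7, 10, and 12 boils down to the following sketch (plain Rust, no tasks or channels; field and function names are illustrative): each system starts with a counter equal to the number of its dependencies, and finishing a system decrements its dependants' counters, queueing any that reach zero.

```rust
// Illustrative scheduling data: for each system, who waits on it, and how many
// unfinished dependencies it still has this iteration.
struct SchedulingData {
    dependants: Vec<Vec<usize>>,
    dependencies_remaining: Vec<usize>,
}

fn on_system_finished(data: &mut SchedulingData, finished: usize, queue: &mut Vec<usize>) {
    for &dependant in &data.dependants[finished] {
        data.dependencies_remaining[dependant] -= 1;
        if data.dependencies_remaining[dependant] == 0 {
            queue.push(dependant);
        }
    }
}

fn main() {
    // System 0 has no dependencies; systems 1 and 2 both depend on system 0.
    let mut data = SchedulingData {
        dependants: vec![vec![1, 2], vec![], vec![]],
        dependencies_remaining: vec![0, 1, 1],
    };
    let mut queue = vec![0]; // systems with no dependencies are queued up front
    let finished = queue.pop().unwrap(); // pretend we just ran system 0
    on_system_finished(&mut data, finished, &mut queue);
    assert_eq!(queue, vec![1, 2]); // both dependants are now ready
}
```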

Full(ish):

  1. As needed, stage calls <ParallelSystemStageExecutor as SystemStageExecutor>::execute_stage() with &mut [SystemSet] that contains the systems with their uncondensed access sets, &HashMap<SystemIndex, Vec<SystemIndex>> that encodes the dependency tree, &mut World, &mut Resources.
  2. Run criteria of the sets are evaluated. If no sets should be run, the algorithm returns early; if no sets should be run yet any number of them request their run criteria to be reevaluated, it panics (to avoid looping infinitely; see the sketch after this list).
  3. If any of the system sets have had new systems inserted, pre-existing scheduling data is discarded, and fresh data is constructed:
    1. Any cached data is cleared.
    2. Systems of sets are iterated, collecting all distinct (not "access everything") accessed types of parallel systems into a pair of hash sets (one for archetype-components, another for resources). If the world's archetypes were also changed, each system's access is updated before collecting its types.
    3. Hash sets of types are converted into vectors. Bitsets are resized to be able to fit as many bits as there are parallel systems.
    4. Systems of sets are iterated again, sorting indices of systems with exclusive access into the list of exclusive systems, and generating partial parallel scheduling data by "condensing" access sets to bitsets (the bitset is a mask that produces the system's access set when applied to the vector of all types accessed by the stage). At the same time, a map of SystemIndex to the respective system's usize index in the parallel systems vector is populated.
    5. Dependencies map is iterated, inserting dependants' indices into their dependencies' parallel scheduling data, using the map from the previous step.
  4. All exclusive systems that want to run at the start of stage (currently all exclusive systems) are executed, if their parent system sets' criteria call for it.
  5. If the archetypes were changed since the last time parallel systems were run, said systems' accesses are updated, and their archetype-component bitsets are recondensed (types are collected and sets are converted into bitset masks).
  6. Compute pool scope is created and entered.
  7. Each parallel system is prepared for execution:
    1. Safety bit is reset.
    2. Whether the system should be run this iteration (the result of its parent system set's run criterion evaluation) is cached into a bitset.
    3. If the system should be run, its task is spawned into the scope (if it's not a thread-local system), and it's queued to run if it has no dependencies, or has its dependency counter reset otherwise.
  8. Running a thread-local system on the main thread is attempted:
    1. Queued systems are filtered by thread-local systems (bitset intersection).
    2. The first system out of those that passes the parallelization check (see below) is executed on the main thread.
    3. Its index is removed from the queued systems, and any of its dependants that should run this iteration have their dependency counters decremented by one.
    4. Dependants that have their counter reach zero are queued.
  9. A new inner compute pool scope is created and entered, and thread-agnostic system execution task is spawned into it:
    1. Queued systems are filtered by not thread-local systems (bitset difference).
    2. Any systems that pass the parallelization check (see below) are signalled to start via a channel and are marked as running.
    3. "Active access" sets representing the union of all running systems' access sets are extended with the newly running systems' sets.
    4. Newly running systems are removed from the queue.
    5. If there are any systems running, wait for at least one of them to finish, unmark it as running.
    6. Any of now finished systems' dependants that should run this iteration have their dependency counters decremented by one.
    7. Dependants that have their counter reach zero are queued.
    8. Active access sets are rebuilt from the access sets of all still running systems.
  10. When the inner scope is done, if there are any running or queued systems, continue from step 8. Otherwise, exit outer scope.
  11. Command buffers of parallel systems (those that should have run this iteration) are merged in by running their exclusive part.
  12. All exclusive systems that want to run at the end of stage (currently none of the exclusive systems) are executed, if their parent system sets' criteria call for it.
  13. System sets' run criteria that requested to be checked again are reevaluated. If no sets should be run, the algorithm returns; if no sets should be run yet any number of them request their run criteria to be reevaluated, it panics (to avoid looping infinitely). Otherwise, continue from step 4.
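
The run criteria handling in steps 2 and 13 amounts to the following sketch, assuming ShouldRun has the four variants below (this PR adds NoAndLoop): if nothing wants to run but something asked to be re-checked, re-evaluating can never make progress, so the executor panics instead of spinning forever.

```rust
enum ShouldRun {
    Yes,        // run this iteration, don't re-check afterwards
    No,         // skip this iteration, don't re-check afterwards
    YesAndLoop, // run this iteration, then re-evaluate the criterion
    NoAndLoop,  // skip this iteration, then re-evaluate the criterion
}

/// Returns whether another iteration over the stage's systems is needed.
/// Panics when no set should run yet at least one requested re-evaluation.
fn needs_iteration(results: &[ShouldRun]) -> bool {
    let mut any_should_run = false;
    let mut any_loop = false;
    for result in results {
        match result {
            ShouldRun::Yes => any_should_run = true,
            ShouldRun::YesAndLoop => {
                any_should_run = true;
                any_loop = true;
            }
            ShouldRun::NoAndLoop => any_loop = true,
            ShouldRun::No => (),
        }
    }
    if !any_should_run && any_loop {
        panic!("run criteria request re-evaluation but no set can run: infinite loop");
    }
    any_should_run
}

fn main() {
    assert!(needs_iteration(&[ShouldRun::Yes, ShouldRun::No]));
    assert!(needs_iteration(&[ShouldRun::YesAndLoop]));
    assert!(!needs_iteration(&[ShouldRun::No, ShouldRun::No]));
    // Only `No` and `NoAndLoop` left: this combination panics.
    let result = std::panic::catch_unwind(|| {
        needs_iteration(&[ShouldRun::No, ShouldRun::NoAndLoop])
    });
    assert!(result.is_err());
}
```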

Parallelization check

The system's condensed resource access is checked against the active resource access; ditto for archetype-component access. "Checked against" here means: if one side reads everything, the other can't write anything; otherwise, check that each side's bitset of writes is disjoint with the other's bitsets of reads and writes. That's it.
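
In code, with u64 bitmasks standing in for the real bitsets (a sketch with illustrative names, not Bevy's actual types), the check looks like this; it is applied once for resource access and once for archetype-component access:

```rust
// One side of the check: either a single system's condensed access, or the
// union of all currently running systems' access ("active access").
struct CondensedAccess {
    reads_all: bool, // "reads everything" requests are not condensed into bits
    reads: u64,
    writes: u64,
}

fn is_compatible(a: &CondensedAccess, b: &CondensedAccess) -> bool {
    // If one side reads everything, the other side must write nothing.
    if a.reads_all {
        return b.writes == 0;
    }
    if b.reads_all {
        return a.writes == 0;
    }
    // Otherwise each side's writes must be disjoint with the other side's
    // reads and writes.
    a.writes & (b.reads | b.writes) == 0 && b.writes & a.reads == 0
}

fn main() {
    let read_a = CondensedAccess { reads_all: false, reads: 0b01, writes: 0 };
    let write_a = CondensedAccess { reads_all: false, reads: 0, writes: 0b01 };
    let write_b = CondensedAccess { reads_all: false, reads: 0, writes: 0b10 };
    let read_everything = CondensedAccess { reads_all: true, reads: 0, writes: 0 };

    assert!(is_compatible(&read_a, &write_b)); // disjoint: can run together
    assert!(!is_compatible(&read_a, &write_a)); // read/write conflict
    assert!(is_compatible(&read_everything, &read_a)); // reads never conflict
    assert!(!is_compatible(&read_everything, &write_b)); // blocked by any write
}
```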

Notes

  • SystemIndex is a pair of usize indices: the system's parent set and its position in it (see the sketch after this list).
  • Labels used by the stage are not used by the executor, it instead relies solely on the SystemIndex mapping.
  • Hot-path machinery uses usize indices (corresponding to systems' position in the vector of parallel scheduling data) and bitsets.
  • Parallel systems can request immutable access to all of the world and/or resources and are seamlessly scheduled alongside the others.
  • Thread-agnostic systems are not required to finish in the same inner scope that starts them.
  • Systems belonging to sets whose criteria disable them are simply never queued, and their tasks (for thread-agnostic systems) are not spawned.
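
As a definition sketch of the first and third notes (field names assumed, not the actual declaration):

```rust
use std::collections::HashMap;

// Illustrative two-level index; the executor maps it to a flat usize
// (a position in the parallel scheduling data vector) for the hot path.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SystemIndex {
    set: usize,    // index of the parent system set
    system: usize, // position of the system within that set
}

fn main() {
    // Map from SystemIndex to position in the parallel scheduling data.
    let mut flat_indices = HashMap::new();
    flat_indices.insert(SystemIndex { set: 0, system: 0 }, 0usize);
    flat_indices.insert(SystemIndex { set: 1, system: 0 }, 1usize);
    assert_eq!(flat_indices[&SystemIndex { set: 1, system: 0 }], 1);
}
```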

Things to do

Numbered for discussion convenience:

  1. Audit unsafe code: ThreadLocal system parameter implementation, NonNull use in SystemSet.
  2. Targeted tests of new behaviors and invariants. Smoke tests for future development convenience.
  3. Dependency tree verification. I think the best place to do it is in SystemStage::run_once(), where all the necessary information is first seen in one place. mostly done, still refactoring
  4. Exclusive system dependencies and sorting. Should be done at the same time as point 3. There are stubs in the implementation that allow executing exclusive systems at either start of stage or end of stage, but no way to specify when such a system would like to be run. I think this could be handled by dependencies: exclusive systems that depend on parallel systems are automatically shoved to the end of stage, and vice versa; conflicts result in a panic during tree verification. partially irrelevant due to full exclusive/parallel schism, sorting is topological now but will
  5. Serial stage executor. As of now, it's barely functional(?) and does no dependency-based sorting whatsoever; this should probably be handled after point 4. mostly done, still refactoring
  6. Decide if we want the safety bit. I've never seen that check fail, but it's about as lightweight as it can be, so I say we keep it. removed, it became completely trivial with the introduction of local-to-thread tasks
  7. Decide how to handle the run criteria infinite loop (when all criteria evaluate to a combination of ShouldRun::No and ShouldRun::NoAndLoop). Current strategy is to panic, could use a more meaningful panic message.
  8. Decide if we want to sort parallel systems' command buffer merging in any way. Currently, it's not defined, but should be ordered by set then insertion order into set. it's topologically sorted now, exploiting a side-effect of validating the dependency graph
  9. Decide if we want to merge in parallel systems' command buffers before or after end-of-stage exclusives are run. Currently, it's before. we have both options now, thanks to point 12
  10. Consider "optional dependencies". Right now, if system A depends on system B and B gets disabled by a run criterion of its parent system set, A will not run and there is no way to make it run without running B. This should not be the case: execution order dependencies should only specify execution order and never execution fact. all dependencies are now "soft", disabling a system does not disable its dependants; considering supporting "hard" dependencies
  11. Consider simplifying the TypeAccess implementation by merging AccessSet into it. See CondensedTypeAccess for comparison. done in c4e8261
  12. Consider better API for system insertion into stage/set. Related: Improved system insertion syntax #1060. adopted the "system descriptor" builder pattern
  13. Consider system labels that aren't &'static str.
  14. Plumb the labels/dependencies API throughout the hierarchy; best done after point 12.
  15. Actually make use of system sets in StateStage, etc.
  16. Tighten API in general: currently some of the things that have no business being public are public.
  17. Minor refactors and internal tweaks, mostly for my own peace of mind regarding maintainability.
  18. Documentation, examples. I haven't done any public documentation, but I think there's plenty of internal documentation of the executor itself.
  19. Propagate the newly available/required patterns to the rest of the engine and examples, benchmark. This should be done after the API-related concerns are addressed, of course. For what it's worth, the examples I've tried running so far all have Just Worked.

This should cover the vast majority of things, but I'm bound to have missed something. I will slowly start on some of the items in the todo list; however, most of them will benefit from discussion at this point, and some can be tackled by someone else and/or in a different PR.

Happy holidays!

cart and others added 30 commits December 5, 2020 12:12
Renamed `access::Access` to `ArchetypeAccess`, made it private. No particular reason.
Removed `ThreadLocalExecution`, `System::thread_local_execution()` - this information is now encoded in `::archetype_component_access()`, `::resource_access()`, and new `::is_thread_local()`.
Extended `TypeAccess` with support for "reads/writes everything" to facilitate the above.
Renamed `ThreadLocalSystemFn` and the module `into_thread_local` to `ExclusiveSystemFn` and `into_exclusive` respectively, to differentiate systems with this access pattern from actually thread-local systems.
Implemented `IntoSystem` for `FnMut(&mut World)` and `FnMut(&mut Resources)`, both result in `ExclusiveSystemFn`. Just because. Probably in vain.
Rewrote `SerialSystemStageExecutor` to account for `ThreadLocalExecution` removal.
Implemented `ThreadLocal` system parameter.
Implemented mk1 label and dependencies methods on `SystemSet` and `SystemStage`.
Added `ShouldRun::NoAndLoop`.
Renamed `System::update()` to `System::update_access()`, for clarity.
Plumbed labels and dependencies into the stage executor, via `SystemStage::labels` (a stage-global map of label to system index), `SystemSet::system_label()`, and `SystemSet::system_dependencies()`.
Rewrote `SerialSystemStageExecutor` to support exclusive systems at start of stage as well; unfinished.
Began implementation; doesn't do anything yet because the pathway from data it gets to data it wants is not implemented.
Changed system sets to store systems in `NonNull` rather than `Box`. Requires auditing and a sanity check.
Minor cleanup.
Partially implemented run criteria handling.
Minor refactor.
Fixed a future bug in not yet implemented `can_start_now()`.
Accounted for run criteria for parallel systems: if a system shouldn't run in this iteration, its task is not spawned and it's never queued to run.
Added `CondensedTypeAccess` and related machinery.
Moved the dependency resolution burden from executor to the stage; verification is not yet implemented.
Systems with no dependencies are immediately queued to run at the start of each iteration.
Implemented rebuilding scheduling data and updating access information.
Implemented `can_start_now()`.

Running tests violates access, right now.
# Conflicts:
#	crates/bevy_app/src/app.rs
#	crates/bevy_app/src/app_builder.rs
#	crates/bevy_asset/src/lib.rs
#	crates/bevy_core/src/time/fixed_timestep.rs
#	crates/bevy_ecs/Cargo.toml
#	crates/bevy_ecs/src/lib.rs
#	crates/bevy_ecs/src/schedule/mod.rs
#	crates/bevy_ecs/src/schedule/stage.rs
#	crates/bevy_ecs/src/schedule/stage_executor.rs
#	crates/bevy_ecs/src/schedule/state.rs
#	crates/bevy_ecs/src/system/into_system.rs
#	crates/bevy_ecs/src/system/into_thread_local.rs
#	crates/bevy_ecs/src/system/system.rs
#	crates/bevy_ecs/src/system/system_chaining.rs
#	crates/bevy_ecs/src/system/system_param.rs
#	crates/bevy_input/src/lib.rs
#	crates/bevy_render/src/lib.rs
#	crates/bevy_render/src/render_graph/system.rs
#	crates/bevy_scene/src/lib.rs
#	crates/bevy_transform/src/hierarchy/hierarchy_maintenance_system.rs
#	crates/bevy_transform/src/transform_propagate_system.rs
#	crates/bevy_ui/src/lib.rs
#	crates/bevy_ui/src/update.rs
#	crates/bevy_winit/src/lib.rs
#	examples/2d/texture_atlas.rs
#	examples/ecs/ecs_guide.rs
#	examples/ecs/fixed_timestep.rs
#	examples/ecs/state.rs
@cart (Member) commented Feb 7, 2021

Yup this seems correct / works for me:

[screenshot]

I'm pushing my changes now. If someone else can verify that this works for them too, that would be great.

@cart (Member) commented Feb 7, 2021

Ah wait, I just missed the ambiguity report. Everything still makes sense:

[screenshot]

@alice-i-cecile (Member) commented:
Testing again on your branch @cart (commit hash: e23c8a8). I'm getting fixed behavior over 10 tries each of the breakout and ui examples :)

@Ratysz (Contributor, Author) commented Feb 7, 2021

Can confirm: before e23c8a8 the score text is sometimes not visible (strange that I didn't notice that before); after, it's always visible (for N=25, anyway).

@cart (Member) commented Feb 7, 2021

Alrighty, I think it is time to merge this. I do think it's worth investing some time in optimization. It's a bit of a shame that most "simple" games like "breakout" (and "simple" benchmarks) will regress in perf. Breakout frame time regresses by ~0.1 milliseconds (0.6% of our total frame time budget at 60fps).

@Ratysz (Contributor, Author) commented Feb 7, 2021

Simple games should probably be using SystemStage::single_threaded() anyway. Why would anyone want to parallelize one ball and one paddle? There's no way that code is heavier than the multithreading machinery.

@cart (Member) commented Feb 7, 2021

In theory: yes, but in practice people will use the default (which should almost certainly be parallel or we're doing something wrong).

@Ratysz (Contributor, Author) commented Feb 8, 2021

I've added more tests to ambiguity detection. I couldn't yet fix the bug we found earlier, so CI should fail. If someone wants to take a stab at it, the relevant function is find_ambiguities in stage.rs. Also, please verify that the relevant tests themselves aren't broken, because I'm not certain of anything at this point.

@Ratysz (Contributor, Author) commented Feb 9, 2021

It should be fixed now - it's passing the tests (some are pretty convoluted). It's starting to remind me of matrix operations more and more; a path towards further optimisations, perhaps.

@cart cart merged commit d5a7330 into bevyengine:master Feb 9, 2021
rmsc pushed a commit to rmsc/bevy that referenced this pull request Feb 10, 2021
System sets and parallel executor v2
@Ratysz Ratysz deleted the executor-v2 branch February 19, 2021 12:17
Labels: A-ECS (Entities, components, systems, and events), C-Enhancement (A new feature)