System sets and parallel executor v2 #1144

Merged 78 commits into bevyengine:master from executor-v2 on Feb 9, 2021

Conversation

@Ratysz (Contributor) commented Dec 24, 2020

It's a wall of text, I know - you don't have to read all of it; its purpose is to clarify the details should questions arise.

History

Prior discussion: this comment and on.

This PR builds on top of the incomplete SystemSet implementation branch that didn't make it into 0.4; it was split from the schedule-v2 branch before the merge.

That branch introduced system sets: SystemStage is now made up of one or more SystemSets; each has its own run criterion and contains the systems whose execution that criterion governs.

It also moved ShouldRun to the schedule level of the file hierarchy, and implemented run criteria and related methods as a reusable struct.

The rebase required to make this branch mergeable was messy beyond belief, so I opted to merge in master instead. The history is still rather messy, but at least the commits are coherent.

Glossary

  • exclusive system - a system that requires exclusive (mutable) access to the entirety of the world and/or resources. The new implementation automatically executes these sequentially, either at the start or the end (not fully implemented) of the stage.
  • parallel/parallelizable system - any system that doesn't require exclusive access. These are executed in the middle of the stage, as many as possible at once.
  • thread-local system - a parallelizable system that accesses one or several thread-local resources (the ones in !Send storage).
  • thread-agnostic system - a parallelizable system that doesn't access any thread-local resources.
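
To make the taxonomy concrete, here is a minimal standalone sketch (plain Rust, not Bevy's actual API; `SystemMeta` and all field names are illustrative) of how a system falls into one of these categories:

```rust
// Hypothetical classification of a system based on the access it declares.
#[derive(Debug, PartialEq)]
enum SystemKind {
    Exclusive,      // needs &mut World / &mut Resources; runs sequentially
    ThreadLocal,    // parallelizable, but must run on the main thread
    ThreadAgnostic, // parallelizable on any thread
}

struct SystemMeta {
    wants_exclusive_access: bool,
    touches_non_send_resources: bool,
}

fn classify(meta: &SystemMeta) -> SystemKind {
    if meta.wants_exclusive_access {
        SystemKind::Exclusive
    } else if meta.touches_non_send_resources {
        SystemKind::ThreadLocal
    } else {
        SystemKind::ThreadAgnostic
    }
}

fn main() {
    let sys = SystemMeta {
        wants_exclusive_access: false,
        touches_non_send_resources: true, // uses a !Send resource
    };
    assert_eq!(classify(&sys), SystemKind::ThreadLocal);
}
```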

Collateral changes

2021-01-23: this is not exhaustive, and outdated - the list does not cover commits made since the original post.

Necessary (and not so much) changes that aren't directly related to the implementation:

  • Renamed access::Access to ArchetypeAccess, made it private. No particular reason.
  • Removed ThreadLocalExecution, System::thread_local_execution() - this information is now encoded in ::archetype_component_access(), ::resource_access(), and new ::is_thread_local().
  • Extended TypeAccess with support for "reads/writes everything" to facilitate the above.
  • Added CondensedTypeAccess (sketched after this list).
  • Moved behavior of thread_local_func (command buffer merging) and similar fields from parallelizable System implementors into their run_exclusive() implementation.
  • Renamed ThreadLocalSystemFn and the module into_thread_local to ExclusiveSystemFn and into_exclusive respectively, to differentiate systems with this access pattern from actually thread-local systems.
  • Implemented IntoSystem for FnMut(&mut World) and FnMut(&mut Resources), both result in ExclusiveSystemFn. No particular reason.
  • Implemented ThreadLocal system parameter.
  • Implemented system insertion methods with labels and dependencies (with string labels for now) for SystemSet and SystemStage.
  • Added ShouldRun::NoAndLoop.
  • Renamed System::update() to System::update_access(), for clarity.
  • Changed system sets to store systems in NonNull rather than Box. Requires auditing and a sanity check.
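
The "condensing" behind CondensedTypeAccess can be illustrated with a standalone sketch (made-up names, with a u64 standing in for a real growable bitset): every distinct type accessed anywhere in the stage gets one bit, and each system's access set becomes a mask over that shared vector of types.

```rust
use std::any::TypeId;
use std::collections::HashSet;

// Build a bitmask over the stage-wide vector of accessed types; applying the
// mask to `all_types` reproduces this system's access set.
fn condense(all_types: &[TypeId], accessed: &HashSet<TypeId>) -> u64 {
    let mut mask = 0u64;
    for (bit, ty) in all_types.iter().enumerate() {
        if accessed.contains(ty) {
            mask |= 1 << bit;
        }
    }
    mask
}

fn main() {
    struct A;
    struct B;
    struct C;
    // All types accessed by any parallel system in the stage.
    let all = vec![TypeId::of::<A>(), TypeId::of::<B>(), TypeId::of::<C>()];
    // One system's access set: it touches A and C.
    let mut accessed = HashSet::new();
    accessed.insert(TypeId::of::<A>());
    accessed.insert(TypeId::of::<C>());
    assert_eq!(condense(&all, &accessed), 0b101);
}
```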

Algorithm

2021-01-23: this is slightly outdated now, due to changes to exclusive systems and availability of local-to-thread tasks.

Reading is not required. It's here because people were curious and picking apart someone else's code is unpleasant, annotated or not.

The idea is similar to those I used in yaks and this prototype. I also wrote an entire long thing about the topic.

Abridged:

  1. Evaluate run criteria of system sets.
  2. If any of the sets have changed, rebuild the cached scheduling data.
  3. Run exclusive systems that want to be at the start of stage.
  4. If the world's archetypes were changed since the last time parallel systems were run, update their access and the relevant part of scheduling data.
  5. Enter outer compute pool scope.
  6. Prepare parallel systems - spawn tasks, reset counters, queue systems with no dependencies, etc.
  7. Try running a queued thread-local system. Queue its dependants if this was the last system they were waiting on (see the counter sketch after this list).
  8. Enter inner compute pool scope.
  9. Try starting some queued thread-agnostic systems.
  10. See if any thread-agnostic systems have finished, queue their dependants if those were the last systems they were waiting on.
  11. Exit inner scope.
  12. If there are any queued or running systems, continue from step 7.
  13. Exit outer scope.
  14. Merge in command buffers.
  15. Run exclusive systems that want to be at the end of stage.
  16. Re-evaluate run criteria, continue from step 3 if needed.
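
The counter bookkeeping in steps 7, 10, and 12 boils down to the following sketch (plain Rust, no tasks or channels; field and function names are illustrative): each system starts with a counter equal to the number of its dependencies, and finishing a system decrements its dependants' counters, queueing any that reach zero.

```rust
// Illustrative scheduling data: for each system, who waits on it, and how many
// unfinished dependencies it still has this iteration.
struct SchedulingData {
    dependants: Vec<Vec<usize>>,
    dependencies_remaining: Vec<usize>,
}

fn on_system_finished(data: &mut SchedulingData, finished: usize, queue: &mut Vec<usize>) {
    for &dependant in &data.dependants[finished] {
        data.dependencies_remaining[dependant] -= 1;
        if data.dependencies_remaining[dependant] == 0 {
            queue.push(dependant);
        }
    }
}

fn main() {
    // System 0 has no dependencies; systems 1 and 2 both depend on system 0.
    let mut data = SchedulingData {
        dependants: vec![vec![1, 2], vec![], vec![]],
        dependencies_remaining: vec![0, 1, 1],
    };
    let mut queue = vec![0]; // systems with no dependencies are queued up front
    let finished = queue.pop().unwrap(); // pretend we just ran system 0
    on_system_finished(&mut data, finished, &mut queue);
    assert_eq!(queue, vec![1, 2]); // both dependants are now ready
}
```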

Full(ish):

  1. As needed, stage calls <ParallelSystemStageExecutor as SystemStageExecutor>::execute_stage() with &mut [SystemSet] that contains the systems with their uncondensed access sets, &HashMap<SystemIndex, Vec<SystemIndex>> that encodes the dependency tree, &mut World, &mut Resources.
  2. Run criteria of the sets are evaluated. If no sets should be run, the algorithm returns early; if no sets should be run yet any number of them request their run criteria to be reevaluated, it panics (to avoid looping infinitely; see the sketch after this list).
  3. If any of the system sets have had new systems inserted, pre-existing scheduling data is discarded, and fresh data is constructed:
    1. Any cached data is cleared.
    2. Systems of sets are iterated, collecting all distinct (not "access everything") accessed types of parallel systems into a pair of hash sets (one for archetype-components, another for resources). If the world's archetypes were also changed, each system's access is updated before collecting its types.
    3. Hash sets of types are converted into vectors. Bitsets are resized to be able to fit as many bits as there are parallel systems.
    4. Systems of sets are iterated again, sorting indices of systems with exclusive access into the list of exclusive systems, and generating partial parallel scheduling data by "condensing" access sets to bitsets (the bitset is a mask that produces the system's access set when applied to the vector of all types accessed by the stage). At the same time, a map of SystemIndex to the respective system's usize index in the parallel systems vector is populated.
    5. Dependencies map is iterated, inserting dependants' indices into their dependencies' parallel scheduling data, using the map from the previous step.
  4. All exclusive systems that want to run at the start of stage (currently all exclusive systems) are executed, if their parent system sets' criteria call for it.
  5. If the archetypes were changed since the last time parallel systems were run, said systems' accesses are updated, and their archetype-component bitsets are recondensed (types are collected and sets are converted into bitset masks).
  6. Compute pool scope is created and entered.
  7. Each parallel system is prepared for execution:
    1. Safety bit is reset.
    2. Whether the system should be run this iteration (the result of its parent system set's run criterion evaluation) is cached into a bitset.
    3. If the system should be run, its task is spawned into the scope (if it's not a thread-local system), and it's queued to run if it has no dependencies, or has its dependency counter reset otherwise.
  8. Running a thread-local system on the main thread is attempted:
    1. Queued systems are filtered by thread-local systems (bitset intersection).
    2. The first system out of those that passes the parallelization check (see below) is executed on the main thread.
    3. Its index is removed from the queued systems, and any of its dependants that should run this iteration have their dependency counters decremented by one.
    4. Dependants that have their counter reach zero are queued.
  9. A new inner compute pool scope is created and entered, and thread-agnostic system execution task is spawned into it:
    1. Queued systems are filtered by not thread-local systems (bitset difference).
    2. Any systems that pass the parallelization check (see below) are signalled to start via a channel and are marked as running.
    3. "Active access" sets representing the union of all running systems' access sets are extended with the newly running systems' sets.
    4. Newly running systems are removed from the queue.
    5. If there are any systems running, wait for at least one of them to finish, unmark it as running.
    6. Any of now finished systems' dependants that should run this iteration have their dependency counters decremented by one.
    7. Dependants that have their counter reach zero are queued.
    8. Active access sets are rebuilt from the access sets of all still running systems.
  10. When the inner scope is done, if there are any running or queued systems, continue from step 8. Otherwise, exit outer scope.
  11. Command buffers of parallel systems (those that should have run this iteration) are merged in by running their exclusive part.
  12. All exclusive systems that want to run at the end of stage (currently none of the exclusive systems) are executed, if their parent system sets' criteria call for it.
  13. System sets' run criteria that requested to be checked again are reevaluated. If no sets should be run, the algorithm returns; if no sets should be run yet any number of them request their run criteria to be reevaluated, it panics (to avoid looping infinitely). Otherwise, continue from step 4.
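
The run criteria handling in steps 2 and 13 amounts to the following sketch, assuming ShouldRun has the four variants below (this PR adds NoAndLoop): if nothing wants to run but something asked to be re-checked, re-evaluating can never make progress, so the executor panics instead of spinning forever.

```rust
enum ShouldRun {
    Yes,        // run this iteration, don't re-check afterwards
    No,         // skip this iteration, don't re-check afterwards
    YesAndLoop, // run this iteration, then re-evaluate the criterion
    NoAndLoop,  // skip this iteration, then re-evaluate the criterion
}

/// Returns whether another iteration over the stage's systems is needed.
/// Panics when no set should run yet at least one requested re-evaluation.
fn needs_iteration(results: &[ShouldRun]) -> bool {
    let mut any_should_run = false;
    let mut any_loop = false;
    for result in results {
        match result {
            ShouldRun::Yes => any_should_run = true,
            ShouldRun::YesAndLoop => {
                any_should_run = true;
                any_loop = true;
            }
            ShouldRun::NoAndLoop => any_loop = true,
            ShouldRun::No => (),
        }
    }
    if !any_should_run && any_loop {
        panic!("run criteria request re-evaluation but no set can run: infinite loop");
    }
    any_should_run
}

fn main() {
    assert!(needs_iteration(&[ShouldRun::Yes, ShouldRun::No]));
    assert!(needs_iteration(&[ShouldRun::YesAndLoop]));
    assert!(!needs_iteration(&[ShouldRun::No, ShouldRun::No]));
    // Only `No` and `NoAndLoop` left: this combination panics.
    let result = std::panic::catch_unwind(|| {
        needs_iteration(&[ShouldRun::No, ShouldRun::NoAndLoop])
    });
    assert!(result.is_err());
}
```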

Parallelization check

The system's condensed resource access is checked against the active resource access; ditto for archetype-component access. "Checked against" here means: if one side reads everything, the other can't write anything; otherwise, check that each side's bitset of writes is disjoint with the other's bitsets of reads and writes. That's it.
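
In code, with u64 bitmasks standing in for the real bitsets (a sketch with illustrative names, not Bevy's actual types), the check looks like this; it is applied once for resource access and once for archetype-component access:

```rust
// One side of the check: either a single system's condensed access, or the
// union of all currently running systems' access ("active access").
struct CondensedAccess {
    reads_all: bool, // "reads everything" requests are not condensed into bits
    reads: u64,
    writes: u64,
}

fn is_compatible(a: &CondensedAccess, b: &CondensedAccess) -> bool {
    // If one side reads everything, the other side must write nothing.
    if a.reads_all {
        return b.writes == 0;
    }
    if b.reads_all {
        return a.writes == 0;
    }
    // Otherwise each side's writes must be disjoint with the other side's
    // reads and writes.
    a.writes & (b.reads | b.writes) == 0 && b.writes & a.reads == 0
}

fn main() {
    let read_a = CondensedAccess { reads_all: false, reads: 0b01, writes: 0 };
    let write_a = CondensedAccess { reads_all: false, reads: 0, writes: 0b01 };
    let write_b = CondensedAccess { reads_all: false, reads: 0, writes: 0b10 };
    let read_everything = CondensedAccess { reads_all: true, reads: 0, writes: 0 };

    assert!(is_compatible(&read_a, &write_b)); // disjoint: can run together
    assert!(!is_compatible(&read_a, &write_a)); // read/write conflict
    assert!(is_compatible(&read_everything, &read_a)); // reads never conflict
    assert!(!is_compatible(&read_everything, &write_b)); // blocked by any write
}
```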

Notes

  • SystemIndex is a pair of usize indices: the system's parent set and its position in it (see the sketch after this list).
  • Labels used by the stage are not used by the executor, it instead relies solely on the SystemIndex mapping.
  • Hot-path machinery uses usize indices (corresponding to systems' position in the vector of parallel scheduling data) and bitsets.
  • Parallel systems can request immutable access to all of the world and/or resources and are seamlessly scheduled alongside the others.
  • Thread-agnostic systems are not required to finish in the same inner scope that starts them.
  • Systems belonging to sets whose criteria disable them are simply never queued, and their tasks (for thread-agnostic systems) are not spawned.
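
As a definition sketch of the first and third notes (field names assumed, not the actual declaration):

```rust
use std::collections::HashMap;

// Illustrative two-level index; the executor maps it to a flat usize
// (a position in the parallel scheduling data vector) for the hot path.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SystemIndex {
    set: usize,    // index of the parent system set
    system: usize, // position of the system within that set
}

fn main() {
    // Map from SystemIndex to position in the parallel scheduling data.
    let mut flat_indices = HashMap::new();
    flat_indices.insert(SystemIndex { set: 0, system: 0 }, 0usize);
    flat_indices.insert(SystemIndex { set: 1, system: 0 }, 1usize);
    assert_eq!(flat_indices[&SystemIndex { set: 1, system: 0 }], 1);
}
```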

Things to do

Numbered for discussion convenience:

  1. Audit unsafe code: ThreadLocal system parameter implementation, NonNull use in SystemSet.
  2. Targeted tests of new behaviors and invariants. Smoke tests for future development convenience.
  3. Dependency tree verification. I think the best place to do it is in SystemStage::run_once(), where all the necessary information is first seen in one place. mostly done, still refactoring
  4. Exclusive system dependencies and sorting. Should be done at the same time as point 3. There are stubs in the implementation that allow executing exclusive systems at either start of stage or end of stage, but no way to specify when such a system would like to be run. I think this could be handled by dependencies: exclusive systems that depend on parallel systems are automatically shoved to the end of stage, and vice versa; conflicts result in a panic during tree verification. partially irrelevant due to full exclusive/parallel schism, sorting is topological now but will
  5. Serial stage executor. As of now, it's barely functional(?) and does no dependency-based sorting whatsoever; this should probably be handled after point 4. mostly done, still refactoring
  6. Decide if we want the safety bit. I've never seen that check fail, but it's about as lightweight as it can be, so I say we keep it. removed, it became completely trivial with the introduction of local-to-thread tasks
  7. Decide how to handle the run criteria infinite loop (when all criteria evaluate to a combination of ShouldRun::No and ShouldRun::NoAndLoop). Current strategy is to panic, could use a more meaningful panic message.
  8. Decide if we want to sort parallel systems' command buffer merging in any way. Currently, it's not defined, but should be ordered by set then insertion order into set. it's topologically sorted now, exploiting a side-effect of validating the dependency graph
  9. Decide if we want to merge in parallel systems' command buffers before or after end-of-stage exclusives are run. Currently, it's before. we have both options now, thanks to point 12
  10. Consider "optional dependencies". Right now, if system A depends on system B and B gets disabled by a run criterion of its parent system set, A will not run and there is no way to make it run without running B. This should not be the case: execution order dependencies should only specify execution order and never execution fact. all dependencies are now "soft", disabling a system does not disable its dependants; considering supporting "hard" dependencies
  11. Consider simplifying the TypeAccess implementation by merging AccessSet into it. See CondensedTypeAccess for comparison. done in c4e8261
  12. Consider better API for system insertion into stage/set. Related: Improved system insertion syntax #1060. adopted the "system descriptor" builder pattern
  13. Consider system labels that aren't &'static str.
  14. Plumb the labels/dependencies API throughout the hierarchy; best done after point 12.
  15. Actually make use of system sets in StateStage, etc.
  16. Tighten API in general: currently some of the things that have no business being public are public.
  17. Minor refactors and internal tweaks, mostly for my own peace of mind regarding maintainability.
  18. Documentation, examples. I haven't done any public documentation, but I think there's plenty of internal documentation of the executor itself.
  19. Propagate the newly available/required patterns to the rest of the engine and examples, benchmark. This should be done after the API-related concerns are addressed, of course. For what it's worth, the examples I've tried running so far all have Just Worked.

This should cover the vast majority of things, but I'm bound to have missed something. I will slowly start on some of the items in the todo list; however, most of them will benefit from discussion at this point, and some can be tackled by someone else and/or in a different PR.

Happy holidays!

cart and others added 30 commits December 5, 2020 12:12
Renamed `access::Access` to `ArchetypeAccess`, made it private. No particular reason.
Removed `ThreadLocalExecution`, `System::thread_local_execution()` - this information is now encoded in `::archetype_component_access()`, `::resource_access()`, and new `::is_thread_local()`.
Extended `TypeAccess` with support for "reads/writes everything" to facilitate the above.
Renamed `ThreadLocalSystemFn` and the module `into_thread_local` to `ExclusiveSystemFn` and `into_exclusive` respectively, to differentiate systems with this access pattern from actually thread-local systems.
Implemented `IntoSystem` for `FnMut(&mut World)` and `FnMut(&mut Resources)`, both result in `ExclusiveSystemFn`. Just because. Probably in vain.
Rewrote `SerialSystemStageExecutor` to account for `ThreadLocalExecution` removal.
Implemented `ThreadLocal` system parameter.
Implemented mk1 label and dependencies methods on `SystemSet` and `SystemStage`.
Added `ShouldRun::NoAndLoop`.
Renamed `System::update()` to `System::update_access()`, for clarity.
Plumbed labels and dependencies into the stage executor, via `SystemStage::labels` (a stage-global map of label to system index), `SystemSet::system_label()`, and `SystemSet::system_dependencies()`.
Rewrote `SerialSystemStageExecutor` to support exclusive systems at start of stage as well; unfinished.
Began implementation; doesn't do anything yet because the pathway from data it gets to data it wants is not implemented.
Changed system sets to store systems in `NonNull` rather than `Box`. Requires auditing and a sanity check.
Minor cleanup.
Partially implemented run criteria handling.
Minor refactor.
Fixed a future bug in not yet implemented `can_start_now()`.
Accounted for run criteria for parallel systems: if a system shouldn't run in this iteration, its task is not spawned and it's never queued to run.
Added `CondensedTypeAccess` and related machinery.
Moved the dependency resolution burden from executor to the stage; verification is not yet implemented.
Systems with no dependencies are immediately queued to run at the start of each iteration.
Implemented rebuilding scheduling data and updating access information.
Implemented `can_start_now()`.

Running tests violates access, right now.
# Conflicts:
#	crates/bevy_app/src/app.rs
#	crates/bevy_app/src/app_builder.rs
#	crates/bevy_asset/src/lib.rs
#	crates/bevy_core/src/time/fixed_timestep.rs
#	crates/bevy_ecs/Cargo.toml
#	crates/bevy_ecs/src/lib.rs
#	crates/bevy_ecs/src/schedule/mod.rs
#	crates/bevy_ecs/src/schedule/stage.rs
#	crates/bevy_ecs/src/schedule/stage_executor.rs
#	crates/bevy_ecs/src/schedule/state.rs
#	crates/bevy_ecs/src/system/into_system.rs
#	crates/bevy_ecs/src/system/into_thread_local.rs
#	crates/bevy_ecs/src/system/system.rs
#	crates/bevy_ecs/src/system/system_chaining.rs
#	crates/bevy_ecs/src/system/system_param.rs
#	crates/bevy_input/src/lib.rs
#	crates/bevy_render/src/lib.rs
#	crates/bevy_render/src/render_graph/system.rs
#	crates/bevy_scene/src/lib.rs
#	crates/bevy_transform/src/hierarchy/hierarchy_maintenance_system.rs
#	crates/bevy_transform/src/transform_propagate_system.rs
#	crates/bevy_ui/src/lib.rs
#	crates/bevy_ui/src/update.rs
#	crates/bevy_winit/src/lib.rs
#	examples/2d/texture_atlas.rs
#	examples/ecs/ecs_guide.rs
#	examples/ecs/fixed_timestep.rs
#	examples/ecs/state.rs
@cart (Member) commented Feb 7, 2021

Yup this seems correct / works for me:

[screenshot]

I'm pushing my changes now. If someone else can verify that this works for them too, that would be great.

@cart (Member) commented Feb 7, 2021

Ah wait, I just missed the ambiguity report. Everything still makes sense:

[screenshot]

@alice-i-cecile (Member) commented:
Testing again on your branch @cart (commit hash: e23c8a8). I'm getting fixed behavior over 10 tries each of the breakout and ui examples :)

@Ratysz (Contributor, Author) commented Feb 7, 2021

Can confirm: before e23c8a8 the score text is sometimes not visible (strange that I didn't notice that before); after, it's always visible (for N=25, anyway).

@cart (Member) commented Feb 7, 2021

Alrighty, I think it is time to merge this. I do think it's worth investing some time in optimization. It's a bit of a shame that most "simple" games like "breakout" (and "simple" benchmarks) will regress in perf. Breakout frame time regresses by ~0.1 milliseconds (0.6% of our total frame time budget at 60fps).

@Ratysz (Contributor, Author) commented Feb 7, 2021

Simple games should probably be using SystemStage::single_threaded() anyway. Why would anyone want to parallelize one ball and one paddle? There's no way that code is heavier than the multithreading machinery.

@cart (Member) commented Feb 7, 2021

In theory: yes, but in practice people will use the default (which should almost certainly be parallel or we're doing something wrong).

@Ratysz (Contributor, Author) commented Feb 8, 2021

I've added more tests to ambiguity detection. I couldn't yet fix the bug we found earlier, so CI should fail. If someone wants to take a stab at it, the relevant function is find_ambiguities in stage.rs. Also, please verify that the relevant tests themselves aren't broken, because I'm not certain of anything at this point.

@Ratysz (Contributor, Author) commented Feb 9, 2021

It should be fixed now - it's passing the tests (some are pretty convoluted). It's starting to remind me of matrix operations more and more; a path towards further optimisations, perhaps.

@cart cart merged commit d5a7330 into bevyengine:master Feb 9, 2021
rmsc pushed a commit to rmsc/bevy that referenced this pull request Feb 10, 2021
System sets and parallel executor v2
@Ratysz Ratysz deleted the executor-v2 branch February 19, 2021 12:17
Labels: A-ECS (Entities, components, systems, and events), C-Enhancement (A new feature)