Quickly find all blackhole directories with a huge amount of filesystem entries in a flat structure

Overview

findlargedir

GitHub license GitHub release Rust Report Card

About

Findlargedir is a tool specifically written to help quickly identify "black hole" directories on an any filesystem having more than 100k entries in a single flat structure. When a directory has many entries (directories or files), getting directory listing gets slower and slower, impacting performance of all processes attempting to get a directory listing (for instance to delete some files and/or to find some specific files). Processes reading large directory inodes get frozen while doing so and end up in the uninterruptible sleep ("D" state) for longer and longer periods of time. Depending on the filesystem, this might start to become visible with 100k entries and starts being a very noticeable performance impact with 1M+ entries.

Such directories mostly cannot shrink back even if content gets cleaned up due to the fact that most Linux and Un*x filesystems do not support directory inode shrinking (for instance very common ext3/ext4). This often happens with forgotten Web sessions directory (PHP sessions folder where GC interval was configured to several days), various cache folders (CMS compiled templates and caches), POSIX filesystem emulating object storage, etc.

Program will attempt to identify any number of such events and report on them based on calibration, ie. how many assumed directory entries are packed in each directory inode for each filesystem. While doing so, it will determine directory inode growth ratio to number of entries/inodes and will use that ratio to quickly scan filesystem, avoiding doing expensive/slow directory lookups. While there are many tools that scan the filesystem (find, du, ncdu, etc.), none of them use heuristics to avoid expensive lookups, since they are designed to be fully accurate, while this tool is meant to use heuristics and alert on issues without getting stuck on problematic folders.

Program will not follow symlinks and requires r/w permissions to calibrate directory to be able to calculate a directory inode size to number of entries ratio and estimate a number of entries in a directory without actually counting them. While this method is just an approximation of the actual number of entries in a directory, it is good enough to quickly scan for offending directories.

Demo

Caveats

  • requires r/w privileges for an each filesystem being tested, it will also create a temporary directory with a lot of temporary files which are cleaned up afterwards
  • accurate mode (-a) can cause an excessive I/O and an excessive memory use; only use when appropriate

Usage

Usage: findlargedir [OPTIONS] <PATH>...

Arguments:
  <PATH>...  Paths to check for large directories

Options:
  -a, --accurate <ACCURATE>
          Perform accurate directory entry counting [default: false] [possible values: true, false]
  -o, --one-filesystem <ONE_FILESYSTEM>
          Do not cross mount points [default: true] [possible values: true, false]
  -c, --calibration-count <CALIBRATION_COUNT>
          Calibration directory file count [default: 100000]
  -A, --alert-threshold <ALERT_THRESHOLD>
          Alert threshold count (print the estimate) [default: 10000]
  -B, --blacklist-threshold <BLACKLIST_THRESHOLD>
          Blacklist threshold count (print the estimate and stop deeper scan) [default: 100000]
  -x, --threads <THREADS>
          Number of threads to use when calibrating and scanning [default: 24]
  -p, --updates <UPDATES>
          Seconds between status updates, set to 0 to disable [default: 20]
  -i, --size-inode-ratio <SIZE_INODE_RATIO>
          Skip calibration and provide directory entry to inode size ratio (typically ~21-32) [default: 0]
  -t, --calibration-path <CALIBRATION_PATH>
          Custom calibration directory path
  -s, --skip-path <SKIP_PATH>
          Directories to exclude from scanning
  -h, --help
          Print help information
  -V, --version
          Print version information

When using accurate mode (-a parameter) beware that large directory lookups will stall the process completely for extended periods of time. What this mode does is basically a secondary fully accurate pass on a possibly offending directory calculating exact number of entries.

To avoid descending into mounted filesystems (as in find -xdev option), parameter one-filesystem mode (-o parameter) is toggled by default, but it can be disabled if necessary.

It is possible to completely skip calibration phase by manually providing directory inode size to number of entries ratio with -i parameter. It makes sense only when you already know the ratio, for example from previous runs.

Setting -p paramter to 0 will stop program from giving occasional status updates.

Benchmarks

Findlargedir vs GNU find

Mid-range server / mechanical storage

Hardware: 8-core Xeon E5-1630 with 4-drive SATA RAID-10

Benchmark setup:

$ cat bench1.sh
#!/bin/dash
exec /usr/bin/find / -xdev -type d -size +200000c

$ cat bench2.sh
#!/bin/dash
exec /usr/local/sbin/findlargedir /

Actual results measured with hyperfine:

$ hyperfine --prepare 'echo 3 | tee /proc/sys/vm/drop_caches' \
  ./bench1.sh ./bench2.sh

Benchmark 1: ./bench1.sh
  Time (mean ± σ):     223.890 s ±  1.614 s    [User: 1.820 s, System: 10.873 s]
  Range (min … max):   221.917 s … 226.496 s    10 runs

Benchmark 2: ./bench2.sh
  Time (mean ± σ):     127.142 s ±  1.712 s    [User: 48.812 s, System: 100.069 s]
  Range (min … max):   124.158 s … 129.316 s    10 runs

Summary
  './bench2.sh' ran
    1.76 ± 0.03 times faster than './bench1.sh'

High-end server / SSD storage

Hardware: 48-core Xeon Silver 4214, 7-drive SM883 SATA HW RAID-5 array, 2TB content (dozen of containers with small files)

Same benchmark setup. Results:

$ hyperfine --prepare 'echo 3 | tee /proc/sys/vm/drop_caches' \
  ./bench1.sh ./bench2.sh

Benchmark 1: ./bench1.sh
  Time (mean ± σ):     397.769 s ±  0.946 s    [User: 16.870 s, System: 86.359 s]
  Range (min … max):   396.341 s … 399.280 s    10 runs

Benchmark 2: ./bench2.sh
  Time (mean ± σ):     88.763 s ±  0.412 s    [User: 445.974 s, System: 2033.375 s]
  Range (min … max):   88.284 s … 89.428 s    10 runs

Summary
  './bench2.sh' ran
    4.48 ± 0.02 times faster than './bench1.sh'
You might also like...
A cli tool to download specific GitHub directories or files

cloneit A cli tool to download specific GitHub directories or files. Installation From git git clone https://github.com/alok8bb/cloneit cd cloneit car

Voila is a domain-specific language launched through CLI tool for operating with files and directories in massive amounts in a fast & reliable way.

Voila is a domain-specific language designed for doing complex operations to folders & files. It is based on a CLI tool, although you can write your V

RnR is a command-line tool to securely rename multiple files and directories that supports regular expressions
RnR is a command-line tool to securely rename multiple files and directories that supports regular expressions

RnR is a command-line tool to securely rename multiple files and directories that supports regular expressions. Features Batch rename files and direct

Tool for managing dotfiles directories; Heavily based on rcm.

Paro paro : to prepare, get ready / set, put / furnish, supply. Tool for managing dotfiles directories; Heavily based on rcm. TODO Rust Boilerplate CI

Semi-persistent, scoped test directories

Semi-persistent, scoped test directories This crate aims to make it easier to use temporary directories in tests, primarily by calling the testdir!()

Wraps cargo to move target directories to a central location [super experimental]

targo Wraps cargo to move target directories to a central location [super experimental] To use, cargo install --git https://github.com/sunshowers/targ

Temporary files and directories with UTF-8 paths.

camino-tempfile A secure, cross-platform, temporary file library for Rust with UTF-8 paths. This crate is a wrapper around tempfile that works with th

Bookmark directories for easy directory-hopping in the terminal
Bookmark directories for easy directory-hopping in the terminal

markd Bookmark directories for easy directory-hopping in the terminal. All it takes is one command markd to bookmark your current directory, or use th

Utility to display a tree view of directories.

TreeCraft v0.2.3 (16 October 2023) TreeCraft is a command-line utility written in pure Rust that helps you visualize directory structures in ASCII for

Releases(0.3.6)
Owner
Dinko Korunic
If I had 8 hours to chop down a tree, I would spend 6 of those hours sharpening my axe.
Dinko Korunic
Sniffer - a tool to quickly inspect csv and flat-file files for basic information

sniffer sniffer is a tool to quickly inspect csv and flat-file files for basic information. Need to see how many rows are in a csv file? Want to see t

Daniel B 10 Apr 4, 2023
fas stand for Find all stuff and it's a go app that simplify the find command and allow you to easily search everything you nedd

fas fas stands for Find all stuff and it's a rust app that simplify the find command and allow you to easily search everything you need. Note: current

M4jrT0m 1 Dec 24, 2021
Find and clean heavy build or cache directories.

ProjClean Find and clean heavy build or cache directories. ProjClean finds directories such as node_modules(node), target(rust), build(java) and their

null 42 Sep 25, 2022
Yfin is the Official package manager for the Y-flat programming language

Yfin is the Official package manager for the Y-flat programming language. Yfin allows the user to install, upgrade, and uninstall packages. It also allows a user to initialize a package with the Y-flat package structure and files automatically generated. In future, Yfin will also allow users to publish packages.

Jake Roggenbuck 0 Mar 3, 2022
Extension to `thiserror` that helps reduce the amount of handwriting

justerror This macro piggybacks on thiserror crate and is supposed to reduce the amount of handwriting when you want errors in your app to be describe

ShakaCode 2 Nov 16, 2022
A small program which makes a rofi game launcher menu possible by creating .desktop entries for games

rofi-games A small program which makes a `rofi` game launcher menu possible by creating `.desktop` entries for games Installation Manual Clone repo: g

Rolv Apneseth 20 May 4, 2023
locdev is a handy CLI tool that simplifies the process of adding, removing, and listing entries in the hosts file.

locdev ??️ locdev is a handy CLI tool that simplifies the process of adding, removing, and listing entries in the hosts file. You no longer need to de

Nick Rempel 20 Jun 5, 2023
Tool that was built quickly for personal needs, you might find it useful though

COPYCAT Produced with stable-diffusion Clipboard (copy/paste) history buffer for terminal emulators, MAC OS gui and VIM* environment usage. Rrequireme

Dragan Jovanović 4 Dec 9, 2022
ddi is a wrapper for dd. It takes all the same arguments, and all it really does is call dd in the background

ddi A safer dd Introduction If you ever used dd, the GNU coreutil that lets you copy data from one file to another, then you may have encountered a ty

Tomás Ralph 80 Sep 8, 2022
Scan the symbols of all ELF binaries in all Arch Linux packages for usage of malloc_usable_size

Scan the symbols of all ELF binaries in all Arch Linux packages for usage of malloc_usable_size (-D_FORTIFY_SOURCE=3 compatibility)

null 3 Sep 9, 2023