Minimalist pedantic command line parser

Jan Verbeek

Last update: Dec 28, 2022

Overview

Lexopt

Lexopt is an argument parser for Rust. It tries to have the simplest possible design that's still correct. It's so simple that it's a bit tedious to use.

Lexopt is:

Small: one file, no dependencies, no macros. Easy to audit or vendor.
Correct: standard conventions are supported and ambiguity is avoided. Tested and fuzzed.
Pedantic: arguments are returned as OsStrings, forcing you to convert them explicitly. This lets you handle badly-encoded filenames.
Imperative: options are returned as they are found, nothing is declared ahead of time.
Annoyingly minimalist: only the barest necessities are provided.
Unhelpful: there is no help generation and error messages often lack context.

Example

struct Args {
    thing: String,
    number: u32,
    shout: bool,
}

fn parse_args() -> Result<Args, lexopt::Error> {
    use lexopt::prelude::*;

    let mut thing = None;
    let mut number = 1;
    let mut shout = false;
    let mut parser = lexopt::Parser::from_env();
    while let Some(arg) = parser.next()? {
        match arg {
            Short('n') | Long("number") => {
                number = parser.value()?.parse()?;
            }
            Long("shout") => {
                shout = true;
            }
            Value(val) if thing.is_none() => {
                thing = Some(val.into_string()?);
            }
            Long("help") => {
                println!("Usage: hello [-n|--number=NUM] [--shout] THING");
                std::process::exit(0);
            }
            _ => return Err(arg.unexpected()),
        }
    }

    Ok(Args {
        thing: thing.ok_or("missing argument THING")?,
        number,
        shout,
    })
}

fn main() -> Result<(), lexopt::Error> {
    let args = parse_args()?;
    let mut message = format!("Hello {}", args.thing);
    if args.shout {
        message = message.to_uppercase();
    }
    for _ in 0..args.number {
        println!("{}", message);
    }
    Ok(())
}

Let's walk through this:

We start parsing with Parser::from_env().
We call parser.next() in a loop to get all the arguments until they run out.
We match on arguments. Short and Long indicate an option.
To get the value that belongs to an option (like 10 in -n 10) we call parser.value().
- This returns a standard OsString.
- For convenience, use lexopt::prelude::* adds a .parse() method, analogous to str::parse.
Value indicates a free-standing argument. In this case, a filename.
- if thing.is_none() is a useful pattern for positional arguments. If we already found thing we pass it on to another case.
- It also contains an OsString.
  - The standard .into_string() method can decode it into a plain String.
If we don't know what to do with an argument we use return Err(arg.unexpected()) to turn it into an error message.
Strings can be promoted to errors for custom error messages.

This covers almost all the functionality in the library. Lexopt does very little for you.
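The string-to-error promotion used above (thing.ok_or("missing argument THING")?) has a close analogue in the standard library, which may help explain why it works. The sketch below uses only std, not lexopt: Box<dyn Error> also converts from a string slice, so the ? operator promotes the message in the same way. The helper name require_thing is hypothetical.

```rust
use std::error::Error;

// Hypothetical helper: turn a missing positional argument into an error.
// `Box<dyn Error>` implements `From<&str>`, so `?` promotes the string
// literal into the error type, much like the lexopt example does.
fn require_thing(thing: Option<String>) -> Result<String, Box<dyn Error>> {
    Ok(thing.ok_or("missing argument THING")?)
}
```

Calling require_thing(None) yields an error whose Display output is the original message.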

For a larger example with useful patterns, see examples/cargo.rs.

Command line syntax

The following conventions are supported:

Short options (-q)
Long options (--verbose)
-- to mark the end of options
= to separate long options from values (--option=value)
Spaces to separate options from values (--option value, -f value)
Unseparated short options (-fvalue)
Combined short options (-abc to mean -a -b -c)

These are not supported:

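-f=value for short options (not in 0.1.0; the v0.2.0 release notes listed further down add = as a separator for short options)
Options with optional arguments (like GNU sed's -i, which can be used standalone or as -iSUFFIX; the v0.2.0 release notes add Parser::optional_value() for this case)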
Single-dash long options (like find's -name)
Abbreviated long options (GNU's getopt lets you write --num instead of --number if it can be expanded unambiguously)
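The supported conventions can be sketched as a std-only classifier for a single argument. This is an illustration, not lexopt's implementation: Token and classify are hypothetical names, and the sketch works on &str rather than OsString. It also cannot tell -fvalue from -abc, because that depends on whether -f takes a value; lexopt leaves that decision to the caller, who asks for the value with parser.value().

```rust
// Hypothetical token type for this sketch (lexopt's real type is `Arg`).
#[derive(Debug, PartialEq)]
enum Token {
    Short(char),
    // Option name, plus the value after `=` if one was attached.
    Long(String, Option<String>),
    EndOfOptions,
    Value(String),
}

// Classify one argument according to the conventions listed above.
fn classify(arg: &str) -> Vec<Token> {
    if arg == "--" {
        vec![Token::EndOfOptions]
    } else if let Some(rest) = arg.strip_prefix("--") {
        match rest.split_once('=') {
            Some((name, value)) => {
                vec![Token::Long(name.to_string(), Some(value.to_string()))]
            }
            None => vec![Token::Long(rest.to_string(), None)],
        }
    } else if arg.len() > 1 && arg.starts_with('-') {
        // Treat every character as a combined short option (-abc).
        arg[1..].chars().map(Token::Short).collect()
    } else {
        // Plain values, including a lone "-".
        vec![Token::Value(arg.to_string())]
    }
}
```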

Unicode

This library supports unicode while tolerating non-unicode arguments.

Short options may be unicode, but only a single codepoint. (If you need whole grapheme clusters you can use a long option. If you need normalization you're on your own, but it can be done.)

Options can be combined with non-unicode arguments. That is, --option=�� will not cause an error or mangle the value. This is surprisingly tricky to support: see os_str_bytes.

Options themselves are patched as by String::from_utf8_lossy if they're not valid unicode. That typically means you'll raise an error later when they're not recognized.
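Two of the points above can be checked with the standard library alone. The following sketch shows what "patched as by String::from_utf8_lossy" means for an invalid option name, and why a single codepoint is not the same as a single visible character. Both function names are hypothetical.

```rust
// Hypothetical helper: invalid bytes are replaced by U+FFFD, so the
// patched name will normally fail to match any known option later on.
fn patch_option_name(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}

// A short option is one `char` (one codepoint), not one grapheme cluster.
fn codepoints(s: &str) -> usize {
    s.chars().count()
}
```

For example, a precomposed é is one codepoint and could be a short option, while e followed by a combining acute accent is two codepoints.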

Why?

For a particular application I was looking for a small parser that's pedantically correct. There are other compact argument parsing libraries, but I couldn't find one that handled OsStrings and implemented all the fiddly details of the argument syntax faithfully.

This library may also be useful if a lot of control is desired, like when the exact argument order matters or not all options are known ahead of time. It could be considered more of a lexer than a parser.

Why not?

This library may not be worth using if:

You don't care about non-unicode arguments
You don't care about exact compliance and correctness
You don't care about code size
You do care about great error messages
You hate boilerplate

Other argument parsers worth considering:

clap/structopt: very fully-featured. The only other argument parser for Rust I know of that truly handles invalid unicode properly, if used right. Large.
argh and gumdrop: much leaner, yet still convenient and powerful enough for most purposes. Panic on invalid unicode.
- argh adheres to the Fuchsia specification and therefore does not support --option=value and -ovalue, only --option value and -o value.
pico-args: slightly smaller than lexopt and easier to use (but less rigorous).
ap: I have not used this, but it seems to support iterative parsing while being less bare-bones than lexopt.
libc's getopt.

pico-args has a nifty table with build times and code sizes for different parsers. I've rerun the tests and added lexopt (using the program in examples/pico_test_app.rs):

                        null  lexopt   pico-args  clap      gumdrop  structopt  argh
Binary overhead         0KiB  14.5KiB  13.5KiB    372.8KiB  17.7KiB  371.2KiB   16.8KiB
Build time              0.9s  1.7s     1.6s       13.0s     7.5s     17.0s      7.5s
Number of dependencies  0     0        0          8         4        19         6
Tested version          -     0.1.0    0.4.2      2.33.3    0.8.0    0.3.22     0.1.4

(Tests were run on x86_64 Linux with Rust 1.53 and cargo-bloat 0.10.1.)

Comments

How to get remaining arguments

To implement external sub-commands, e.g. git foo --bar --baz, I need to invoke a binary (git-foo in this case) with the arguments --bar --baz.

I can't figure out, though, how to get the remaining arguments as plain strings rather than as Long/Short/Value. Any ideas how I might achieve that?
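The v0.2.1 release notes further down mention Parser::raw_args() for collecting raw unparsed arguments, which looks like the intended answer. Independently of lexopt, the dispatch step itself can be sketched with std only. The function name external_command and the "prefix-subcommand" naming scheme are assumptions for illustration; the iterator of remaining arguments would come from whatever the parser hands back.

```rust
use std::ffi::OsString;

// Hypothetical helper: given the arguments that remain after the main
// program's own options, build the external binary name ("git-foo") and
// return the untouched arguments to pass along to it.
fn external_command(
    prefix: &str,
    rest: impl IntoIterator<Item = OsString>,
) -> Option<(OsString, Vec<OsString>)> {
    let mut rest = rest.into_iter();
    let subcommand = rest.next()?; // no subcommand given
    let mut binary = OsString::from(prefix);
    binary.push("-");
    binary.push(&subcommand);
    Some((binary, rest.collect()))
}
```

The result could then be handed to std::process::Command::new(binary).args(rest).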

opened by cloudhead 22

Include the binary's name when calling `from_args()`

from_env() takes the executable's name from the argv, but from_args() expects an argv with the name removed. This discrepancy causes a bit of trouble:

this is slightly unexpected (although documented)
if one uses from_args(), then bin_name() becomes useless (always returns None)

I think it'd be simpler for everyone involved if from_args() took the same input as from_env() does, i.e. an argv with the executable's name.

opened by Minoru 12

The position() function can be tricky to use in while-let loops

Consider this code:

fn find_first_long_value_with_position() -> Result<Option<(usize, String)>, Error> {
    let mut p = parse("-s --long");
    while let Some(arg) = p.next()? {
        match arg {
            Long(val) => return Ok(Some((p.position(), val.to_owned()))),
            _ => (),
        }
    }
    Ok(None)
}

Unfortunately, this code does not compile due to a double borrow:

|
|   while let Some(arg) = p.next()? {
|                         - mutable borrow occurs here
|       match arg {
|           Long(val) => return Ok(Some((p.position(), val.to_owned()))),
|                                        ^             --- mutable borrow later used here
|                                        |
|                                        immutable borrow occurs here

There are several ways around this; one of them is to simply swap the order of the tuple:

fn find_first_long_value_with_position() -> Result<Option<(String, usize)>, Error> {
    let mut p = parse("-s --long");
    while let Some(arg) = p.next()? {
        match arg {
            Long(val) => return Ok(Some((val.to_owned(), p.position()))),
            _ => (),
        }
    }
    Ok(None)
}

More generally, we just need to ensure that val's non-lexical lifetime has ended before the call to p.position(), so even with the original tuple ordering, we could do something like:

match arg {
    Long(val) => {
        let val = val.to_owned();
        return Ok(Some((p.position(), val)));
    }
    _ => (),
}
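The workaround can be reproduced without lexopt. The sketch below uses a hypothetical MockParser whose next() hands out an Arg that borrows from the parser, which is the same shape as the real API, and shows that copying the borrowed name first lets position() be called afterwards. None of these types are lexopt's own.

```rust
// Stand-in for lexopt's Arg: `Long` borrows its name from the parser.
enum Arg<'a> {
    Long(&'a str),
    Other,
}

// Hypothetical mock, just enough state to reproduce the borrow.
struct MockParser {
    args: Vec<String>,
    index: usize,
}

impl MockParser {
    fn next(&mut self) -> Option<Arg<'_>> {
        let arg = self.args.get(self.index)?;
        self.index += 1;
        Some(match arg.strip_prefix("--") {
            Some(name) => Arg::Long(name),
            None => Arg::Other,
        })
    }

    // Number of arguments consumed so far (the first argument is 1).
    fn position(&self) -> usize {
        self.index
    }
}

fn first_long_with_position(p: &mut MockParser) -> Option<(usize, String)> {
    while let Some(arg) = p.next() {
        if let Arg::Long(name) = arg {
            // Copy the name first; the mutable borrow held by `arg` ends here.
            let name = name.to_owned();
            return Some((p.position(), name));
        }
    }
    None
}
```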

I don't think there is a clean way around this while still exposing position() as a public function that takes &self as an argument. After all, even bin_name() has the same difficulties, but I don't think anyone will be generating tuples of arguments with the bin name.

In my opinion, there are several paths forward:

Revert the position() functionality that was implemented in #2. This would make me a bit sad, because I'd really like to see this functionality available in this library, but I would understand.
Leave things the way they are, and thoroughly document how to properly navigate the borrowing nuances. The current functionality is correct and fully usable. It just has some pain points with the borrow checker if you're not careful.
Make the position() function private (or remove it), and change the definition of Arg to always include the position:

pub enum Arg<'a> {
    /// A short option, e.g. `-q`.
    Short(usize, char),
    /// A long option, e.g. `--verbose`. (The dashes are not included.)
    Long(usize, &'a str),
    /// A positional argument, e.g. `/dev/null`.
    Value(usize, OsString),
}

Users who don't care about the position could just ignore it in the match:

match arg {
    Short(_, val) => unimplemented!(),
    Long(_, val) => unimplemented!(),
    Value(_, val) => unimplemented!(),
}

Make the position() function private (or remove it), and make the position data member an Rc<Cell<Option<usize>>>. Then handle the complexity internally, providing two ways to get the next Arg: one that also provides the position, and one that does not.

We could add a function like the following:

pub fn next_enumerated(&mut self) -> Result<Option<(usize, Arg<'_>)>, Error> {
    let pos = self.position.clone(); // Just cloning the Rc
    self.next().map(|option| option.map(|arg| (pos.get().expect("argument count overflow"), arg)))
}

This is more in line with my original suggestion of creating an EnumeratedParser, except we're only adding one new function.

I like option 4) the most because it supports the position functionality without altering the existing functionality, and without user-facing borrow-checker complexity.

Importantly, this is still a breaking change, and so is adding the position() function in the first place. Initially, I thought it wasn't, but I realized that it is after giving it more thought. (See example playground.) I understand if this might change your opinion on the matter, possibly even toward option 1.

For more context, my personal use case for this function is something more like this:

pub struct Token;

// Clone trait bound is just to prevent conflict with the blanket
// implementation of impl<T> From<T> for T
impl<T: Clone> From<T> for Token {
    fn from(_: T) -> Self {
        Self {}
    }
}

#[derive(Debug)]
pub struct EnumeratedToken {
    position: usize,
    token: Token,
}

impl EnumeratedToken {
    fn new(position: usize, token: Token) -> Self {
        Self { position, token }
    }
}

fn lex_enumerated_tokens() -> Result<Vec<EnumeratedToken>, Error> {
    let mut p = lexopt::Parser::from_env();
    let mut tokens = Vec::with_capacity(std::env::args_os().len());
    while let Some((position, arg)) = p.next_enumerated()? {
        match arg {
            Short(val) => tokens.push(EnumeratedToken::new(position, val.try_into()?)),
            Long(val) => tokens.push(EnumeratedToken::new(position, val.try_into()?)),
            Value(val) => tokens.push(EnumeratedToken::new(position, val.try_into()?)),
        }
    }
    Ok(tokens)
}

I also want to note that whatever decision we come to (if it requires implementing new functionality), I'm happy to submit a pull request or do some code review, if you're open to such contributions.

Thanks again for your hard work on this!

opened by nordzilla 7

ability to treat all remaining arguments as free arguments

Hi, really nice library.

I would like to stop parsing options when I encounter a "free" argument (an Arg::Value) and treat all subsequent arguments as "free" arguments, regardless of whether they contain option flags. Ideally, I would be able to call something like parser.finish() (fn finish(self) -> Vec<OsString>) to get the remaining arguments. I didn't see an obvious way to do it with the current API.

opened by mllken 5

Parser's value() method and dashes

The docs for Parser's value() method state:

"A value is collected even if it looks like an option (i.e., starts with -)."

This causes friction when parsing an option that takes a value. e.g. say I have a

usage: prog [-v] [-o file]

handling the -o case:

Short('o') => filename = parser.value()?,

If the user runs "prog -o -v", then '-v' becomes the filename, without any error. I would expect calling .value() to return an Err if the value starts with '-'.

I think returning an error would make it more consistent with Parser's values() method, which returns an error if the first value contains a dash. And also with Parser's next() method, which cannot return a Value(value), where value contains a dash.

Could we return an error if the value returned by parser.value() starts with a '-'? Or let me know if I'm missing something here. Thanks.
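Until the library decides either way, the check can live on the caller's side. The sketch below is a hypothetical std-only helper that would wrap the OsString returned by parser.value(); allowing a lone "-" (often used to mean stdin or stdout) is an assumption of this sketch, not something the issue asks for.

```rust
use std::ffi::OsString;

// Hypothetical caller-side check: refuse values that look like options.
fn reject_option_like(value: OsString) -> Result<OsString, String> {
    // Assumption: a lone "-" is accepted as an ordinary value.
    if value.len() > 1 && value.to_string_lossy().starts_with('-') {
        Err(format!("expected a value, found option-like argument {:?}", value))
    } else {
        Ok(value)
    }
}
```

With this in place, prog -o -v would be reported as an error instead of silently writing to a file named -v.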
opened by mllken 4

Keeping track of argument position

I asked on reddit about the possibility of enumerating the arguments that are parsed by lexopt.

As @blyxxyz pointed out, this is possible to do with the current implementation (0.1.0), but the approach is not recommended.

@blyxxyz also suggested that a potential option moving forward could be to add a method to the parser that the user could call at any time to retrieve the position of the most recently parsed argument.

Parser::position(&self) -> u64

I think this is a great idea: a minimally invasive, non-breaking change to the existing implementation that also keeps the API very minimalist and correct.

I believe that the position returned for the first-parsed argument would have to be 1, since the zero-index argument as a whole is technically the binary name. If someone were to call position() before calling next(), then 0 is the only logical choice for this scenario.

I would expect the functionality to be something like this:

let mut p = parse("-a -b -c value");
assert_eq!(p.position(), 0);
assert_eq!(p.next()?.unwrap(), Short('a'));
assert_eq!(p.position(), 1);
assert_eq!(p.next()?.unwrap(), Short('b'));
assert_eq!(p.position(), 2);
assert_eq!(p.next()?.unwrap(), Short('c'));
assert_eq!(p.position(), 3);
assert_eq!(p.value()?, "value");
assert_eq!(p.position(), 4);

let mut p = parse("-ab -cvalue");
assert_eq!(p.position(), 0);
assert_eq!(p.next()?.unwrap(), Short('a'));
assert_eq!(p.position(), 1);
assert_eq!(p.next()?.unwrap(), Short('b'));
assert_eq!(p.position(), 1);
assert_eq!(p.next()?.unwrap(), Short('c'));
assert_eq!(p.position(), 2);
assert_eq!(p.value()?, "value");
assert_eq!(p.position(), 2);
opened by nordzilla 3

Support = as a separator for short options

Resolves #8. @theIDinside, maybe you'd like to take a look?

I made these choices:

If = is found when no value is expected it's just treated as a short option. Clap also does this. (I wanted to make it an error first, but didn't know what to do with -=.) (ETA: after this PR I decided to turn it into an error after all, except in the -= case, where it stays a short option.)

= is allowed for combined short options, like -xo=value. Clap does this, argparse doesn't.

It has to be at most a single =. Clap allows multiple, like -o========value.
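The "at most a single =" rule can be illustrated with a one-line std-only helper. This is a reading of the description above rather than the code from the pull request, and short_option_value is a hypothetical name: rest stands for whatever follows the option character inside one argument.

```rust
// `rest` is the text after the option character, e.g. "=value" for
// `-o=value` or "value" for `-ovalue`. Only one leading '=' is treated
// as a separator; any further '=' belongs to the value.
fn short_option_value(rest: &str) -> &str {
    rest.strip_prefix('=').unwrap_or(rest)
}
```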
opened by blyxxyz 2

Short options should be able to take value with format -o=v

It would be nice to be able to write -o=some_value, to adhere more closely to standards and to other libraries and crates, and so make switching from "heavier" libraries to this one a much more pleasant experience. For those who want faster compilation and much less bloat, this would make the move much easier; without it, the move is much harder to justify.

opened by theIDinside 2

Why does `bin_name()` return `str` rather than `OsString`?

lexopt tries not to convert strings unless absolutely necessary, but bin_name() deviates from that. The source code doesn't explain why. If bin_name() returned OsString, that'd be more consistent with other methods.

opened by Minoru 2

Retrieve argument count from the Parser

There is currently no way to tell the total length of arguments from the lexopt::Parser.

Consider this scenario:

pub struct Token;

// Clone trait bound is just to prevent conflict with the blanket
// implementation of impl<T> From<T> for T
impl<T: Clone> From<T> for Token {
    fn from(_: T) -> Self {
        Self {}
    }
}

fn lex_tokens() -> Result<Vec<Token>, Error> {
    let mut tokens = Vec::new();
    let mut p = parse("-a -bc value --long");
    while let Some(arg) = p.next()? {
        match arg {
            Short(val) => tokens.push(val.try_into()?),
            Long(val) => tokens.push(val.try_into()?),
            Value(val) => tokens.push(val.try_into()?),
        }
    }
    Ok(tokens)
}

It would be convenient if the parser were able to retrieve the argument count before the loop, so that we can initialize the Vec with a starting capacity that we know will hold all of the tokens:

fn lex_tokens() -> Result<Vec<Token>, Error> {
    let mut p = parse("-a -bc value --long");
    let mut tokens = Vec::with_capacity(p.arg_count());
    while let Some(arg) = p.next()? {
        match arg {
            Short(val) => tokens.push(val.try_into()?),
            Long(val) => tokens.push(val.try_into()?),
            Value(val) => tokens.push(val.try_into()?),
        }
    }
    Ok(tokens)
}

I don't think there's a workaround for from_args(), but with the from_env() case, users can just call:

Vec::with_capacity(std::env::args_os().len())

which honestly is not that bad, but it is unfortunate to have to use std::env::args_os() outside of the Parser. Of course, there is still no guarantee that all the tokens will fit, even with setting the capacity, since combined short options may produce multiple tokens from the same argument. But it's still nice to get an estimate.

arg_count() could be as simple as

pub fn arg_count(&self) -> usize {
    self.source.len()
}

But there would be some tradeoffs to consider:

source would have to become a dyn ExactSizeIterator

    source: Box<dyn ExactSizeIterator<Item = OsString> + 'static>,

from_args() would have to ensure that its IntoIter type is exact size

pub fn from_args<I>(args: I) -> Parser
where
    I: IntoIterator + 'static,
    I::Item: Into<OsString>,
    <I as IntoIterator>::IntoIter: ExactSizeIterator,
{

This would be a breaking change; your parse() helper function for your tests would have to collect into something with an exact size first, because a SplitWhitespace is lazily evaluated.

fn parse(args: &'static str) -> Parser {
    Parser::from_args(args.split_whitespace().map(bad_string).collect::<Vec<_>>())
}

One benefit of this is that all iterators would be capped out at usize's max value, so overflow would no longer be an issue for the position functionality.
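The tradeoff is easier to see in isolation. The following std-only sketch (preallocate is a hypothetical name, unrelated to lexopt's internals) accepts only iterators that know their length up front, which is what the proposed bound on from_args() would require. A lazily evaluated iterator such as SplitWhitespace does not satisfy the bound and would have to be collected first.

```rust
use std::ffi::OsString;

// Hypothetical helper with the same bounds the proposal puts on from_args().
fn preallocate<I>(args: I) -> Vec<OsString>
where
    I: IntoIterator,
    I::Item: Into<OsString>,
    I::IntoIter: ExactSizeIterator,
{
    let iter = args.into_iter();
    // The exact length is known before consuming anything.
    let mut out = Vec::with_capacity(iter.len());
    out.extend(iter.map(Into::into));
    out
}
```

Passing a Vec or an array works; passing "a b".split_whitespace() directly would be rejected at compile time.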

I will leave it up to you to determine if such a change would be in line with your vision for this project. I'm just trying to think from a perspective of reducing friction for users. Is it more convenient to be able to get the arg count natively from the API, or is it more convenient to parse from iterators of unknown size? I don't know, but I felt like asking the question.

Anyhow, feel free to close this issue if this is not in line with your vision, and at least the decision will be documented here. If you keep the current behavior, it may be worth documenting how to get the size separately from std::env::args_os().

opened by nordzilla 2

Incorrect "missing argument at end of command" error

This code:

fn main() -> Result<(), lexopt::Error> {
    let mut parser = lexopt::Parser::from_env();
    parser.values()?;
    Ok(())
}

Has this behavior:

$ cargo run -- -a b
Error: missing argument at end of command

This is Error::MissingValue's message if Parser doesn't remember an option. In 0.1.0 it was only returned by Parser::value(), at the end of the command line, but Parser::values() may return it if it encounters an option.

An easy fix would be to remove "at end of command" from the message, but maybe there's a better way out.
opened by blyxxyz 1

Getting `RawArgs` without advancing the iterator

Hi! I'm writing a library on top of lexopt and I'd like to peek the next argument without advancing the optional value, because I need to support some weird arguments that don't follow the usual rules.

Essentially I'd like a method like this, which gives me the RawArgs if there are no more shorts or long values to take:

pub fn try_raw_args(&mut self) -> Option<RawArgs<'_>> {
    if self.long_value.is_some() || self.shorts.is_some() {
        return None;
    }
    Some(RawArgs(&mut self.source))
}

With this method, I can make a loop like this, where the check for the custom argument does not interfere with the normal parsing:

loop {
    // Check the weird custom argument
    if let Some(raw) = parser.try_raw_args() {
        ...
    }

    // Parse normal argument (next() returns a Result, hence the `?`)
    let Some(arg) = parser.next()? else { break; };
    ...
}
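The "look before consuming" idea has a direct counterpart in the standard library, which may clarify what the loop is meant to do. This sketch uses std::iter::Peekable over plain arguments rather than lexopt's RawArgs; the rule that custom arguments start with '+' is purely hypothetical.

```rust
use std::ffi::OsString;
use std::iter::Peekable;

// Hypothetical rule: arguments starting with '+' bypass normal parsing.
// The argument is consumed only if it matches; otherwise it stays in
// place for the regular parser to handle.
fn take_custom<I>(args: &mut Peekable<I>) -> Option<OsString>
where
    I: Iterator<Item = OsString>,
{
    args.next_if(|arg| arg.to_string_lossy().starts_with('+'))
}
```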
opened by tertsdiepraam 5

Please provide way to parse `-o=value` as option taking value `=value`

Hi. I saw this crate mentioned in a blog post and I like the idea. But there is a difficulty:

Argument parsers should be as transparent as possible. Currently, because lexopt supports -o=value to mean "set to value", to unparse an unknown string (like a user-provided filename) it is necessary to always pass the =.

(The situation with short options is different to long options: supporting --long=value is completely unambiguous and simply a good idea.)

IMO the = is unnatural here. I'm not aware of many other programs which treat -o=value as setting the option to value. Almost all (including for example all the POSIX utilities) treat it as setting the option to =value. See e.g. the Utility Conventions in the Single Unix Specification, or the getopt(3) manpage.

And as I point out, handling = specially for short option values is not a cost-free feature (unlike with long options): it changes the meaning of existing programs. A shell script which takes some arguments and runs a lexopt-using Rust program, and passes filenames to it, must write the = or risk malfunctioning on filenames which start with =. Because the = is unconventional with a short option, the author of such a script probably won't have written it, so the script will probably have this bug.
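The hazard can be made concrete with a small std-only sketch. The function names are hypothetical and the two parsing functions model the two readings described in this issue, not the actual code of getopt or lexopt.

```rust
// Everything after the leading '-' and the option character.
fn after_option(arg: &str) -> &str {
    let mut chars = arg.chars();
    chars.next(); // skip '-'
    chars.next(); // skip the option character
    chars.as_str()
}

// POSIX getopt reading: the value is everything after the option char.
fn value_posix(arg: &str) -> &str {
    after_option(arg)
}

// `=`-separator reading: one leading '=' is dropped from the value.
fn value_equals_rule(arg: &str) -> &str {
    let rest = after_option(arg);
    rest.strip_prefix('=').unwrap_or(rest)
}

// What a careful wrapper script has to emit under the `=` reading.
fn render_with_equals(opt: char, value: &str) -> String {
    format!("-{}={}", opt, value)
}
```

A script that passes the filename =name as -o=name gets =name back under the POSIX reading but only name under the = reading; always rendering with an explicit = (giving -o==name) is what restores the round trip.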

And, within an existing Rust project, switching from another option parsing library to lexopt is hazardous because it will change the meaning of command lines which are accepted by both.

Could you please provide an option to allow lexopt to be used without this = on short option feature? I'm not sure if that would involve a ParserBuilder or whether you could just make it a configuration method on Parser.

Personally I think the default ought to be to not handle = specially in short options but that would be a breaking change.

Thanks for your attention.

opened by ijackson 3

.into_string() does not work well with other error types

Some programs parse inside a function that returns an anyhow-style error type or a custom error type. This doesn't play well with OsString::into_string(): its error type (OsString) can convert to anyhow::Error but not to most other error types.

rres uses .into_string().unwrap().

minoru-fediverse-crawler uses .into_string().map_err(|ostr| anyhow!("{}", ostr.to_string_lossy()))?.

Another workaround would be .into_string().map_err(lexopt::Error::from)?.

Yet another is .parse()?. This performs a copy.

A ValueExt::string(self) -> Result<String, lexopt::Error> method would solve this. (There may be a better name.)
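As a rough idea of what such a method would do, here is a std-only sketch written as a free function with a custom error type. CliError and the function name string are stand-ins chosen for illustration; the proposed ValueExt::string would return lexopt::Error instead.

```rust
use std::ffi::OsString;
use std::fmt;

// Hypothetical application error type.
#[derive(Debug)]
struct CliError(String);

impl fmt::Display for CliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl std::error::Error for CliError {}

// Decode an argument, mapping the OsString "error" of into_string()
// to a proper error type so that `?` works in the caller.
fn string(value: OsString) -> Result<String, CliError> {
    value
        .into_string()
        .map_err(|bad| CliError(format!("invalid unicode in argument {:?}", bad)))
}
```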

Error's From<OsString> impl should then be kept, except perhaps in a release that already breaks most code for other reasons. But it should be removed from the README.
opened by blyxxyz 1

Arg borrowing from Parser is inconvenient

When creating new abstractions on top of lexopt (like subcommands) it's annoying that Arg borrows from Parser. It means you can't call .value() or pass the parser to another function until you've unpacked the Arg.

The borrowing is necessary to be able to match on Arg::Long in an ergonomic way. You can match a &str against a string literal but not a String. So the Parser owns the string and the Arg just borrows it.
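The matching limitation can be shown with std alone. In the sketch below (function names are hypothetical), a &str can be matched against string literals directly, while an owned String has to be borrowed as a &str first.

```rust
// Matching a borrowed name against string literals works directly.
fn describe(name: &str) -> &'static str {
    match name {
        "verbose" => "be chatty",
        "quiet" => "be silent",
        _ => "unknown option",
    }
}

// With an owned String, `match name { "verbose" => ... }` does not
// compile, because string literal patterns have type &str. The name has
// to be borrowed first, which is less ergonomic inside a larger pattern.
fn describe_owned(name: String) -> &'static str {
    describe(name.as_str())
}
```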

If it ever becomes possible to match a String against a string literal then it might be worth releasing a new backward-incompatible version of lexopt to take advantage of that.

In the meantime there's an ugly trick you can use to decouple the Arg from the Parser:

fn unleash_arg<'a, 'b>(
    storage: &'a mut Option<String>,
    arg: lexopt::Arg<'b>,
) -> lexopt::Arg<'a> {
    use lexopt::Arg::*;
    match arg {
        Long(opt) => {
            *storage = Some(opt.to_owned());
            Long(storage.as_deref().unwrap())
        }
        Short(opt) => Short(opt),
        Value(val) => Value(val),
    }
}

let mut storage = None;
let arg = unleash_arg(&mut storage, arg);

unleash_arg could be made into a method on Arg, but there might be a better solution.

(See also: #3)
opened by blyxxyz 2

Releases (v0.2.1)

v0.2.1 (Jul 10, 2022)

New:

Add Parser::raw_args() for collecting raw unparsed arguments. (#12)

Implement Debug for ValuesIter.

Bug fixes:

Change "missing argument at end of command" error message. (#11)

v0.2.0 (Oct 23, 2021)

While this release is not strictly backward-compatible, it should break very few programs.

New:

Add Parser::values() for options with multiple arguments.

Add Parser::optional_value() for options with optional arguments.

Add Parser::from_iter() to construct from an iterator that includes the binary name. (#5)

Document how to use Parser::value() to collect all remaining arguments.

Changes:

Support = as a separator for short options (as in -o=value). (#18)

Sanitize the binary name if it's invalid unicode instead of ignoring it.

Make Error::UnexpectedValue.option a String instead of an Option<String>.

Bug fixes:

Include bin_name in Parser's Debug output.
