A Rust implementation of the CSL-Next model.

Overview

Vision

This is a project to write the CSL-Next TypeScript model and supporting libraries and tools in Rust, and convert to JSON Schema from there.

At a high level, the vision of the project is to:

  1. Adapt what we've learned in almost 20 years of experience with CSL 1.0 to modern programming idioms and formats.
  2. Simplify the template part of the language, and put more, and extensible, logic in option groups, so it's easier to work with for users, style editors, and developers alike.
  3. Add new features while we're at it, like multi-lingual support, advanced dates and times, narrative citations, and so forth.
  4. Align code and schemas by generating the latter from the former, and so also provide a common meeting point for developers and domain experts.

More concretely, the goal is a suite of models, libraries and tools that make extremely performant advanced citation and bibliography processing available everywhere:

  • desktop and web
  • batch-processing for formats like pandoc markdown, djot, LaTeX, and org-mode
  • interactive real-time processing for GUI contexts like Zotero
  • easy-to-use style creation wizards, both command-line and web

Principles

For the Style model:

  1. Keep the template language as simple as possible, in the hope that we can keep it stable going forward, while still enabling innovation.
  2. Add new functionality primarily via option groups.

For the InputReference and Citation models:

  1. No string-parsing, with the sole exception of the EDTF date format, which is now ISO-standardized as an extension profile of ISO 8601, with well-defined parsing rules, and parsing libraries available in multiple languages.
  2. Provide structure where needed, but offer alternatives where not. EDTF is available for diverse date-time encoding, but date fields will fall back to a plain string. Likewise, the Contributor model offers similar flexibility, and power where needed.
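As a rough illustration of the "structure where needed, plain string otherwise" principle, here is a minimal sketch of a date field. The names `DateValue` and `parse_date_field` are hypothetical and not taken from this code base, and the parser only understands `YYYY`, `YYYY-MM`, and `YYYY-MM-DD`, a small subset of EDTF; a real implementation would delegate to an EDTF parsing library.

```rust
/// A date field: structured when the input parses, a plain string otherwise.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub enum DateValue {
    /// Year, optional month, optional day.
    Structured { year: i32, month: Option<u32>, day: Option<u32> },
    /// Anything that does not parse is kept verbatim.
    Literal(String),
}

/// Parse exactly two ASCII digits and check them against an inclusive range.
fn parse_two_digits(s: &str, min: u32, max: u32) -> Option<u32> {
    if s.len() != 2 || !s.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    let n: u32 = s.parse().ok()?;
    if (min..=max).contains(&n) { Some(n) } else { None }
}

/// Parse `YYYY`, `YYYY-MM`, or `YYYY-MM-DD`.
/// Note: the day is only range-checked (1..=31), not checked against the month.
fn parse_ymd(input: &str) -> Option<(i32, Option<u32>, Option<u32>)> {
    let mut parts = input.trim().split('-');
    let year_str = parts.next()?;
    if year_str.len() != 4 || !year_str.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    let year: i32 = year_str.parse().ok()?;
    let month = match parts.next() {
        Some(m) => Some(parse_two_digits(m, 1, 12)?),
        None => None,
    };
    let day = match parts.next() {
        Some(d) => Some(parse_two_digits(d, 1, 31)?),
        None => None,
    };
    // Anything after the day means this is not one of the supported forms.
    if parts.next().is_some() {
        return None;
    }
    Some((year, month, day))
}

/// Try the structured form first; fall back to the plain string.
pub fn parse_date_field(input: &str) -> DateValue {
    match parse_ymd(input) {
        Some((year, month, day)) => DateValue::Structured { year, month, day },
        None => DateValue::Literal(input.to_string()),
    }
}
```

With this shape, an input such as `circa 1850` is not an error; it is simply carried through as a literal.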

The model

Influences

  1. The CSL 1.0 specification options, and its template language (aka layout and rendering elements), most notably from names, dates, and other formatting.
  2. Patterns observed in the CSL 1.0 styles repository.
  3. The BibLaTeX preamble options.

Comparison to CSL 1.0 and BibLaTeX

To understand the difference between this model and CSL 1.0, look at style::options. There, you will note configuration options for many details that in CSL 1.0 are configured within the template language:

  • dates
  • contributors
  • substitution

Plus, I've added localization support as one such configuration option group, with the idea that it can be expanded more easily there than by burdening the template language with those details.

In that sense, this design is closer to BibLaTeX, which has a very long list of flat options that handle much of the configuration. Like that project, here we standardize on EDTF dates.

Project Organization

I've separated the code into discrete crates, with the intention to ultimately publish them.

I'm hoping to have demonstrated enough so far that this is a promising direction for the future of CSL, at least on the technical end, that folks might be willing to help build this out. Ideally, I want to develop one or both of these projects sufficiently to move them to the GitHub CSL org for further development and future maintenance. Doing so, however, will require sorting out details of how that process is managed and funded.

Contribution

I would love to have help on this, both because I'm an amateur programmer and a Rust newbie, and because the vision I am sketching out here will take a lot of work to realize.

Please contact me via discussions or the issue tracker, or by email, if you'd like to contribute.

I licensed the code here under the same terms as citeproc-rs, in case code might be shared between them. I also understand the Mozilla 2.0 license is compatible with Apache.

A note on citeproc-rs:

In reviewing the code, it strikes me that pieces of it obviously complement this code base. In particular, it has been optimized for the Zotero use-case, where it provides real-time formatting, while I have focused on the batch-processing case.

Comments
  • Refactor sort config

    Per discussion on a CSL 1.0 test, some styles require author sort keys to be shortened as they are for display.

    https://github.com/citation-style-language/test-suite/issues/60

    So a vector/list of sort structs isn't enough, or the sort struct needs another parameter.

    In looking through the style repo with the blunt instrument of ripgrep, here are some conclusions:

    1. by far the most common primary sorting key is author, which is subject to shortening for display, and to substitution when not present
    2. per the linked test, some styles want you to sort on the shortened list
    3. almost without exception, substitutions for author are editor, title, translator.
    4. two of those substitutions are contributor/names lists, so also subject to shortening

    I've already extracted substitution and name list shortening to top-level config options.

    So I think a small change like the following should work?

    sort:
      shorten_author: true
      specs:
        - key: author
        - key: issuedYear
          order: descending
    

    ... or even:

    sort:
      shorten_author: true
      keys:
        - author
        - issued.year-descending
    

    Here's how biblatex does it, which is similar to my last option: simpler, yet with more options (see section 3.1.2 in general):

    image

    But note that it has minsortnames and variants, which means it's not just a boolean controlling the linked case. From the manual:

    The first item considered in the sorting process is always the presort field of
    the entry. If this field is undefined, biblatex will use the default value ‘mm’ as
    a presort string. The next item considered is the sortkey field. If this field is
    defined, it serves as the master sort key. Apart from the presort field, no further
    data is considered in this case. If the sortkey field is undefined, sorting continues
    with the name. The package will try using the sortname, author, editor, and
    translator fields, in this order. Which fields are considered also depends on the
    setting of the use<name> options. If all such options are disabled, the sortname
    field is ignored as well. Note that all name fields are responsive to maxnames and
    minnames. If no name field is available, either because all of them are undefined
    or because all use<name> options are disabled, biblatex will fall back to the
    sorttitle and title fields as a last resort. The remaining items are, in various
    order: the sortyear field, if defined, or the first four digits of the year field
    otherwise; the sorttitle field, if defined, or the title field otherwise; the
    volume field.
    

    So, to do something like they did, it might look like:

    sort:
      rules: nty
      shorten:
        min: 4
        take: 2
    

    E.g. would need to allow shorten in multiple places.
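A minimal sketch of how a `shorten` setting could feed into the author sort key. All names here (`Shorten`, `Reference`, `author_sort_key`, `sort_references`) are hypothetical, and the semantics of `min` and `take` are an assumption: when a name list has at least `min` entries, only the first `take` are used for sorting.

```rust
/// Hypothetical shortening rule: if a list has at least `min` names,
/// only the first `take` names contribute to the sort key.
#[derive(Debug, Clone)]
pub struct Shorten {
    pub min: usize,
    pub take: usize,
}

/// Hypothetical, heavily simplified reference.
#[derive(Debug, Clone)]
pub struct Reference {
    pub authors: Vec<String>,
    pub year: i32,
}

/// Build the author part of the sort key, applying shortening if configured.
pub fn author_sort_key(authors: &[String], shorten: Option<&Shorten>) -> String {
    let kept = match shorten {
        Some(s) if authors.len() >= s.min => &authors[..s.take.min(authors.len())],
        _ => authors,
    };
    kept.iter()
        .map(|a| a.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Sort by (possibly shortened) author key, then by year ascending.
pub fn sort_references(refs: &mut [Reference], shorten: Option<&Shorten>) {
    refs.sort_by(|a, b| {
        author_sort_key(&a.authors, shorten)
            .cmp(&author_sort_key(&b.authors, shorten))
            .then_with(|| a.year.cmp(&b.year))
    });
}
```

The point of the sketch is that two entries whose full author lists differ can still tie on the shortened key, at which point the next key (here, the year) decides the order.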

    opened by bdarcus 2
  • Broken sorting

    With #50, I somehow broke sorting such that it reverses the order of processing.

    Debugging output suggests it's correct, but the results say otherwise.

    [Sort { key: Author, order: Ascending }, Sort { key: Year, order: Ascending }]
    a_author: brown, david, lee, jane
    

    Spent a long time trying to fix it, but gave up for now.
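For comparison, here is one way to apply a list of sort specs in a single pass, so that the first spec is the primary key and later specs only break ties. This is a sketch, not a diagnosis of the bug above: the `Sort`, `SortKey`, and `SortOrder` names mirror the debug output, but their definitions and the `Entry` type are assumptions.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy)]
pub enum SortKey {
    Author,
    Year,
}

#[derive(Debug, Clone, Copy)]
pub enum SortOrder {
    Ascending,
    Descending,
}

#[derive(Debug, Clone, Copy)]
pub struct Sort {
    pub key: SortKey,
    pub order: SortOrder,
}

/// Hypothetical, simplified entry.
#[derive(Debug, Clone)]
pub struct Entry {
    pub author: String,
    pub year: i32,
}

/// Compare two entries on a single spec, honoring its direction.
fn compare_by(spec: &Sort, a: &Entry, b: &Entry) -> Ordering {
    let ord = match spec.key {
        SortKey::Author => a.author.to_lowercase().cmp(&b.author.to_lowercase()),
        SortKey::Year => a.year.cmp(&b.year),
    };
    match spec.order {
        SortOrder::Ascending => ord,
        SortOrder::Descending => ord.reverse(),
    }
}

/// Sort once, with a comparator that consults the specs in order.
/// A later spec is only evaluated when all earlier ones compare equal.
pub fn sort_entries(entries: &mut [Entry], specs: &[Sort]) {
    entries.sort_by(|a, b| {
        specs
            .iter()
            .fold(Ordering::Equal, |acc, spec| acc.then_with(|| compare_by(spec, a, b)))
    });
}
```

An alternative is to run one stable sort per spec, but then the specs have to be applied in reverse (least significant first); applying them in declaration order makes the last spec the primary key.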

    opened by bdarcus 1
  • Contributor modeling

    I'm playing with this on the #50 branch, but it's tricky.

    We need to support:

    1. people and orgs
    2. sorting and displaying
    3. localized differences, including of 2, and of things like participles
    4. independent formatting of people name parts
    5. for it all ideally to be easy-to-use in the input format

    Right now it's just:

    author: ["Doe, Jane"]
    parse: true, # default
    

    One idea that just occurred to me, and that I need to think on more, is something like this:

    pub enum Contributor {
        SimpleName(Vec<String>),
        StructuredName(Vec<Vec<String>>),
    }
    

    So then the first option would be as now, and assume sort and display are the same.

    author: ["Mao Zedong"]
    

    The second would add more structure; at minimum:

    author: [ ["Doe", "Jane", "de"] ]
    

    So there, the sort string would be derived by joining the items in the list, for example, and display would reorder them.
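To make the idea concrete, here is a sketch that derives sort and display strings from the enum above. The method names and the assumed part order `[family, given, particle]` are illustrative only; the display order "given particle family" is one possible convention and would really need to be driven by style and locale options.

```rust
pub enum Contributor {
    /// Each name is a single string; sort and display forms are the same.
    SimpleName(Vec<String>),
    /// Each name is a list of parts, assumed here to be [family, given, particle].
    StructuredName(Vec<Vec<String>>),
}

/// Reorder the parts for display: given, then particle, then family.
/// Missing parts are skipped.
fn display_name(parts: &[String]) -> String {
    let ordered = [parts.get(1), parts.get(2), parts.first()];
    let pieces: Vec<&str> = ordered.iter().flatten().map(|s| s.as_str()).collect();
    pieces.join(" ")
}

impl Contributor {
    /// Sort strings: the simple form as-is, the structured form joined in input order.
    pub fn sort_strings(&self) -> Vec<String> {
        match self {
            Contributor::SimpleName(names) => names.clone(),
            Contributor::StructuredName(names) => {
                names.iter().map(|parts| parts.join(" ")).collect()
            }
        }
    }

    /// Display strings: the simple form as-is, the structured form reordered.
    pub fn display_strings(&self) -> Vec<String> {
        match self {
            Contributor::SimpleName(names) => names.clone(),
            Contributor::StructuredName(names) => {
                names.iter().map(|parts| display_name(parts)).collect()
            }
        }
    }
}
```

Under these assumptions, `["Doe", "Jane", "de"]` sorts as `Doe Jane de` and displays as `Jane de Doe`, while `"Mao Zedong"` is used unchanged for both.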

    opened by bdarcus 1
  • Add options to Render trait

    Need to pass the config options to these.

    It may be it remains one parameter, but with four pieces:

    • global
    • citation
    • bibliography
    • component

    ... with a separate function to determine which to apply in a given context.
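A sketch of what the single parameter with four pieces might look like, plus the separate function that decides what applies. Everything here is hypothetical: the type and field names, the two example options, and in particular the precedence order (global, then citation or bibliography, then component), which is an assumption rather than a settled design.

```rust
/// A bag of optional settings; `None` means "not set at this level".
/// The two fields are placeholders for real options.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Options {
    pub shorten_names: Option<bool>,
    pub date_form: Option<String>,
}

/// The one parameter passed to rendering, with its four pieces.
#[derive(Debug, Clone, Default)]
pub struct RenderOptions {
    pub global: Options,
    pub citation: Options,
    pub bibliography: Options,
    pub component: Options,
}

/// Where the component is being rendered.
#[derive(Debug, Clone, Copy)]
pub enum Context {
    Citation,
    Bibliography,
}

impl Options {
    /// Return a copy of `self` where every field set in `other` wins.
    fn overlay(&self, other: &Options) -> Options {
        Options {
            shorten_names: other.shorten_names.or(self.shorten_names),
            date_form: other.date_form.clone().or_else(|| self.date_form.clone()),
        }
    }
}

impl RenderOptions {
    /// Resolve the effective options for a context.
    /// Assumed precedence, lowest to highest: global, context, component.
    pub fn resolve(&self, context: Context) -> Options {
        let scoped = match context {
            Context::Citation => &self.citation,
            Context::Bibliography => &self.bibliography,
        };
        self.global.overlay(scoped).overlay(&self.component)
    }
}
```

A `Render` implementation would then call `resolve` once with its context and read plain values from the result, instead of checking four places itself.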

    opened by bdarcus 1
  • Use rayon for processor methods

    Way too premature, but worth thinking about now:

    The most expensive operations will be sorting, grouping, etc.

    But performance isn't that critical in the sort of batch-processing scenarios I tend to focus on; for example, processing markdown or tex documents from the command line.

    Maybe there should be two modes, then: batch and interactive/server?

    On the latter, it would be good to do some of these operations in parallel, and to perhaps cache the key data structures.

    Citeproc-rs does this using rayon and salsa, so maybe look at that?

    EDIT: I did add parallel sorting very easily. I need to add a benchmark so I can compare the results.

    opened by bdarcus 1
  • Render cleanup

    Basic rendering now works so:

    1. [X] add it to CLI
    2. [X] fix render_references so it returns an array of templates (also need to add a type for that)
    3. [X] fix what seems to be a input date format issue (see error below)
    4. [X] fill out the example style to demonstrate and test better.

    Tackle these later:

    1. [ ] add edtf:Date formatting
    2. [ ] add disambiguation from hints
    3. [ ] I think I want to remove the outer templateComponent property in the AST, but may do later
        {
          "templateComponent": {
            "contributor": "author",
            "form": "long",
            "rendering": null
          },
          "value": "United Nations"
        }
    

    ❯ target/debug/csln processor/examples/style.csl.yaml processor/examples/ex1.bib.yaml
    thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError(TooShort)', bibliography/src/reference.rs:101:82
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    

    On EDTF, here's what the parse function returns, so will presumably need to match on this?

    pub enum Edtf {
        DateTime,
        Date,
        YYear,
        Interval,
        IntervalFrom,
        IntervalTo,
    }
    

    I think I then need to use chrono to do the formatting, but I'm not sure:

    https://docs.rs/chrono/latest/chrono/#formatting-and-parsing

    opened by bdarcus 1
  • Add group function

    Create three functions.

    First, make_group_key:

    Take a vector of StyleGroupKey and concatenate the values returned from the respective field data using string_for_key.

    Second, string_for_key:

    When given a key return the string value from ProcReference::data.

    Third, group_proc_references:

    Group the vector returned by Processor::get_proc_references and add a ProcHint to each, consisting of group_index, group_length, and group_key; return a new vector of ProcReference.

    Finally, add a test to processor_test.rs to confirm correct ProcHint::group_index values, using the examples/bibliography.yaml.
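The three functions described above could be sketched roughly as follows. The function names come from the description, but the type definitions are simplified stand-ins, and several details are assumptions: the `:` separator in the group key, the 1-based `group_index`, and storing the hint as an `Option` on the reference.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy)]
pub enum StyleGroupKey {
    Author,
    Year,
}

/// Simplified stand-in for the reference data.
#[derive(Debug, Clone)]
pub struct ReferenceData {
    pub author: String,
    pub year: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ProcHint {
    pub group_index: usize,
    pub group_length: usize,
    pub group_key: String,
}

#[derive(Debug, Clone)]
pub struct ProcReference {
    pub data: ReferenceData,
    pub hint: Option<ProcHint>,
}

/// Return the string value for one key from the reference data.
pub fn string_for_key(reference: &ProcReference, key: StyleGroupKey) -> String {
    match key {
        StyleGroupKey::Author => reference.data.author.clone(),
        StyleGroupKey::Year => reference.data.year.clone(),
    }
}

/// Concatenate the values for all keys into one group key.
pub fn make_group_key(reference: &ProcReference, keys: &[StyleGroupKey]) -> String {
    keys.iter()
        .map(|k| string_for_key(reference, *k))
        .collect::<Vec<_>>()
        .join(":")
}

/// Attach a ProcHint to each reference, preserving the input order.
pub fn group_proc_references(
    refs: Vec<ProcReference>,
    keys: &[StyleGroupKey],
) -> Vec<ProcReference> {
    // First pass: count how many references share each group key.
    let mut lengths: HashMap<String, usize> = HashMap::new();
    for r in &refs {
        *lengths.entry(make_group_key(r, keys)).or_insert(0) += 1;
    }
    // Second pass: number each reference within its group (1-based).
    let mut seen: HashMap<String, usize> = HashMap::new();
    refs.into_iter()
        .map(|mut r| {
            let group_key = make_group_key(&r, keys);
            let index = seen.entry(group_key.clone()).or_insert(0);
            *index += 1;
            r.hint = Some(ProcHint {
                group_index: *index,
                group_length: lengths[&group_key],
                group_key,
            });
            r
        })
        .collect()
}
```

Because the input order is preserved, the references should already be sorted before grouping if `group_index` is meant to reflect the final output order.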

    opened by bdarcus 1
  • refactor(options/sort): vector -> object

    To make room for additional configuration options for sorting, move the template to a named template field, and add a couple of example parameters.

    Refs: #64


    If you're curious, @plk, here's the commit for adjusting the sort config per discussion.

    sort:
      template:
        - author
        - issued-year
    

    Much of the magic will happen with default values.

    Depending on how complex it gets, I may later add a similar ability to reference pre-defined configs via a key.

    sort: abc
    
    opened by bdarcus 0
  • edtf and localized date formatting

    Made progress on this, but stuck on what seems one last detail: passing the right data types to the formatting method.

    I've spent hours trying to figure this out, without luck!

    Also, I'm giving up on this again.

    opened by bdarcus 0
  • Simplify make-key

    No need for this here; just pass the enum option directly:

    https://github.com/bdarcus/csln/blob/38645446c0ab8f5c20f5cb77a5fbebbdc44e1f0e/processor/src/lib.rs#L509-L512

    opened by bdarcus 0
  • refactor: add csln-types crate

    An experiment ATM, that I'm merging for now. But it's not hooked up, and it needs work.

    Adapt a few ideas, and some code, from Hayagriva, and separate out basic data types and formatting code from bibliography and processor.

    I think ideally I want all the basic formatting of these datatypes to be here too, so the models become clean, and also easier to contribute to for non-programmers or Rust devs, and it's all more modular.

    For titles, I adapted this from the TypeScript model, which in turn is adapted from the CSL 1.1 branch XML schema.

    export interface TitleStructured {
      full?: TitleString;
      main: TitleString;
      sub: TitleString[];
    }
    

    Still unsure on traits vs straight functions, etc. Masto response back-and-forth is very useful.

    https://fosstodon.org/@digikata/110515566415939721

    Here's one of the suggested examples to examine:

    https://github.com/obsidiandynamics/stanza/blob/master/src/renderer.rs

    And the markdown renderer:

    https://github.com/obsidiandynamics/stanza/blob/master/src/renderer/markdown.rs

    https://docs.rs/tabled/latest/tabled/

    May want to consider this on sorting:

    https://stackoverflow.com/questions/60916194/how-to-sort-a-vector-in-descending-order-in-rust/60916195#60916195

    opened by bdarcus 0
  • edtf date-time formatting

    Like my fifth attempt at this ....

    I've decided to add my own methods to parse edtf date-times, and also convert them to the sort of input formats needed by chrono or icu.

    The first commit here adds those.

    The parsing method returns an Edtf::Date regardless of whether it's valid, but if it's not, it's a dummy date with zero year.

    Maybe there's a better way, but it'll do for now.

    But this currently ONLY handles edtf:Date, so still work to do.

    opened by bdarcus 0
  • Multilingual support

    Seems the biblatex and latex world is a good place to look for how to do this.

    First, this issue at their tracker:

    https://github.com/plk/biblatex/issues/416

    Second, this thread:

    https://tex.stackexchange.com/a/505649

    It seems the minimum you need is a language field on entries.

    @book{Sharoni1969,
     author = {ميخائيل، ملاك  and  الشاروني، حبيب},
     date = {1969},
     title = {المرجع فى قواعد اللغة القبطية},
     location = {الاسكندرية},
     publisher = {جمعية مارمينا العجايبي},
     langid = {arabic}
    }
    @book{Browning1983,
     author = {Browning, Robert},
     date = {1983},
     title = {Medieval and Modern Greek},
     publisher = {Cambridge University Press},
     langid = {english}
    }
    

    But that doesn't go far enough:

    @misc{CBible2015,
     date = {2015},
     title = {\foreignlanguage{english}{Coptic Bible} الكتاب المقدس القبطي},
     langid = {arabic}
    }
    

    Rendering of that example:

    image

    So I think we also need to allow langids to be attached to alternate title fields.
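One way to model that, as a sketch only: let every title string carry an optional `langid` of its own, and fall back to the entry-level `langid` when it is absent. The type and function names below are hypothetical and are not part of the current model.

```rust
/// A string with an optional language tag of its own.
#[derive(Debug, Clone)]
pub struct LangString {
    pub text: String,
    pub langid: Option<String>,
}

/// A title with any number of alternate forms (translations, transliterations).
#[derive(Debug, Clone)]
pub struct Title {
    pub main: LangString,
    pub alternates: Vec<LangString>,
}

/// Simplified entry: only the fields relevant to language handling.
#[derive(Debug, Clone)]
pub struct Entry {
    pub langid: Option<String>,
    pub title: Title,
}

/// The language to use for a string: its own tag if present,
/// otherwise the language of the entry it belongs to.
pub fn effective_langid<'a>(entry: &'a Entry, s: &'a LangString) -> Option<&'a str> {
    s.langid.as_deref().or(entry.langid.as_deref())
}
```

For the Coptic Bible example, the entry would be tagged `arabic`, the main title would inherit that, and the alternate title "Coptic Bible" would carry its own `english` tag.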

    opened by bdarcus 2
  • Borrow from biblatex

    See also #61

    Beyond CSL, the other excellent package first released around the same time, and similarly ambitious, is biblatex.

    It has struck me that its design has some similarities to what I'm doing here.

    Consider their long list of completely flat parameters, aka options (and see the table I've attached below for how they map to scopes):

    image

    They've also been ahead of us on EDTF, and it looks like they've already figured it out.

    image

    biblatex-options-table.pdf

    opened by bdarcus 13
  • Contributor names, again

    I think I settled on a good foundation in #51, but left out a few things.

    Maybe we should look at biblatex extended names (see p15 of manual):

    @inproceedings{Hasselt2016,
      author    = {family=Hasselt, given=Hado P., prefix=van, useprefix=false and Guez, Arthur and Hessel, Matteo and Mnih, Volodymyr and Silver, David},
    }
    

    Of note:

    • prefix (aka CSL particle?)
    • suffix
    • useprefix (aka CSL non-dropping-particle?)
    • can also specify initials by appending -i

    See also #64.

    opened by bdarcus 0
  • Review, revise, add documentation to `style::options`

    Need to compare the csl-rs repo options side-by-side with these, adjust as needed, and add docs.

    The idea is to transfer, and as needed adjust, the language from the 1.0 style spec.

    opened by bdarcus 0
Owner
Bruce D'Arcus