UNIC: Unicode and Internationalization Crates for Rust

open-i18n — Open Internationalization Initiative

Last update: Nov 12, 2022

Overview

https://github.com/open-i18n/rust-unic

UNIC is a project to develop components for the Rust programming language to provide high-quality and easy-to-use crates for Unicode and Internationalization data and algorithms. In other words, it's like ICU for Rust, written completely in Rust, mostly in safe mode, but also benefiting from performance gains of unsafe mode when possible.

See UNIC Changelog for latest release details.

Project Goal

The goal for UNIC is to provide access to all levels of Unicode and Internationalization functionalities, starting from Unicode character properties, to Unicode algorithms for processing text, and more advanced (locale-based) processes based on Unicode Common Locale Data Repository (CLDR).

Other standards and best practices, like IETF RFCs, are also implemented, as needed by Unicode/CLDR components, or common demand.

Project Status

At the moment UNIC is under heavy development: the API is updated frequently on the master branch, and there will be API breakage between 0.x releases. Please see open issues for planned changes.

We expect to have the 1.0 version released in 2018 and maintain a stable API afterwards, with possibly one or two API updates per year for the first couple of years.

Design Goals

The primary goal of UNIC is to provide reliable functionality by way of an easy-to-use API. Therefore, newly added components may not be well-optimized for performance, but will have enough tests to show conformance to the standard, and examples to show users how they can be used to address common needs.
The next major goal for UNIC components is performance and low binary and memory footprints. In particular, optimizing runtime for ASCII and other common cases will encourage adoption without fear of slowing down regular development processes.
Components are guaranteed, to the extent possible, to provide consistent data and algorithms. Cross-component tests are used to catch any inconsistency between implementations, without slowing down development processes.

Components and their Organization

UNIC Components have a hierarchical organization, starting from the unic root, containing the major components. Each major component, in turn, may host some minor components.

The APIs of major components are designed for the end-users of the libraries, and are expected to be extensively documented and accompanied by code examples.

In contrast to major components, minor components act as providers of data and algorithms for the higher-level components, and their API is expected to be more performant, possibly providing multiple ways of accessing the data.

The UNIC Super-Crate

The unic super-crate is a collection of all UNIC (major) components, providing an easy way of access to all functionalities, when all or many are needed, instead of importing components one-by-one. This crate ensures all components imported are compatible in algorithms and consistent data-wise.

Main code examples and cross-component integration tests are implemented under this crate.

Major Components

unic-char: Unicode Character Tools.
unic-ucd: Unicode Character Database (UAX#44).
unic-bidi: Unicode Bidirectional Algorithm (UAX#9).
unic-normal: Unicode Normalization Forms (UAX#15).
unic-segment: Unicode Text Segmentation Algorithms (UAX#29).
unic-idna: Unicode IDNA Compatibility Processing (UTS#46).
unic-emoji: Unicode Emoji (UTS#51).

Applications

unic-cli: UNIC Command-Line Tools

Code Organization: Combined Repository

Some of the reasons to have a combined repository for these components are:

Faster development. Implementing new Unicode/i18n components very often depends on other (lower level) components, which in turn may need adjustments (exposing new API, fixing bugs, etc.) that can be developed, tested, and reviewed in fewer cycles and less time.
Implementation Integrity. Multiple dependencies on other components mean that the components need to, to some level, agree with each other. Many Unicode algorithms, composed from smaller ones, assume that all parts of the algorithm are using the same version of Unicode data. Violation of this assumption can cause inconsistencies and hard-to-catch bugs. In a combined repository, it's possible to reach better integrity during development, as well as with cross-component (integration) tests.
Pay for what you need. Small components (basic crates), which cross-depend only on what they need, allow users to only bring in what they consume in their project.
Shared bootstrapping. A considerable part of extending Unicode/i18n functionality depends on converting source Unicode/locale data into structured formats for the destination programming language. In a combined repository, it's easier to maintain these bootstrapping tools, expand coverage, and use better data structures for more efficiency.
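
The integrity point above can be illustrated with a small cross-component check. This is a minimal sketch, not UNIC's actual test code: the module names `ucd_age`, `ucd_bidi` and `ucd_normal` and the helper `versions_consistent` are hypothetical stand-ins for separate component crates, and the version is stored as a plain tuple for brevity.

```rust
// Hypothetical stand-ins for separate component crates. Each one records
// the Unicode version its data tables were generated from.
mod ucd_age {
    pub const UNICODE_VERSION: (u16, u16, u16) = (10, 0, 0);
}
mod ucd_bidi {
    pub const UNICODE_VERSION: (u16, u16, u16) = (10, 0, 0);
}
mod ucd_normal {
    pub const UNICODE_VERSION: (u16, u16, u16) = (10, 0, 0);
}

/// Returns true when every component reports the same Unicode version.
pub fn versions_consistent() -> bool {
    let versions = [
        ucd_age::UNICODE_VERSION,
        ucd_bidi::UNICODE_VERSION,
        ucd_normal::UNICODE_VERSION,
    ];
    // Compare each neighbouring pair; all equal means one shared version.
    versions.windows(2).all(|pair| pair[0] == pair[1])
}
```

In a combined repository a check of this kind can run on every change, so a component whose tables were regenerated from a different Unicode version is caught before release.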

Documentation

Unicode and Rust
UNIC Versioning
UNIC Unicode API
UNIC API Guideline
UNIC API Reference (autogenerated on docs.rs)

How to Use UNIC

In Cargo.toml:

[dependencies]
unic = "0.9.0"  # This has Unicode 10.0.0 data and algorithms

And in main.rs:

extern crate unic;

use unic::ucd::common::is_alphanumeric;
use unic::bidi::BidiInfo;
use unic::normal::StrNormalForm;
use unic::segment::{GraphemeIndices, Graphemes, WordBoundIndices, WordBounds, Words};
use unic::ucd::normal::compose;
use unic::ucd::{is_cased, Age, BidiClass, CharAge, CharBidiClass, StrBidiClass, UnicodeVersion};

fn main() {

    // Age

    assert_eq!(Age::of('A').unwrap().actual(), UnicodeVersion { major: 1, minor: 1, micro: 0 });
    assert_eq!(Age::of('\u{A0000}'), None);
    assert_eq!(
        Age::of('\u{10FFFF}').unwrap().actual(),
        UnicodeVersion { major: 2, minor: 0, micro: 0 }
    );

    if let Some(age) = '🦊'.age() {
        assert_eq!(age.actual().major, 9);
        assert_eq!(age.actual().minor, 0);
        assert_eq!(age.actual().micro, 0);
    }

    // Bidi

    let text = concat![
        "א",
        "ב",
        "ג",
        "a",
        "b",
        "c",
    ];

    assert!(!text.has_bidi_explicit());
    assert!(text.has_rtl());
    assert!(text.has_ltr());

    assert_eq!(text.chars().nth(0).unwrap().bidi_class(), BidiClass::RightToLeft);
    assert!(!text.chars().nth(0).unwrap().is_ltr());
    assert!(text.chars().nth(0).unwrap().is_rtl());

    assert_eq!(text.chars().nth(3).unwrap().bidi_class(), BidiClass::LeftToRight);
    assert!(text.chars().nth(3).unwrap().is_ltr());
    assert!(!text.chars().nth(3).unwrap().is_rtl());

    let bidi_info = BidiInfo::new(text, None);
    assert_eq!(bidi_info.paragraphs.len(), 1);

    let para = &bidi_info.paragraphs[0];
    assert_eq!(para.level.number(), 1);
    assert_eq!(para.level.is_rtl(), true);

    let line = para.range.clone();
    let display = bidi_info.reorder_line(para, line);
    assert_eq!(
        display,
        concat![
            "a",
            "b",
            "c",
            "ג",
            "ב",
            "א",
        ]
    );

    // Case

    assert_eq!(is_cased('A'), true);
    assert_eq!(is_cased('א'), false);

    // Normalization

    assert_eq!(compose('A', '\u{030A}'), Some('Å'));

    let s = "ÅΩ";
    let c = s.nfc().collect::<String>();
    assert_eq!(c, "ÅΩ");

    // Segmentation

    assert_eq!(
        Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
        &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
    );

    assert_eq!(
        Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
        &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
    );

    assert_eq!(
        GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
        &[(0, "a̐"), (3, "é"), (6, "ö̲"), (11, "\r\n")]
    );

    assert_eq!(
        Words::new(
            "The quick (\"brown\") fox can't jump 32.3 feet, right?",
            |s: &&str| s.chars().any(is_alphanumeric),
        ).collect::<Vec<&str>>(),
        &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
    );

    assert_eq!(
        WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
        &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
    );

    assert_eq!(
        WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
        &[
            (0, "Brr"),
            (3, ","),
            (4, " "),
            (5, "it's"),
            (9, " "),
            (10, "29.3"),
            (14, "°"),
            (16, "F"),
            (17, "!")
        ]
    );
}

You can find more examples under examples and tests directories. (And more to be added as UNIC expands...)

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Code of Conduct

UNIC project follows The Rust Code of Conduct. You can find a copy of it in CODE_OF_CONDUCT.md or online at https://www.rust-lang.org/conduct.html.

Comments

Add new unic-ucd-common component

There are some core properties that are commonly used in Unicode algorithms, as well as in applications directly, and are not specific to any single area. Examples of these properties are Alphabetic and White_Space.

Also, there are some resolved properties that are used commonly, like Numeric and Alphanumeric, which are commonly defined based on the General_Category and Alphabetic properties. Since they are common in applications, it makes sense to provide optimized implementations.

This new component, unic-ucd-common, hosts these properties.
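
As a rough illustration of a resolved property with an optimized path, here is a minimal sketch built on the standard library's `char` methods rather than UNIC's own tables. The function names are hypothetical, and the standard library's notion of "numeric" is assumed to be close enough for illustration; it may not match the definition this component ends up using.

```rust
/// Resolved property: a character is alphanumeric when it is alphabetic
/// or numeric. Uses the standard library's Unicode tables.
pub fn is_alphanumeric_resolved(c: char) -> bool {
    c.is_alphabetic() || c.is_numeric()
}

/// Same answer, with a cheap ASCII check before any table lookup.
pub fn is_alphanumeric_fast(c: char) -> bool {
    if c.is_ascii() {
        // ASCII letters and digits only; no table lookup needed.
        c.is_ascii_alphanumeric()
    } else {
        is_alphanumeric_resolved(c)
    }
}
```

The ASCII branch never touches the lookup tables, which is the kind of common-case optimization a dedicated component can provide.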
C: ucd A: lib-impl

opened by behnam 28
UCD Age: Custom enum type or Option?
Right now, we have Age as this:

https://github.com/behnam/rust-unic/blob/72ca9a893373f1857bc6ab6440389dfdceea6f13/unic/ucd/age/src/age.rs#L36-L42

It's nice to have this CompleteCharProperty, as it gives meaningful API like is_assigned() and is_unassigned().

Another option is to convert this to a PartialCharProperty, which returns None if the char is Unassigned, and Some(UnicodeVersion) otherwise.

One more option would be to have it as a PartialCharProperty, but with a return type that follows the UAX#44 spec more closely: having only major and minor numbers. Basically, the return value would be Some(Age), with Age being similar to UnicodeVersion, but without the micro number. Then, we can have transformations between Age and UnicodeVersion.

Option (3) makes it much more like other types in UCD, option (2) is a bit farther, and option (1), which is the current one, is the farthest from general approach here.
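
To make option (3) concrete, here is a minimal sketch. Everything in it is hypothetical: the `Age` and `UnicodeVersion` structs are simplified stand-ins, `age_of` is an illustrative name, and the three-entry table is a toy holding only the values used in the README example, not real UCD data.

```rust
/// Simplified stand-in for a full version number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UnicodeVersion {
    pub major: u16,
    pub minor: u16,
    pub micro: u16,
}

/// Option (3): an age value with major and minor numbers only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Age {
    pub major: u16,
    pub minor: u16,
}

impl From<Age> for UnicodeVersion {
    fn from(age: Age) -> Self {
        // The micro number is not part of the property, so it becomes 0.
        UnicodeVersion { major: age.major, minor: age.minor, micro: 0 }
    }
}

/// Toy table of inclusive ranges (NOT complete UCD data).
const AGE_TABLE: &[(char, char, Age)] = &[
    ('A', 'Z', Age { major: 1, minor: 1 }),
    ('\u{1F98A}', '\u{1F98A}', Age { major: 9, minor: 0 }),
    ('\u{10FFFF}', '\u{10FFFF}', Age { major: 2, minor: 0 }),
];

/// Partial property: `None` means the character is unassigned
/// (or, in this toy, simply missing from the table).
pub fn age_of(c: char) -> Option<Age> {
    AGE_TABLE
        .iter()
        .find(|&&(low, high, _)| low <= c && c <= high)
        .map(|&(_, _, age)| age)
}
```

With this shape, `is_assigned()` reduces to `age_of(c).is_some()`, at the cost of losing the dedicated enum of option (1).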

BUT, there's also the fact that option (1) looks nicer and more organic compared to the other ones.

I'm filing this mostly because I need to figure out a way to provide a property-range contract for Age, and it would be great if it's not just an empty contract.

Any ideas?
C: ucd discussion
opened by behnam 13
Implement Unicode Name - NR2
This is a work in progress for implementing Unicode Name - NR2 (#171).

[x] NR2: CJK Unified Ideograph / Tangut Ideograph / Nushu Character / CJK Compatibility Ideograph

[x] Add tests

Feedback is welcome!
opened by eyeplum 12
Replace unnecessary display methods with to_string.

The former implementation adds an unnecessary allocation to the Display impl. Because implementing Display gives the to_string method for free, there shouldn't be any Display methods.

If there are any methods which offer strings that differ from the Display implementation, those should have different names. I deprecated the methods instead of removing them, just to be safe.
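
A minimal sketch of the point about `Display` and `to_string`, using a cut-down, hypothetical `BidiClass` enum rather than the real UNIC type:

```rust
use std::fmt;

/// Cut-down stand-in for a character property enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BidiClass {
    LeftToRight,
    RightToLeft,
    ArabicLetter,
}

impl fmt::Display for BidiClass {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Write the static string directly; no intermediate String is built.
        f.write_str(match *self {
            BidiClass::LeftToRight => "Left-to-Right",
            BidiClass::RightToLeft => "Right-to-Left",
            BidiClass::ArabicLetter => "Right-to-Left Arabic",
        })
    }
}
```

Because the standard library implements `ToString` for every `Display` type, `BidiClass::LeftToRight.to_string()` is available without writing any extra method.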
C: utils A: lib-api

opened by clarfonthey 11
Initial implementation of UCD Segmentation properties
Add UCD Segmentation source data to /data/, implement conversion from new files to property map data tables.

Add unic-ucd-segment component with initial implementation of three main segmentation-related properties:

Grapheme_Cluster_Break,

Word_Break, and

Sentence_Break.

Current implementation uses char_property!() macro for EnumeratedCharProperty implementation, which only supports TotalCharProperty.

Since the Other (abbr: XX) value in all these properties denotes the absence of a breaking property, we want to switch to the PartialCharProperty domain type and use Option<enum>. This is left as a separate step because it needs changes to the macro.
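
The `Option<enum>` shape could look roughly like the sketch below. The enum is a hypothetical four-value subset of Word_Break and `word_break_toy` is an illustrative name; any character not listed, including ones that do have other Word_Break values in the real data, falls through to `None` in this toy.

```rust
/// Hypothetical subset of Word_Break values, with no `Other` variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WordBreak {
    CR,
    LF,
    ALetter,
    Numeric,
}

/// Partial property lookup: `None` plays the role of Other (XX).
/// Toy data covering a few ASCII characters only.
pub fn word_break_toy(c: char) -> Option<WordBreak> {
    match c {
        '\r' => Some(WordBreak::CR),
        '\n' => Some(WordBreak::LF),
        'a'..='z' | 'A'..='Z' => Some(WordBreak::ALetter),
        '0'..='9' => Some(WordBreak::Numeric),
        _ => None,
    }
}
```

Callers can then use the usual `Option` combinators, for example `word_break_toy(c).is_some()`, instead of comparing against a sentinel variant.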

Tracker: https://github.com/behnam/rust-unic/issues/135
C: segmentation C: ucd
opened by behnam 11
CharProperty::display() ?

Do we want to add display() as an instance method to CharProperty API? We already have impl fmt::Display for it, but returning &str would be useful in cases that the display string is not expected to be matched with formatting.

I have seen both cases in third-party libraries.

What do you think?

opened by behnam 11

Char property macro 2.0

Replaces #41. See #41 for earlier discussion.

An example will show better than I can tell:

char_property! {
    /// Represents the Unicode character
    /// [*Bidi_Class*](http://www.unicode.org/reports/tr44/#Bidi_Class) property,
    /// also known as the *bidirectional character type*.
    ///
    /// * <http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types>
    /// * <http://www.unicode.org/reports/tr44/#Bidi_Class_Values>
    pub enum BidiClass {
        /// Any strong left-to-right character
        ///
        /// ***General Scope***
        ///
        /// LRM, most alphabetic, syllabic, Han ideographs,
        /// non-European or non-Arabic digits, ...
        LeftToRight {
            abbr => L,
            long => Left_To_Right,
            display => "Left-to-Right",
        }

        /// Any strong right-to-left (non-Arabic-type) character
        ///
        /// ***General Scope***
        ///
        /// RLM, Hebrew alphabet, and related punctuation
        RightToLeft {
            abbr => R,
            long => Right_To_Left,
            display => "Right-to-Left",
        }

        /// Any strong right-to-left (Arabic-type) character
        ///
        /// ***General Scope***
        ///
        /// ALM, Arabic, Thaana, and Syriac alphabets,
        /// most punctuation specific to those scripts, ...
        ArabicLetter {
            abbr => AL,
            long => Arabic_Letter,
            display => "Right-to-Left Arabic",
        }
    }
}

/// Abbreviated name bindings for the `BidiClass` property
pub mod abbr_names for abbr;
/// Name bindings for the `BidiClass` property as they appear in Unicode documentation
pub mod long_names for long;

expands to:

/// Represents the Unicode character
/// [*Bidi_Class*](http://www.unicode.org/reports/tr44/#Bidi_Class) property,
/// also known as the *bidirectional character type*.
///
/// * <http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types>
/// * <http://www.unicode.org/reports/tr44/#Bidi_Class_Values>
#[allow(bad_style)]
#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub enum BidiClass {
    /// Any strong left-to-right character
    LeftToRight,
    /// Any strong right-to-left (non-Arabic-type) character
    RightToLeft,
    /// Any strong right-to-left (Arabic-type) character
    ArabicLetter,
}
/// Abbreviated name bindings for the `BidiClass` property
#[allow(bad_style)]
pub mod abbr_names {
    pub use super::BidiClass::LeftToRight as L;
    pub use super::BidiClass::RightToLeft as R;
    pub use super::BidiClass::ArabicLetter as AL;
}
/// Name bindings for the `BidiClass` property as they appear in Unicode documentation
#[allow(bad_style)]
pub mod long_names {
    pub use super::BidiClass::LeftToRight as Left_To_Right;
    pub use super::BidiClass::RightToLeft as Right_To_Left;
    pub use super::BidiClass::ArabicLetter as Arabic_Letter;
}
#[allow(bad_style)]
#[allow(unreachable_patterns)]
impl ::std::str::FromStr for BidiClass {
    type Err = ();
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "LeftToRight" => Ok(BidiClass::LeftToRight),
            "RightToLeft" => Ok(BidiClass::RightToLeft),
            "ArabicLetter" => Ok(BidiClass::ArabicLetter),
            "L" => Ok(BidiClass::LeftToRight),
            "R" => Ok(BidiClass::RightToLeft),
            "AL" => Ok(BidiClass::ArabicLetter),
            "Left_To_Right" => Ok(BidiClass::LeftToRight),
            "Right_To_Left" => Ok(BidiClass::RightToLeft),
            "Arabic_Letter" => Ok(BidiClass::ArabicLetter),
            _ => Err(()),
        }
    }
}
#[allow(bad_style)]
#[allow(unreachable_patterns)]
impl ::std::fmt::Display for BidiClass {
    fn fmt(&self, f: &mut ::std::fmt::Formatter) -> ::std::fmt::Result {
        match *self {
            BidiClass::LeftToRight => write!(f, "{}", "Left-to-Right"),
            BidiClass::RightToLeft => write!(f, "{}", "Right-to-Left"),
            BidiClass::ArabicLetter => write!(f, "{}", "Right-to-Left Arabic"),
            BidiClass::LeftToRight => write!(f, "{}", "Left_To_Right".replace('_', " ")),
            BidiClass::RightToLeft => write!(f, "{}", "Right_To_Left".replace('_', " ")),
            BidiClass::ArabicLetter => write!(f, "{}", "Arabic_Letter".replace('_', " ")),
            _ => {
                write!(
                    f,
                    "{}",
                    match *self {
                        BidiClass::LeftToRight => "L",
                        BidiClass::RightToLeft => "R",
                        BidiClass::ArabicLetter => "AL",
                        BidiClass::LeftToRight => "LeftToRight",
                        BidiClass::RightToLeft => "RightToLeft",
                        BidiClass::ArabicLetter => "ArabicLetter",
                    }
                )
            }
        }
    }
}
#[allow(bad_style)]
impl ::char_property::EnumeratedCharProperty for BidiClass {
    fn abbr_name(&self) -> &'static str {
        match *self {
            BidiClass::LeftToRight => "L",
            BidiClass::RightToLeft => "R",
            BidiClass::ArabicLetter => "AL",
        }
    }
    fn all_values() -> &'static [BidiClass] {
        const VALUES: &[BidiClass] = &[
            BidiClass::LeftToRight,
            BidiClass::RightToLeft,
            BidiClass::ArabicLetter,
        ];
        VALUES
    }
}

All three of the abbr, long, and display properties of the enum are optional, and have sane fallbacks: abbr_name and long_name return None if unspecified, and fmt::Display will check, in order, for display, long_name, abbr_name, and the variant name until it finds one to use (stringified, of course).

FromStr is defined, matching against any of the provided abbr, long, and variant name.
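
The fallback order described above can be written out by hand as follows. This is only a sketch of the lookup order, not the macro's generated code: the `Names` struct and `display_string` function are hypothetical, and the replacement of underscores with spaces for the long name mirrors the expansion shown earlier.

```rust
/// Hypothetical record of the names given for one enum variant.
pub struct Names {
    pub variant: &'static str,
    pub abbr: Option<&'static str>,
    pub long: Option<&'static str>,
    pub display: Option<&'static str>,
}

/// Picks the string to show: display, then long name (underscores become
/// spaces), then abbreviated name, then the variant name itself.
pub fn display_string(names: &Names) -> String {
    if let Some(display) = names.display {
        display.to_owned()
    } else if let Some(long) = names.long {
        long.replace('_', " ")
    } else if let Some(abbr) = names.abbr {
        abbr.to_owned()
    } else {
        names.variant.to_owned()
    }
}
```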

Important notes:

~~The current format uses associated consts, so it works on beta but won't work on stable until 1.20 is stable.~~
- Consts have a slightly different meaning than pub use -- pub use aliases the type where const is a new object and if used in pattern matching is a == call and not a pattern match.
- For this reason I'm actually slightly leaning towards using pub use even once associated consts land; they're compartmentalized (so use Property::* doesn't pull in 3x as many symbols as there are variants). After using the const based aliasing for a little bit, I'm inclined to like the current solution of unic::ucd::bidi::BidiClass::* + unic::ucd::bidi::bidi_class::abbr_names::*. These really should be a pub use and not a const.
- Note that I still think const are the way to go for cases like Canonical_Combining_Class, though.
~~The current syntax could easily be adapted to use modules instead of associated consts, but was written with the associated consts so we could get a feel of how it would look with them.~~
The zero-or-more meta match before an enum variant conflicts with the ident match before 1.20. See rust-lang/rust#42913, rust-lang/rust#24189
The only tests of the macro are rather thin and could be expanded.
It's a macro, so the response when you stick stuff not matching the expected pattern is cryptic at best.
The CharProperty trait is pretty much the lowest common denominator. It's a starting point, and we can iterate from there.
How and where do we want to make CharProperty an externally visible trait? Currently having it in namespace is the only way to access abbr_name and long_name.
~~Earlier discussion suggested putting these into unic::utils::char_property. Moving it would be simple, but for now it's living in the root of unic-utils~~
~~The crate unic-utils is currently in the workspace by virtue of being a dependency of unic, but is not in any way visible a crate depending on unic.~~
~~Documentation doesn't exist.~~
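
The `pub use` aliasing discussed in the notes above can be sketched as follows. The outer `bidi` module and the `is_rtl` helper are hypothetical; the alias names follow the expansion shown earlier. Since each alias is just another name for the variant, it works as an ordinary pattern in `match`.

```rust
pub mod bidi {
    /// Cut-down stand-in for the real property enum.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum BidiClass {
        LeftToRight,
        RightToLeft,
        ArabicLetter,
    }

    /// Abbreviated names, kept in their own module so that importing
    /// `BidiClass::*` does not also pull in every alias.
    pub mod abbr_names {
        pub use super::BidiClass::ArabicLetter as AL;
        pub use super::BidiClass::LeftToRight as L;
        pub use super::BidiClass::RightToLeft as R;
    }
}

/// Hypothetical helper: true for the strong right-to-left classes.
pub fn is_rtl(class: bidi::BidiClass) -> bool {
    use bidi::abbr_names::*;
    // The aliases are real patterns here, so the match is checked for
    // exhaustiveness just like one written with the full variant names.
    match class {
        R | AL => true,
        L => false,
    }
}
```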

opened by CAD97 11

[char/property] [ucd] Macro update and use

I can submit a separate PR after 31e6582 if we want to pull the edited macro and the application separately.

The diff is kind of opaque because this touches so much. You're probably better off just comparing before/after the PR without looking at the diff output.

This PR is submitted in a ready-to-merge (after review) state rather than my more usual track-final-polish.

This will require a rebase of #118 as it touches many of the same parts of the code.
C: ucd C: utils

opened by CAD97 10
[char/range] Add CharRange and CharIter

The first half of addressing #91. Closes #111, as this approach is better than that one.

This PR only has one type, CharRange. It is effectively std::ops::RangeInclusive translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for 'a'..='z' syntax.
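
A minimal sketch of such a type is shown below. It is not the PR's code: the constructor names `closed` and `open_right` and the `iter` method are assumptions made for illustration, and the macro is left out. The main difference from a plain integer range is the surrogate gap (U+D800 to U+DFFF), which has no `char` values and must be skipped.

```rust
/// Inclusive range of characters; empty when `low > high`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CharRange {
    pub low: char,
    pub high: char,
}

impl CharRange {
    /// Closed range: both ends included.
    pub fn closed(low: char, high: char) -> CharRange {
        CharRange { low, high }
    }

    /// Half-open range: `high` itself is excluded.
    pub fn open_right(low: char, high: char) -> CharRange {
        match high {
            // Nothing comes before U+0000, so the range is empty.
            '\0' => CharRange { low: '\u{1}', high: '\0' },
            // Step back over the surrogate gap.
            '\u{E000}' => CharRange { low, high: '\u{D7FF}' },
            _ => CharRange {
                low,
                high: char::from_u32(high as u32 - 1)
                    .expect("predecessor is a valid scalar value"),
            },
        }
    }

    pub fn contains(self, c: char) -> bool {
        self.low <= c && c <= self.high
    }

    /// Iterates in code point order, skipping the surrogate gap.
    pub fn iter(self) -> impl Iterator<Item = char> {
        (self.low as u32..=self.high as u32).filter_map(char::from_u32)
    }
}
```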

opened by CAD97 10
WIP: [ucd] Implement Name char property

It's a WIP. One problem with it right now is that it increases the compile time by dozens of seconds, because of the optimization being done with the &[&'static str] table. Because of that, we may be better off with a manually implemented mapping, meaning that table generation would take some time, but component compilation stays fast.

opened by behnam 9
Expand components/ucd/tests/category_tests.rs

We have a cross-component test in components/ucd/tests/category_tests.rs that checks values of the Bidi_Class property against General_Category property, based on UAX#9's Table 4. Bidirectional Character Types.

Now that we have component/ucd/category, we can expand the test to also cover General_Category values.
help wanted C: ucd L: easy A: test

opened by behnam 9
unic-ucd-hangul crate does not contain any license files

Hello there! I was trying to create a package for Fedora when I realized that the hangul crate does not contain any license files. Both the MIT and Apache-2.0 license require that redistributed sources contain a copy of the license text.

opened by Laiot 0

Digits ('0', '1', etc.) are interpreted as emojis

Upon calling unic::emoji::char::is_emoji('0'), this library returns true. I'm not aware of the specifics of the unicode standard, but I believe that '0' is not an emoji.

This test may be useful to introduce:

#[test]
fn are_nums_emojis() {
    use unic::emoji::char::is_emoji;
    assert_eq!(is_emoji('0'), false);
    assert_eq!(is_emoji('1'), false);
    assert_eq!(is_emoji('2'), false);
    assert_eq!(is_emoji('3'), false);
    assert_eq!(is_emoji('4'), false);
    assert_eq!(is_emoji('5'), false);
    assert_eq!(is_emoji('6'), false);
    assert_eq!(is_emoji('7'), false);
    assert_eq!(is_emoji('8'), false);
    assert_eq!(is_emoji('9'), false);
}
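
As background for this report: the emoji data published with UTS #51 lists `#`, `*`, and the digits `0` to `9` with the Emoji property, because they are the base characters of keycap sequences (base, U+FE0F, U+20E3), while none of them has Emoji_Presentation. The sketch below illustrates that distinction for ASCII only; the function names are hypothetical and are not part of unic-emoji.

```rust
/// ASCII characters that carry the Emoji property (keycap bases).
pub fn is_emoji_ascii(c: char) -> bool {
    matches!(c, '#' | '*' | '0'..='9')
}

/// No ASCII character has Emoji_Presentation, so a bare digit is
/// rendered as text by default.
pub fn is_emoji_presentation_ascii(_c: char) -> bool {
    false
}

/// True for a keycap sequence: a keycap base, then U+FE0F, then U+20E3.
pub fn is_keycap_sequence(s: &str) -> bool {
    let mut chars = s.chars();
    matches!(
        (chars.next(), chars.next(), chars.next(), chars.next()),
        (Some(base), Some('\u{FE0F}'), Some('\u{20E3}'), None) if is_emoji_ascii(base)
    )
}
```

A caller who wants "does this render as an emoji by default" is therefore likely looking for a presentation or sequence check rather than the bare Emoji property.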

opened by SirJosh3917 2

Forked library; and some thoughts about whether it's worth it to keep all modules at same Unicode version

I'm working on a font editor, MFEK. I also contribute to Unicode when I can. One of my fonts requires characters in Unicode 14.0.

For those reasons, I had to fork the project. I only need blocks, categories, and names, so I called my version QD-UNIC—“quick and dirty UNIC”. https://github.com/MFEK/qd-unic.rlib

I think that, perhaps, this project was too ambitious, in the sense that all the modules must match each other in Unicode version. That's what's caused a single PR, #226, to stall development of everything because of issues with unic-ucd-segment.

Obviously some of these modules are very easy to keep updated, and unic-gen works phenomenally well. Those implementing things like text segmentation and BIDI are going to be more difficult, and certainly subject to the needs of the community…which more often match mine than not. Basically, in short, users who only care about getting character names shouldn't suffer because no one has yet contributed a fix to a text segmentation problem.

Anyway, I doubt y'all will agree, which is why I forked, but I thought I'd let you know why I forked.

opened by ctrlcctrlv 3
Fix indexing within lines in BidiInfo::visual_runs

This fixes a bug where indices within a line are incorrectly used to index within the paragraph. Closes #272.

This is a port of servo/unicode-bidi#55 by @laurmaedje. Please see that PR for more details.

opened by mbrubeck 0
[unic-bidi] API concerns

I'm looking at using unic-bidi for a contemplated text layout project. For this discussion, consider the primary goal to be adding BiDi support to a (hypothetical) in-Rust implementation of text layout conforming to Piet's TextLayout trait. I see a number of concerns.

As a higher level process issue, one way to address those concerns is to fork unic-bidi. But I see value in trying to maintain a single source of truth for BiDi in the Rust ecosystem, so it feels like working with upstream is a good way to do that. That said, #272 is pretty good evidence to me, in addition to reverse dependencies (note that the RustPython reverse dep has actually recently been removed) that nobody is actually using unic-bidi, especially within the context of paragraph level text formatting.

Now for my concerns.

A big one is the &'text lifetime requirement on BidiInfo. That precludes retaining BidiInfo as part of our text layout object. One pattern for fixing this is to change the type parameter from <'text> to AsRef<str>, which would let the text be borrowed or owned, at the caller's request. Another way to fix it is to simply clone the string (it's not as if the implementation is particularly clone-averse). But I'm not convinced we actually need to retain the string. If you look at the methods on BidiInfo, I'm not sure reorder_line() is actually still valid in a modern environment; if my understanding is correct, HarfBuzz always wants its input in logical order. The string is also used in reorder_levels_per_char() to find the codepoint boundaries (I'm not sure this function is useful), and in visual_runs(), again mostly to find codepoint boundaries, but I think the "reset" logic can be run entirely from original_classes without making reference to the text.

Other concerns center around performance. I'm not advocating doing a lot of performance work now, but I also believe that the current API locks in certain decisions that make such performance work difficult in the future. That's mostly around the per-byte (technically, per UTF-8 code unit) arrays to represent things like levels. I think these don't represent the way people typically work with text in Rust, and in any case there are about twice as many UTF-8 code units as UTF-16 for most BiDi work.

So what I'd like better is for visual_runs to return an array of runs, where each run is a range in the text and a level (the latter would be after the reset logic is applied). I think the various "reorder*" functions should go, as I don't see them as being particularly useful, and if they are needed, I think their function can be written quite reasonably in terms of this proposed visual_runs.
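
A rough sketch of the proposed shape, under stated assumptions: the `Run` struct and both function names are hypothetical, the input is a slice of already-resolved embedding levels (one per text position, with any reset logic already applied), and the reordering follows rule L2 of UAX#9 at the granularity of whole runs.

```rust
use std::ops::Range;

/// A maximal stretch of text positions sharing one embedding level.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Run {
    pub range: Range<usize>,
    pub level: u8,
}

/// Groups consecutive equal levels into runs, in logical order.
pub fn level_runs(levels: &[u8]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for (i, &level) in levels.iter().enumerate() {
        if let Some(run) = runs.last_mut() {
            if run.level == level {
                run.range.end = i + 1;
                continue;
            }
        }
        runs.push(Run { range: i..i + 1, level });
    }
    runs
}

/// Returns the runs in visual (left-to-right display) order.
pub fn visual_runs(levels: &[u8]) -> Vec<Run> {
    let mut runs = level_runs(levels);
    let max = runs.iter().map(|r| r.level).max().unwrap_or(0);
    let min_odd = match runs.iter().map(|r| r.level).filter(|l| l % 2 == 1).min() {
        Some(level) => level,
        None => return runs, // all levels even: logical order is visual order
    };
    // Rule L2: from the highest level down to the lowest odd level,
    // reverse every contiguous sequence of runs at that level or higher.
    for level in (min_odd..=max).rev() {
        let mut i = 0;
        while i < runs.len() {
            if runs[i].level >= level {
                let start = i;
                while i < runs.len() && runs[i].level >= level {
                    i += 1;
                }
                runs[start..i].reverse();
            } else {
                i += 1;
            }
        }
    }
    runs
}
```

The caller still has to lay out the content of each odd-level run from right to left. No reference to the text is needed at this stage, which is the property argued for above.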

Lastly, the stuff around ParagraphInfo seems crufty to me. Why is this passed in? It's possible to infer from the text position. Taking a step back, it's not clear to me that segmenting into paragraphs is a useful function for unic-bidi in the first place. It seems extremely likely that its client will have already done segmentation into paragraphs by the time it gets to BiDi analysis. If the goal of the crate is to cover all of UAX#9, then perhaps paragraph segmentation can be provided through a separate API. Then the scope of BidiInfo could be narrowed to a single paragraph, which I think would be a desirable simplification.

I post this in part to provoke discussion from potential other clients of the crate, and also to test whether unic is a good container for the work.

opened by raphlinus 6

Releases(v0.9.0)

v0.9.0(Mar 3, 2019)
Add

unic-ucd-name_aliases: Unicode Name Alias character properties.

Changed

unic-cli: Fallback to Name Alias for characters without Name value.

Fixed

ucd-ident: Use correct data table for PatternWhitespace property. [GH-254]

Misc

Use external git submodules for source data.

Migrate to Rust 2018 Edition.

v0.8.0(Jan 2, 2019)
New Components

unic-ucd-block: List of all Unicode Blocks and the property assigning a block to each character.

unic-ucd-hangul: Unicode Hangul Syllable detection and Composition/Decomposition algorithms.

Other Updates

unic-ucd-name: Complete implementation for Unicode Name Property, with addition of Hangul and CJK Han names, as defined by The Unicode Standard.

Notes

This is the last release of the project before migration to Rust 2018 Edition.

Special thanks to Yan Li (@eyeplum) for implementing most of the features in this release.

v0.7.0(Feb 7, 2018)
UNIC Applications

UNIC Applications are binary crates hosted in the same repository as the unic super-crate, under the apps/ directory. These crates are not internal parts of the unic library, but tools designed and developed for a general audience, also serving as a test bed for the UNIC API. We are starting with CLI applications, and possibly expanding to GUI and web applications as well.

[unic-cli] The new UNIC CLI application provides command-line tools for working with Unicode characters and strings. In this release, first versions of unic-echo and unic-inspector commands are implemented.

New Components

Character Property

[unic-ucd-common] Common character properties (alphabetic, alphanumeric, control, numeric, and white_space).

[unic-ucd-ident] Unicode Identifier character properties.

[unic-ucd-segment] Unicode Segmentation character properties.

[unic-emoji-char] Unicode Emoji character properties.

String Algorithm

[unic-segment] Implementation of Unicode Text Segmentation algorithms (Grapheme Cluster and Word boundaries).

Other Updates

This release was delayed for a couple of cycles, because of problems with running tests in a workspace with a mix of std and no-std crates. The issue is resolved as of Rust 1.22.0.

Enable no_std for many of the existing components.

Bumped minimum Rust to 1.22.0.

Lots of small fixes for data types and internal structure updates.

v0.6.0(Sep 22, 2017)
New components and modules

Abstractions for working with characters

[unic-char-range] Range and iterator types for characters, plus a chars!() macro. (Used as chars!('a'..'e'), chars!('a'..='e'), or chars!(..).)

[unic-char-property] New component based on the module previously in unic-utils, with new support for binary character properties.

Extending Unicode Character Database properties

[unic-ucd-name] New minimal implementation of Unicode character names (Name property).

[unic-ucd-case] New basic implementation of Unicode character case properties.

[unic-ucd-bidi] Add Bidi_Mirrored and Bidi_Control properties.

Dropped components and modules

Drop unic-utils's iter_all_chars() in favor of unic-char-range types and macros.

Other updates

All tables are now generated by the Rust pipeline! 🎉

The Rust table generation has been cleaned up to a very nice level of polish! ✨

[unic-utils] Restructure tables into a dedicated type, rather than a mix of traits and "blessed" std types.

v0.5.0(Aug 6, 2017)
New component: [unic-ucd-category] Support General_Category Unicode (UCD) character property, implemented as enum GeneralCategory.

[unic-ucd-normal] Support Decomposition_Type Unicode (UCD) character property, implemented as enum DecompositionType.

[unic-ucd-normal] Update Canonical_Combining_Class implementation to a tuple struct and update the API accordingly.

[unic-ucd-age] Update Age property implementation to not cause API breakage on new Unicode versions.

[unic-utils] Rename from unic-ucd-utils, to contain all data-less utility functionalities. (https://github.com/behnam/rust-unic/issues/50)

Expand character property API in implementations, in the process of defining trait-based contracts for all (UCD and other) character properties. (https://github.com/behnam/rust-unic/issues/66, https://github.com/behnam/rust-unic/issues/34)

Reorganize code structure to make room for dev packages, like new unic-gen crate—which is going to replace the Python implementation for data table generation.

Drop data-dependent integration tests from packaging, allowing all tests to pass for downloaded packages. (https://github.com/behnam/rust-unic/issues/34)

[unic-ucd] Expand cross-component and conformance tests. (https://github.com/behnam/rust-unic/issues/18, https://github.com/behnam/rust-unic/issues/43)

Drop dependency on rustc_test in favor of default integration test harness. (https://github.com/behnam/rust-unic/issues/76)

v0.4.0(Jun 23, 2017)
Create UnicodeVersion type and use in all components for UNICODE_VERSION, and allow conversion to/from Age character property.
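
One way such a version type and its conversion to and from an Age value could look is sketched below. The field names, integer widths, and the `Age` wrapper are assumptions made for illustration and may not match the actual UNIC definitions.

```rust
/// Hypothetical version triple; field names and widths are assumptions.
/// Deriving `Ord` compares major, then minor, then micro, which matches
/// the natural ordering of version numbers.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct UnicodeVersion {
    major: u16,
    minor: u16,
    micro: u16,
}

/// Illustrative constant in the style of a per-component UNICODE_VERSION.
const UNICODE_VERSION: UnicodeVersion = UnicodeVersion { major: 10, minor: 0, micro: 0 };

/// Hypothetical Age property value: the version in which a character
/// was first assigned.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Age(UnicodeVersion);

impl From<UnicodeVersion> for Age {
    fn from(version: UnicodeVersion) -> Age {
        Age(version)
    }
}

impl From<Age> for UnicodeVersion {
    fn from(age: Age) -> UnicodeVersion {
        age.0
    }
}
```

With this shape, checking whether a character's age is covered by a component's data reduces to an ordinary comparison such as `UnicodeVersion::from(age) <= UNICODE_VERSION`.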

Split IDNA Mapping data into unic-idna-mapping and leave the process algorithms in unic-idna.

[ucd] Create common pattern for UCD character properties: For property called Prop, static function Prop::of(ch: char) to get value for a character, and ch.<prop>() using the helper trait called CharProp. Also, move all property value helpers into impl Prop as methods.
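
The pattern described above can be sketched with a toy property. `AsciiKind` and its values are hypothetical and are not part of UCD or UNIC; only the shape (a static `of()` function, value helpers as methods, and a `CharProp` helper trait on `char`) is taken from the description.

```rust
/// Hypothetical property, used only to illustrate the pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AsciiKind {
    Letter,
    Digit,
    Other,
}

impl AsciiKind {
    /// Static lookup in the `Prop::of(ch)` style.
    fn of(ch: char) -> AsciiKind {
        if ch.is_ascii_alphabetic() {
            AsciiKind::Letter
        } else if ch.is_ascii_digit() {
            AsciiKind::Digit
        } else {
            AsciiKind::Other
        }
    }

    /// Property value helper, kept on the type as a method.
    fn is_alphanumeric(self) -> bool {
        matches!(self, AsciiKind::Letter | AsciiKind::Digit)
    }
}

/// Helper trait that adds the `ch.<prop>()` form to `char`.
trait CharProp {
    fn ascii_kind(self) -> AsciiKind;
}

impl CharProp for char {
    fn ascii_kind(self) -> AsciiKind {
        AsciiKind::of(self)
    }
}
```

Both call styles then return the same value: `AsciiKind::of('a')` and `'a'.ascii_kind()`.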

[idna] Use standard binary_search_by().
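
As an illustration of the technique, the sketch below looks up a character in a sorted table of inclusive ranges with the standard slice method `binary_search_by()`. The table contents and the `lookup` function are hypothetical and do not reproduce the actual IDNA mapping data or code.

```rust
use std::cmp::Ordering;

/// Hypothetical table of sorted, non-overlapping inclusive ranges,
/// each paired with a value. Real mapping tables are far larger.
const TABLE: &[(char, char, &str)] = &[
    ('0', '9', "digit"),
    ('A', 'Z', "upper"),
    ('a', 'z', "lower"),
];

/// Returns the value of the range containing `ch`, if there is one.
fn lookup(ch: char) -> Option<&'static str> {
    TABLE
        .binary_search_by(|&(lo, hi, _)| {
            if hi < ch {
                // The whole range lies before `ch`: search to the right.
                Ordering::Less
            } else if lo > ch {
                // The whole range lies after `ch`: search to the left.
                Ordering::Greater
            } else {
                // `ch` falls inside this range.
                Ordering::Equal
            }
        })
        .ok()
        .map(|index| TABLE[index].2)
}
```

The comparator treats each range as a single ordered element, so the search is correct only while the table stays sorted and free of overlaps.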

Pass in bench_it feature to components supporting it. (Only unic-bidi at the moment.)

v0.3.0(Jun 22, 2017)
Add ucd::age component. (unic-ucd-age)

v0.2.0(Jun 21, 2017)

Update UCD and IDNA data to Unicode 10.0.0, as released on 2017-06-20.
v0.1.2(Jun 20, 2017)
Add a bunch of missing documentation.

Add a script to publish all crates, in order of dependency.

v0.1.1(Jun 20, 2017)

Initial release with UCD, Bidi, IDNA, and Normalization components.

Owner

open-i18n — Open Internationalization Initiative

GitHub https://crates.io/crates/unic

A turing-complete programming language using only zero-width unicode characters, inspired by brainfuck and whitespace.

Zero-Width A turing-complete programming language using only zero-width unicode characters, inspired by brainfuck and whitespace. Currently a (possibl

2 Jan 14, 2022

Like wc, but unicode-aware, and with per-line mode

34 May 24, 2022

Determine the Unicode class of a mathematical character in Rust.

unicode-math-class Determine the Unicode class of a mathematical character in Rust. Example use unicode_math_class::{class, MathClass}; assert_eq!(cl

3 Jan 10, 2023

OOLANG - an esoteric stack-based programming language where all instructions/commands are different unicode O characters

OOLANG is an esoteric stack-based programming language where all instructions/commands are different unicode O characters

2 Mar 20, 2022

A crate for converting an ASCII text string or file to a single unicode character

A crate for converting an ASCII text string or file to a single unicode character. Also provides a macro to embed encoded source code into a Rust source file. Can also do the same to Python code while still letting the code run as before by wrapping it in a decoder.

17 Dec 31, 2022

Rust crates with map and set with interval keys (ranges x..y).

This crate implements map and set with interval keys (ranges x..y). IntervalMap is implemented using red-black binary tree, where each node contains

8 Aug 23, 2022

List public items (public API) of Rust library crates. Enables diffing public API between releases.

cargo wrapper for this library You probably want the cargo wrapper to this library. See https://github.com/Enselic/cargo-public-items. public_items Li

20 Dec 26, 2022

Game development practices with Rust programming language. I want to use different crates for this.

Developing Games with the Rust Language Using Ready-Made Game Engines. Several popular ready-made frameworks for developing games in the Rust programming language

16 Dec 27, 2022

A collection of crates to make minecraft development (client, server) with rust possible.

rust-craft rust-craft is a collection of crates to make minecraft development (client, server) with rust possible. Motivation There's no better way of

15 Mar 23, 2023

A procedural macro for configuring constant values across crates

toml-cfg Rough ideas: Crates can declare variables that can be overridden Anything const, e.g. usize, strings, etc. (Only) The "root crate" can overri

43 Dec 24, 2022

Intro: we are creating a software system for a pizza restaurant, one of the modules is supposed to handle the management of various pizza recipes and how the orders are put together, and a big part of the module will be the control of food types, the potential allergens in recipes, and calories counting.

rust_pizzeria Intro: we are creating a software system for a pizza restaurant, one of the modules is supposed to handle the management of various pizz

1 Oct 26, 2021

In this repository you can find modules with code and comments that explain rust syntax and all about Rust lang.

Learn Rust What is this? In this repository you can find modules with code and comments that explain rust syntax and all about Rust lang. This is usef

5 Nov 5, 2022

A comprehensive and FREE Online Rust hacking tutorial utilizing the x64, ARM64 and ARM32 architectures going step-by-step into the world of reverse engineering Rust from scratch.

FREE Reverse Engineering Self-Study Course HERE Hacking Rust A comprehensive and FREE Online Rust hacking tutorial utilizing the x64, ARM64 and ARM32

98 Jun 21, 2023

An API for getting questions from http://either.io implemented fully in Rust, using reqwest and some regex magic. Provides asynchronous and blocking clients respectively.

eithers_rust An API for getting questions from http://either.io implemented fully in Rust, using reqwest and some regex magic. Provides asynchronous a

2 Oct 24, 2021


Fast and simple datetime, date, time and duration parsing for rust.

A simpler and 5x faster alternative to HashMap in Rust, which doesn't use hashing and doesn't use heap

Safe, efficient, and ergonomic bindings to Wolfram LibraryLink and the Wolfram Language

This blog provides detailed status updates and useful information about Theseus OS and its development

Omeglib, a portmanteau of "omegle" and "library", is a crate for interacting with omegle, simply and asynchronously