Skip to content
This repository has been archived by the owner on May 20, 2022. It is now read-only.

Calculate Code Page ranges #23

Merged
merged 6 commits into from Jun 7, 2021
Merged

Conversation

m4rc1e
Copy link
Contributor

@m4rc1e m4rc1e commented May 12, 2021

Still mega wip. I'm basing my implementation on ufo2ft which in turn is based on FontForge.

@m4rc1e m4rc1e force-pushed the codepages branch 5 times, most recently from d2b5cf7 to a2a02f8 Compare May 13, 2021 14:13
@m4rc1e m4rc1e marked this pull request as ready for review May 13, 2021 14:19
@m4rc1e
Copy link
Contributor Author

m4rc1e commented May 13, 2021

This now produces the same result as the implementation in ufo2ft. I had to update src/os2.rs since the second codepage field was referencing the first.

I've added a simple test to check that the basic multilingual plane will enable all the specified codepages. I can add more meaningful tests if needed. However, due to this lib being in constant flux, it may be better to do this later.

@simoncozens I'm worrying about my overuse of clone and cloned when I'm using iterators and closures. Would be great to see what you would recommend.

@m4rc1e
Copy link
Contributor Author

m4rc1e commented May 14, 2021

Perhaps it may be better to move this to the os2 module since it computes the codepages using the cmap instead of ufo.info?

@simoncozens
Copy link
Owner

That sounds like a nice idea: a method on an os2 that takes &mut self and a cmap table and fills in the relevant fields. Yes please!

@moyogo
Copy link
Contributor

moyogo commented May 14, 2021

This would also be useful as a single tool.

@simoncozens
Copy link
Owner

Ping @cmyr

Copy link
Collaborator

@cmyr cmyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, a bunch of notes inline.

As a general observation: cargo fmt and cargo clippy are very friendly, and I would encourage you to use them both; I'll look into turning them on in CI for this repo.

@@ -164,6 +165,15 @@ pub fn compile_os2(
.open_type_os2_subscript_x_size
.unwrap_or((upm * 0.65).round() as i32) as i16;

let unicodes:HashSet<u32> = mapping.keys().cloned().collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of allocating a new HashSet, we could probably just pass in a reference to the HashMap directly, and use HashMap::contains_key? One possible downside of this is that it might make calc_code_page_ranges have a slightly weird signature, and harder to use from other places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to move this function so it becomes a method of the os2 table since it makes more sense for the code to live there. I may just end up making the method signature take the cmap table as input but I'll think about it.

@@ -164,6 +165,15 @@ pub fn compile_os2(
.open_type_os2_subscript_x_size
.unwrap_or((upm * 0.65).round() as i32) as i16;

let unicodes:HashSet<u32> = mapping.keys().cloned().collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the .cloned() method on Iterator, there's also a copied() method. In practice these will behave identically, but using copied() where possible makes it clearer to the reader that this is a cheap operation.

A few other ways to write this:

let unicodes: HashSet<_> = mapping.keys().copied().collect();
let unicodes = mapping.keys().copied().collect::<HashSet<_>>();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Which approach would you say is the best to adopt for the future? personally I find the turbofish style a little weird but for all I know this may be the most common.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally probably do the second form more often, but I wouldn't comment on the first form in code review 🤷

Comment on lines 169 to 173
let code_pages:Vec<u8> = info.open_type_os2_code_page_ranges.as_ref()
.unwrap_or(&calc_code_page_ranges(&unicodes))
.iter()
.cloned()
.collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few things:

as written, this is doing some extra work.

  • unwrap_or: this always generates the code passed in, so we will run calc_code_page_ranges even if we already have the ranges. The alternative is unwrap_or_else, which takes a closure that is evaluated lazily.
  • iter..collect: calc_code_page_ranges returns exactly the Vec<u8> that we want, but here we're then iterating it again to create a copy.
  • if we're only using unicodes in the case where the code page ranges don't exist, we should avoid creating it unless necessary.

Put all together, I would write this as something like:

let code_pages = info.open_type_os2_code_page_ranges
        .clone() // clone on an option clones the contents if it exists, otherwise does nothing
        .unwrap_or_else(|| { // only called in the `None` case
            // type should be inferred based on the function we pass it to?
            let unicodes = mapping.keys().copied().collect();
            calc_code_page_ranges(&unicodes)
        });

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thanks!

Comment on lines 174 to 175
let code_pages1: Vec<u8> = code_pages.iter().filter(|&x| *x < 32).cloned().collect();
let code_pages2: Vec<u8> = code_pages.iter().filter(|&x| *x >= 32).map(|&x|x - 32).collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write this as,

let mut code_pages = code_pages; // unnecessary, just give this a mut binding above
code_pages.sort_unstable(); // not necessary if already sorted
let split_at = code_pages.iter().position(|x| x >= 32).unwrap_or_else(|| code_pages.len());
let mut code_pages2 = code_pages.split_off(split_at);
code_pages2.iter_mut().for_each(|x| x -= 32);

/// implementation based on ufo2ft:
/// https://github.com/googlefonts/ufo2ft/blob/main/lib/ufo2ft/util.py#l307
pub fn calc_code_page_ranges(unicodes: &HashSet<u32>) -> Vec<u8> {
let mut code_page_ranges = HashSet::new();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is there a reason to prefer HashSet to Vec, here?
  • would it make sense for us to generate the two separate vecs here, instead of doing it at the call-site?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to prefer HashSet to Vec, here?

You're right, this could and probably should be a vec.

would it make sense for us to generate the two separate vecs here, instead of doing it at the call-site?

We could even go a step further and return a tuple containing two 32bit ints which is the desired end result.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh totally, that's the sort of thing that rust excels at, and then we're doing way less work.

let mut code_page_ranges = HashSet::new();

let ascii_range: HashSet<u32> = (0x20..0x7E).collect();
let has_ascii = ascii_range.is_subset(&unicodes);
Copy link
Collaborator

@cmyr cmyr May 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be tempted to write this as,

let has_ascii = (0x20..0x7E).all(|x| unicodes.contains(&x));

This saves us one more allocation. 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

Comment on lines 172 to 186
if unicodes_contains('Þ') && has_ascii {
code_page_ranges.insert(0); // Latin 1
}
if unicodes_contains('Ľ') && has_ascii {
code_page_ranges.insert(1); // Latin 2
}
if unicodes_contains('Б') {
code_page_ranges.insert(2); // Cyrillic
if unicodes_contains('Ѕ') && has_lineart {
code_page_ranges.insert(57); // IBM Cyrillic
}
if unicodes_contains('╜') && has_lineart {
code_page_ranges.insert(49); // MS-DOS Russian
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now I'm just having fun, but I would be tempted to at least handle this simple case of "check char and addd to table" with a macro:

some time later:

Okay I'm going to not let myself rabbit-hole on this too hard, but as an illustration: with the following macro, I can replace this section with:

macro_rules! add_page_if_needed {
    // helper for if there's no condition and no block
    ($collection:ident, $ranges:ident, $chr:expr, $page:expr) => {
        add_page_if_needed!($collection, $chr, $page, true, {});
    };
    // helper for if there's a condition but no block
    ($collection:ident, $ranges:ident, $chr:expr, $page:expr, $constraint:expr) => {
        add_page_if_needed!($collection, $ranges, $chr, $page, $constraint, {});
    };
    // the verbose form
    ($collection:ident, $ranges:ident, $chr:expr, $page:expr, $constraint:expr, $block:block) => {
        if $collection.contains(&($chr as u32)) && $constraint {
            $ranges.insert($page);
            $block;
        }
    };
}
Suggested change
if unicodes_contains('Þ') && has_ascii {
code_page_ranges.insert(0); // Latin 1
}
if unicodes_contains('Ľ') && has_ascii {
code_page_ranges.insert(1); // Latin 2
}
if unicodes_contains('Б') {
code_page_ranges.insert(2); // Cyrillic
if unicodes_contains('Ѕ') && has_lineart {
code_page_ranges.insert(57); // IBM Cyrillic
}
if unicodes_contains('╜') && has_lineart {
code_page_ranges.insert(49); // MS-DOS Russian
}
}
add_page_if_needed!(unicodes, code_page_ranges,'Þ', 0, has_ascii);
add_page_if_needed!(unicodes, code_page_ranges,'Ľ', 1, has_ascii);
add_page_if_needed!(unicodes, code_page_ranges, 'Б', 2, true, {
add_page_if_needed!(unicodes, code_page_ranges,'Ѕ', 57, has_lineart);
add_page_if_needed!(unicodes, code_page_ranges, '╜', 49, has_lineart);
});

In any case: this is provided more for illustrative purposes, and rust macros can be confusing, but I wanted to point this out as a way I would think about this problem. :)

@m4rc1e
Copy link
Contributor Author

m4rc1e commented May 15, 2021

Thank you very much for your review @cmyr! I hope it wasn't too painful. I've only been studying the existing code and the Rust book. Are there other resources that you would recommend? I mainly come from a python background but have knowledge of C/C++. I find myself fighting the compiler a lot when it comes to membership and lifetimes so any links that helped you grok these topics quicker would be much appreciated.

@cmyr
Copy link
Collaborator

cmyr commented May 15, 2021

I wish I had a good answer, but I largely don't. When I was becoming familiar with the language I spent a lot of time hanging out in a #rust-beginners irc channel; it doesn't exist anymore, but I believe there's a pretty active discord community? Another very useful resource is the this week in rust newsletter; skimming that and reading things that interest you will be to just be ambiently aware of how other folks are writing the language.

Other than that, clippy and rustfmt really help avoid a lot of common problems; it really lets people avoid the bulk of argument over formatting and idiom, and clippy will often point out more idiomatic alternatives (like suggesting my_vec.clear() instead of my_vec.truncate(0), as an example).

For the membership and lifetimes stuff: generally as folks write more rust they start to develop an intuition for a lot of this stuff. One thing that can be useful is to really try to imagine the stack, and imagine that things owned 'higher' in the call stack can be passed 'down' the stack, but things owned 'lower' can't be passed up, because when the function returns that memory is freed.

@m4rc1e
Copy link
Contributor Author

m4rc1e commented May 21, 2021

@cmyr I've implemented your feedback and moved most of the functionality to the os2 module. We now have two methods which can generate the codepage bits from a list or from a ufo's codepage attribute. Feel free to take another look. It isn't urgent. Once you're happy, I'll tackle the unicode ranges.

Copy link
Collaborator

@cmyr cmyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments that you're totally free to ignore.

I think that the code that you're writing works, and I think that's the most important thing; a lot of these other optimizations can come along later. That said, I have a sneaking suspicion that somebody is going to be using this library ten years in the future, so I don't think any effort spent on code quality etc is wasted time?

crates/fonticulus/src/basictables.rs Outdated Show resolved Hide resolved
}
/// implementation based on ufo2ft:
/// https://github.com/googlefonts/ufo2ft/blob/main/lib/ufo2ft/util.py#l307
pub fn calc_code_page_ranges(&mut self, mapping: &BTreeMap<u32, u16>) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now that I have a clearer idea of what we're doing, it feels like what we really want here is a function that creates two 32-bit integers in place. This is the sort of low-level thing that can often be annoying to do, but with Rust we could make this quite easy. What I would personally do is make a helper type that looks like a collection but is just doing bit arithmetic under the hood, something like:

#[derive(Default)]
struct CodePageRanges {
    one: u32,
    two: u32,
}

impl CodePageRanges {
    fn set(&mut self, idx: usize) {
        assert!(idx < 64);
        if idx >= 32 {
            self.two |= 1 << (idx - 32);
        } else {
            self.one |= 1 << idx;
        }
    }
    
    fn into_inner(self) -> (u32, u32) {
        (self.one, self.two)
    }
}

And then you can just call this directly, everwhere that you are using code_page_ranges:

let mut code_page_ranges = CodePageRanges::default();

if unicodes_contains('Þ') && has_ascii {
    code_page_ranges.set(0); // Latin 1
}

This then does everything at once; we're building up our final representation directly, as we add each new value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! this is much cleaner.

Co-authored-by: Colin Rofls <colin@cmyr.net>
@m4rc1e
Copy link
Contributor Author

m4rc1e commented Jun 7, 2021

@simoncozens feel free to merge this. I will implement the unicode range bits next so I may end up refactoring this part. I'm going to keep Colin's comments visible since they may help me later.

@simoncozens simoncozens merged commit 46226b0 into simoncozens:main Jun 7, 2021
@simoncozens
Copy link
Owner

Thanks very much for this - looks great!

@m4rc1e m4rc1e deleted the codepages branch June 8, 2021 08:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants