Calculate Code Page ranges #23

m4rc1e · 2021-05-12T11:35:07Z

Still mega wip. I'm basing my implementation on ufo2ft which in turn is based on FontForge.

m4rc1e · 2021-05-13T14:23:22Z

This now produces the same result as the implementation in ufo2ft. I had to update src/os2.rs since the second codepage field was referencing the first.

I've added a simple test to check that the basic multilingual plane will enable all the specified codepages. I can add more meaningful tests if needed. However, due to this lib being in constant flux, it may be better to do this later.

@simoncozens I'm worrying about my overuse of clone and cloned when I'm using iterators and closures. Would be great to see what you would recommend.

m4rc1e · 2021-05-14T10:45:35Z

Perhaps it may be better to move this to the os2 module since it computes the codepages using the cmap instead of ufo.info?

simoncozens · 2021-05-14T12:24:31Z

That sounds like a nice idea: a method on an os2 that takes &mut self and a cmap table and fills in the relevant fields. Yes please!

moyogo · 2021-05-14T15:19:03Z

This would also be useful as a single tool.

simoncozens · 2021-05-14T15:21:14Z

Ping @cmyr

cmyr

Okay, a bunch of notes inline.

As a general observation: cargo fmt and cargo clippy are very friendly, and I would encourage you to use them both; I'll look into turning them on in CI for this repo.

cmyr · 2021-05-14T17:12:57Z

crates/fonticulus/src/basictables.rs

@@ -164,6 +165,15 @@ pub fn compile_os2(
        .open_type_os2_subscript_x_size
        .unwrap_or((upm * 0.65).round() as i32) as i16;

+    let unicodes:HashSet<u32> = mapping.keys().cloned().collect();


Instead of allocating a new HashSet, we could probably just pass in a reference to the HashMap directly, and use HashMap::contains_key? One possible downside of this is that it might make calc_code_page_ranges have a slightly weird signature, and harder to use from other places?

I'm going to move this function so it becomes a method of the os2 table since it makes more sense for the code to live there. I may just end up making the method signature take the cmap table as input but I'll think about it.

cmyr · 2021-05-14T17:15:12Z

crates/fonticulus/src/basictables.rs

@@ -164,6 +165,15 @@ pub fn compile_os2(
        .open_type_os2_subscript_x_size
        .unwrap_or((upm * 0.65).round() as i32) as i16;

+    let unicodes:HashSet<u32> = mapping.keys().cloned().collect();


In addition to the .cloned() method on Iterator, there's also a copied() method. In practice these will behave identically, but using copied() where possible makes it clearer to the reader that this is a cheap operation.

A few other ways to write this:

let unicodes: HashSet<_> = mapping.keys().copied().collect(); let unicodes = mapping.keys().copied().collect::<HashSet<_>>();

Thanks!

Which approach would you say is the best to adopt for the future? personally I find the turbofish style a little weird but for all I know this may be the most common.

I personally probably do the second form more often, but I wouldn't comment on the first form in code review 🤷

cmyr · 2021-05-14T17:23:33Z

crates/fonticulus/src/basictables.rs

+    let code_pages:Vec<u8> = info.open_type_os2_code_page_ranges.as_ref()
+        .unwrap_or(&calc_code_page_ranges(&unicodes))
+        .iter()
+        .cloned()
+        .collect();


a few things:

as written, this is doing some extra work.

unwrap_or: this always generates the code passed in, so we will run calc_code_page_ranges even if we already have the ranges. The alternative is unwrap_or_else, which takes a closure that is evaluated lazily.

iter..collect: calc_code_page_ranges returns exactly the Vec<u8> that we want, but here we're then iterating it again to create a copy.

if we're only using unicodes in the case where the code page ranges don't exist, we should avoid creating it unless necessary.

Put all together, I would write this as something like:

let code_pages = info.open_type_os2_code_page_ranges .clone() // clone on an option clones the contents if it exists, otherwise does nothing .unwrap_or_else(|| { // only called in the `None` case // type should be inferred based on the function we pass it to? let unicodes = mapping.keys().copied().collect(); calc_code_page_ranges(&unicodes) });

Awesome thanks!

cmyr · 2021-05-14T17:32:10Z

crates/fonticulus/src/basictables.rs

+    let code_pages1: Vec<u8> = code_pages.iter().filter(|&x| *x < 32).cloned().collect();
+    let code_pages2: Vec<u8> = code_pages.iter().filter(|&x| *x >= 32).map(|&x|x - 32).collect();


I would write this as,

let mut code_pages = code_pages; // unnecessary, just give this a mut binding above code_pages.sort_unstable(); // not necessary if already sorted let split_at = code_pages.iter().position(|x| x >= 32).unwrap_or_else(|| code_pages.len()); let mut code_pages2 = code_pages.split_off(split_at); code_pages2.iter_mut().for_each(|x| x -= 32);

cmyr · 2021-05-14T17:38:55Z

crates/fonticulus/src/fontinfo.rs

+/// implementation based on ufo2ft:
+/// https://github.com/googlefonts/ufo2ft/blob/main/lib/ufo2ft/util.py#l307
+pub fn calc_code_page_ranges(unicodes: &HashSet<u32>) -> Vec<u8> {
+    let mut code_page_ranges = HashSet::new();


Is there a reason to prefer HashSet to Vec, here?

would it make sense for us to generate the two separate vecs here, instead of doing it at the call-site?

Is there a reason to prefer HashSet to Vec, here?

You're right, this could and probably should be a vec.

would it make sense for us to generate the two separate vecs here, instead of doing it at the call-site?

We could even go a step further and return a tuple containing two 32bit ints which is the desired end result.

oh totally, that's the sort of thing that rust excels at, and then we're doing way less work.

cmyr · 2021-05-14T17:41:58Z

crates/fonticulus/src/fontinfo.rs

+    let mut code_page_ranges = HashSet::new();
+
+    let ascii_range: HashSet<u32> = (0x20..0x7E).collect();
+    let has_ascii = ascii_range.is_subset(&unicodes);


I would be tempted to write this as,

let has_ascii = (0x20..0x7E).all(|x| unicodes.contains(&x));

This saves us one more allocation. 🤷

cmyr · 2021-05-14T18:20:35Z

crates/fonticulus/src/fontinfo.rs

+    if unicodes_contains('Þ') && has_ascii {
+        code_page_ranges.insert(0); // Latin 1
+    }
+    if unicodes_contains('Ľ') && has_ascii {
+        code_page_ranges.insert(1); // Latin 2
+    }
+    if unicodes_contains('Б') {
+        code_page_ranges.insert(2); // Cyrillic
+        if unicodes_contains('Ѕ') && has_lineart {
+            code_page_ranges.insert(57); // IBM Cyrillic
+        }
+        if unicodes_contains('╜') && has_lineart {
+            code_page_ranges.insert(49); // MS-DOS Russian
+        }
+    }


now I'm just having fun, but I would be tempted to at least handle this simple case of "check char and addd to table" with a macro:

some time later:

Okay I'm going to not let myself rabbit-hole on this too hard, but as an illustration: with the following macro, I can replace this section with:

macro_rules! add_page_if_needed { // helper for if there's no condition and no block ($collection:ident, $ranges:ident, $chr:expr, $page:expr) => { add_page_if_needed!($collection, $chr, $page, true, {}); }; // helper for if there's a condition but no block ($collection:ident, $ranges:ident, $chr:expr, $page:expr, $constraint:expr) => { add_page_if_needed!($collection, $ranges, $chr, $page, $constraint, {}); }; // the verbose form ($collection:ident, $ranges:ident, $chr:expr, $page:expr, $constraint:expr, $block:block) => { if $collection.contains(&($chr as u32)) && $constraint { $ranges.insert($page); $block; } }; }

Suggested change

if unicodes_contains('Þ') && has_ascii {

code_page_ranges.insert(0); // Latin 1

}

if unicodes_contains('Ľ') && has_ascii {

code_page_ranges.insert(1); // Latin 2

}

if unicodes_contains('Б') {

code_page_ranges.insert(2); // Cyrillic

if unicodes_contains('Ѕ') && has_lineart {

code_page_ranges.insert(57); // IBM Cyrillic

}

if unicodes_contains('╜') && has_lineart {

code_page_ranges.insert(49); // MS-DOS Russian

}

}

add_page_if_needed!(unicodes, code_page_ranges,'Þ', 0, has_ascii);

add_page_if_needed!(unicodes, code_page_ranges,'Ľ', 1, has_ascii);

add_page_if_needed!(unicodes, code_page_ranges, 'Б', 2, true, {

add_page_if_needed!(unicodes, code_page_ranges,'Ѕ', 57, has_lineart);

add_page_if_needed!(unicodes, code_page_ranges, '╜', 49, has_lineart);

});

In any case: this is provided more for illustrative purposes, and rust macros can be confusing, but I wanted to point this out as a way I would think about this problem. :)

m4rc1e · 2021-05-15T19:03:08Z

Thank you very much for your review @cmyr! I hope it wasn't too painful. I've only been studying the existing code and the Rust book. Are there other resources that you would recommend? I mainly come from a python background but have knowledge of C/C++. I find myself fighting the compiler a lot when it comes to membership and lifetimes so any links that helped you grok these topics quicker would be much appreciated.

cmyr · 2021-05-15T19:12:14Z

I wish I had a good answer, but I largely don't. When I was becoming familiar with the language I spent a lot of time hanging out in a #rust-beginners irc channel; it doesn't exist anymore, but I believe there's a pretty active discord community? Another very useful resource is the this week in rust newsletter; skimming that and reading things that interest you will be to just be ambiently aware of how other folks are writing the language.

Other than that, clippy and rustfmt really help avoid a lot of common problems; it really lets people avoid the bulk of argument over formatting and idiom, and clippy will often point out more idiomatic alternatives (like suggesting my_vec.clear() instead of my_vec.truncate(0), as an example).

For the membership and lifetimes stuff: generally as folks write more rust they start to develop an intuition for a lot of this stuff. One thing that can be useful is to really try to imagine the stack, and imagine that things owned 'higher' in the call stack can be passed 'down' the stack, but things owned 'lower' can't be passed up, because when the function returns that memory is freed.

m4rc1e · 2021-05-21T13:35:04Z

@cmyr I've implemented your feedback and moved most of the functionality to the os2 module. We now have two methods which can generate the codepage bits from a list or from a ufo's codepage attribute. Feel free to take another look. It isn't urgent. Once you're happy, I'll tackle the unicode ranges.

cmyr

A few more comments that you're totally free to ignore.

I think that the code that you're writing works, and I think that's the most important thing; a lot of these other optimizations can come along later. That said, I have a sneaking suspicion that somebody is going to be using this library ten years in the future, so I don't think any effort spent on code quality etc is wasted time?

crates/fonticulus/src/basictables.rs

cmyr · 2021-05-21T18:28:34Z

src/os2.rs

+    }
+    /// implementation based on ufo2ft:
+    /// https://github.com/googlefonts/ufo2ft/blob/main/lib/ufo2ft/util.py#l307
+    pub fn calc_code_page_ranges(&mut self, mapping: &BTreeMap<u32, u16>) {


So now that I have a clearer idea of what we're doing, it feels like what we really want here is a function that creates two 32-bit integers in place. This is the sort of low-level thing that can often be annoying to do, but with Rust we could make this quite easy. What I would personally do is make a helper type that looks like a collection but is just doing bit arithmetic under the hood, something like:

#[derive(Default)] struct CodePageRanges { one: u32, two: u32, } impl CodePageRanges { fn set(&mut self, idx: usize) { assert!(idx < 64); if idx >= 32 { self.two |= 1 << (idx - 32); } else { self.one |= 1 << idx; } } fn into_inner(self) -> (u32, u32) { (self.one, self.two) } }

And then you can just call this directly, everwhere that you are using code_page_ranges:

let mut code_page_ranges = CodePageRanges::default(); if unicodes_contains('Þ') && has_ascii { code_page_ranges.set(0); // Latin 1 }

This then does everything at once; we're building up our final representation directly, as we add each new value.

Nice! this is much cleaner.

Co-authored-by: Colin Rofls <colin@cmyr.net>

m4rc1e · 2021-06-07T09:05:09Z

@simoncozens feel free to merge this. I will implement the unicode range bits next so I may end up refactoring this part. I'm going to keep Colin's comments visible since they may help me later.

simoncozens · 2021-06-07T11:04:55Z

Thanks very much for this - looks great!

m4rc1e force-pushed the codepages branch 5 times, most recently from d2b5cf7 to a2a02f8 Compare May 13, 2021 14:13

m4rc1e marked this pull request as ready for review May 13, 2021 14:19

m4rc1e force-pushed the codepages branch from 5d81c19 to 5240d26 Compare May 14, 2021 10:43

cmyr reviewed May 14, 2021

View reviewed changes

m4rc1e added 4 commits May 18, 2021 10:49

calc_code_page_ranges added

5859d0c

cleanup and add simple test

7cc0185

Implement Colin feedback

0d707ea

Move codepage functionality to os2 module

20273e0

m4rc1e force-pushed the codepages branch from 5240d26 to 20273e0 Compare May 18, 2021 09:49

run rustfmt on changed/new files

1a2ee13

cmyr reviewed May 21, 2021

View reviewed changes

Update crates/fonticulus/src/basictables.rs

d2f7a93

Co-authored-by: Colin Rofls <colin@cmyr.net>

m4rc1e force-pushed the codepages branch from 1d98ec8 to d2f7a93 Compare June 2, 2021 16:18

simoncozens merged commit 46226b0 into simoncozens:main Jun 7, 2021

m4rc1e deleted the codepages branch June 8, 2021 08:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Calculate Code Page ranges #23

Calculate Code Page ranges #23

m4rc1e commented May 12, 2021

m4rc1e commented May 13, 2021 •

edited

m4rc1e commented May 14, 2021

simoncozens commented May 14, 2021

moyogo commented May 14, 2021

simoncozens commented May 14, 2021

cmyr left a comment

cmyr May 14, 2021

m4rc1e May 15, 2021

cmyr May 14, 2021

m4rc1e May 15, 2021

cmyr May 15, 2021

cmyr May 14, 2021

m4rc1e May 15, 2021

cmyr May 14, 2021

cmyr May 14, 2021

m4rc1e May 15, 2021

cmyr May 15, 2021

cmyr May 14, 2021 •

edited

m4rc1e May 15, 2021

cmyr May 14, 2021

m4rc1e commented May 15, 2021

cmyr commented May 15, 2021

m4rc1e commented May 21, 2021 •

edited

cmyr left a comment

cmyr May 21, 2021

m4rc1e May 25, 2021

m4rc1e commented Jun 7, 2021

simoncozens commented Jun 7, 2021

		let code_pages1: Vec<u8> = code_pages.iter().filter(\|&x\| *x < 32).cloned().collect();
		let code_pages2: Vec<u8> = code_pages.iter().filter(\|&x\| *x >= 32).map(\|&x\|x - 32).collect();

Calculate Code Page ranges #23

Calculate Code Page ranges #23

Conversation

m4rc1e commented May 12, 2021

m4rc1e commented May 13, 2021 • edited

m4rc1e commented May 14, 2021

simoncozens commented May 14, 2021

moyogo commented May 14, 2021

simoncozens commented May 14, 2021

cmyr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cmyr May 14, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

m4rc1e commented May 15, 2021

cmyr commented May 15, 2021

m4rc1e commented May 21, 2021 • edited

cmyr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

m4rc1e commented Jun 7, 2021

simoncozens commented Jun 7, 2021

m4rc1e commented May 13, 2021 •

edited

cmyr May 14, 2021 •

edited

m4rc1e commented May 21, 2021 •

edited