Conversation
d2b5cf7
to
a2a02f8
Compare
This now produces the same result as the implementation in ufo2ft. I had to update I've added a simple test to check that the basic multilingual plane will enable all the specified codepages. I can add more meaningful tests if needed. However, due to this lib being in constant flux, it may be better to do this later. @simoncozens I'm worrying about my overuse of clone and cloned when I'm using iterators and closures. Would be great to see what you would recommend. |
Perhaps it may be better to move this to the os2 module since it computes the codepages using the cmap instead of ufo.info? |
That sounds like a nice idea: a method on an os2 that takes &mut self and a cmap table and fills in the relevant fields. Yes please! |
This would also be useful as a single tool. |
Ping @cmyr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, a bunch of notes inline.
As a general observation: cargo fmt
and cargo clippy
are very friendly, and I would encourage you to use them both; I'll look into turning them on in CI for this repo.
crates/fonticulus/src/basictables.rs
Outdated
@@ -164,6 +165,15 @@ pub fn compile_os2( | |||
.open_type_os2_subscript_x_size | |||
.unwrap_or((upm * 0.65).round() as i32) as i16; | |||
|
|||
let unicodes:HashSet<u32> = mapping.keys().cloned().collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of allocating a new HashSet
, we could probably just pass in a reference to the HashMap
directly, and use HashMap::contains_key
? One possible downside of this is that it might make calc_code_page_ranges
have a slightly weird signature, and harder to use from other places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to move this function so it becomes a method of the os2 table since it makes more sense for the code to live there. I may just end up making the method signature take the cmap table as input but I'll think about it.
crates/fonticulus/src/basictables.rs
Outdated
@@ -164,6 +165,15 @@ pub fn compile_os2( | |||
.open_type_os2_subscript_x_size | |||
.unwrap_or((upm * 0.65).round() as i32) as i16; | |||
|
|||
let unicodes:HashSet<u32> = mapping.keys().cloned().collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the .cloned()
method on Iterator
, there's also a copied()
method. In practice these will behave identically, but using copied()
where possible makes it clearer to the reader that this is a cheap operation.
A few other ways to write this:
let unicodes: HashSet<_> = mapping.keys().copied().collect();
let unicodes = mapping.keys().copied().collect::<HashSet<_>>();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Which approach would you say is the best to adopt for the future? personally I find the turbofish style a little weird but for all I know this may be the most common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally probably do the second form more often, but I wouldn't comment on the first form in code review 🤷
crates/fonticulus/src/basictables.rs
Outdated
let code_pages:Vec<u8> = info.open_type_os2_code_page_ranges.as_ref() | ||
.unwrap_or(&calc_code_page_ranges(&unicodes)) | ||
.iter() | ||
.cloned() | ||
.collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few things:
as written, this is doing some extra work.
unwrap_or
: this always generates the code passed in, so we will runcalc_code_page_ranges
even if we already have the ranges. The alternative isunwrap_or_else
, which takes a closure that is evaluated lazily.iter..collect
:calc_code_page_ranges
returns exactly theVec<u8>
that we want, but here we're then iterating it again to create a copy.- if we're only using
unicodes
in the case where the code page ranges don't exist, we should avoid creating it unless necessary.
Put all together, I would write this as something like:
let code_pages = info.open_type_os2_code_page_ranges
.clone() // clone on an option clones the contents if it exists, otherwise does nothing
.unwrap_or_else(|| { // only called in the `None` case
// type should be inferred based on the function we pass it to?
let unicodes = mapping.keys().copied().collect();
calc_code_page_ranges(&unicodes)
});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome thanks!
crates/fonticulus/src/basictables.rs
Outdated
let code_pages1: Vec<u8> = code_pages.iter().filter(|&x| *x < 32).cloned().collect(); | ||
let code_pages2: Vec<u8> = code_pages.iter().filter(|&x| *x >= 32).map(|&x|x - 32).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write this as,
let mut code_pages = code_pages; // unnecessary, just give this a mut binding above
code_pages.sort_unstable(); // not necessary if already sorted
let split_at = code_pages.iter().position(|x| x >= 32).unwrap_or_else(|| code_pages.len());
let mut code_pages2 = code_pages.split_off(split_at);
code_pages2.iter_mut().for_each(|x| x -= 32);
crates/fonticulus/src/fontinfo.rs
Outdated
/// implementation based on ufo2ft: | ||
/// https://github.com/googlefonts/ufo2ft/blob/main/lib/ufo2ft/util.py#l307 | ||
pub fn calc_code_page_ranges(unicodes: &HashSet<u32>) -> Vec<u8> { | ||
let mut code_page_ranges = HashSet::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is there a reason to prefer
HashSet
toVec
, here? - would it make sense for us to generate the two separate vecs here, instead of doing it at the call-site?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to prefer HashSet to Vec, here?
You're right, this could and probably should be a vec.
would it make sense for us to generate the two separate vecs here, instead of doing it at the call-site?
We could even go a step further and return a tuple containing two 32bit ints which is the desired end result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh totally, that's the sort of thing that rust excels at, and then we're doing way less work.
crates/fonticulus/src/fontinfo.rs
Outdated
let mut code_page_ranges = HashSet::new(); | ||
|
||
let ascii_range: HashSet<u32> = (0x20..0x7E).collect(); | ||
let has_ascii = ascii_range.is_subset(&unicodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be tempted to write this as,
let has_ascii = (0x20..0x7E).all(|x| unicodes.contains(&x));
This saves us one more allocation. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
crates/fonticulus/src/fontinfo.rs
Outdated
if unicodes_contains('Þ') && has_ascii { | ||
code_page_ranges.insert(0); // Latin 1 | ||
} | ||
if unicodes_contains('Ľ') && has_ascii { | ||
code_page_ranges.insert(1); // Latin 2 | ||
} | ||
if unicodes_contains('Б') { | ||
code_page_ranges.insert(2); // Cyrillic | ||
if unicodes_contains('Ѕ') && has_lineart { | ||
code_page_ranges.insert(57); // IBM Cyrillic | ||
} | ||
if unicodes_contains('╜') && has_lineart { | ||
code_page_ranges.insert(49); // MS-DOS Russian | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now I'm just having fun, but I would be tempted to at least handle this simple case of "check char and addd to table" with a macro:
some time later:
Okay I'm going to not let myself rabbit-hole on this too hard, but as an illustration: with the following macro, I can replace this section with:
macro_rules! add_page_if_needed {
// helper for if there's no condition and no block
($collection:ident, $ranges:ident, $chr:expr, $page:expr) => {
add_page_if_needed!($collection, $chr, $page, true, {});
};
// helper for if there's a condition but no block
($collection:ident, $ranges:ident, $chr:expr, $page:expr, $constraint:expr) => {
add_page_if_needed!($collection, $ranges, $chr, $page, $constraint, {});
};
// the verbose form
($collection:ident, $ranges:ident, $chr:expr, $page:expr, $constraint:expr, $block:block) => {
if $collection.contains(&($chr as u32)) && $constraint {
$ranges.insert($page);
$block;
}
};
}
if unicodes_contains('Þ') && has_ascii { | |
code_page_ranges.insert(0); // Latin 1 | |
} | |
if unicodes_contains('Ľ') && has_ascii { | |
code_page_ranges.insert(1); // Latin 2 | |
} | |
if unicodes_contains('Б') { | |
code_page_ranges.insert(2); // Cyrillic | |
if unicodes_contains('Ѕ') && has_lineart { | |
code_page_ranges.insert(57); // IBM Cyrillic | |
} | |
if unicodes_contains('╜') && has_lineart { | |
code_page_ranges.insert(49); // MS-DOS Russian | |
} | |
} | |
add_page_if_needed!(unicodes, code_page_ranges,'Þ', 0, has_ascii); | |
add_page_if_needed!(unicodes, code_page_ranges,'Ľ', 1, has_ascii); | |
add_page_if_needed!(unicodes, code_page_ranges, 'Б', 2, true, { | |
add_page_if_needed!(unicodes, code_page_ranges,'Ѕ', 57, has_lineart); | |
add_page_if_needed!(unicodes, code_page_ranges, '╜', 49, has_lineart); | |
}); | |
In any case: this is provided more for illustrative purposes, and rust macros can be confusing, but I wanted to point this out as a way I would think about this problem. :)
Thank you very much for your review @cmyr! I hope it wasn't too painful. I've only been studying the existing code and the Rust book. Are there other resources that you would recommend? I mainly come from a python background but have knowledge of C/C++. I find myself fighting the compiler a lot when it comes to membership and lifetimes so any links that helped you grok these topics quicker would be much appreciated. |
I wish I had a good answer, but I largely don't. When I was becoming familiar with the language I spent a lot of time hanging out in a Other than that, clippy and rustfmt really help avoid a lot of common problems; it really lets people avoid the bulk of argument over formatting and idiom, and clippy will often point out more idiomatic alternatives (like suggesting For the membership and lifetimes stuff: generally as folks write more rust they start to develop an intuition for a lot of this stuff. One thing that can be useful is to really try to imagine the stack, and imagine that things owned 'higher' in the call stack can be passed 'down' the stack, but things owned 'lower' can't be passed up, because when the function returns that memory is freed. |
@cmyr I've implemented your feedback and moved most of the functionality to the os2 module. We now have two methods which can generate the codepage bits from a list or from a ufo's codepage attribute. Feel free to take another look. It isn't urgent. Once you're happy, I'll tackle the unicode ranges. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments that you're totally free to ignore.
I think that the code that you're writing works, and I think that's the most important thing; a lot of these other optimizations can come along later. That said, I have a sneaking suspicion that somebody is going to be using this library ten years in the future, so I don't think any effort spent on code quality etc is wasted time?
} | ||
/// implementation based on ufo2ft: | ||
/// https://github.com/googlefonts/ufo2ft/blob/main/lib/ufo2ft/util.py#l307 | ||
pub fn calc_code_page_ranges(&mut self, mapping: &BTreeMap<u32, u16>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now that I have a clearer idea of what we're doing, it feels like what we really want here is a function that creates two 32-bit integers in place. This is the sort of low-level thing that can often be annoying to do, but with Rust we could make this quite easy. What I would personally do is make a helper type that looks like a collection but is just doing bit arithmetic under the hood, something like:
#[derive(Default)]
struct CodePageRanges {
one: u32,
two: u32,
}
impl CodePageRanges {
fn set(&mut self, idx: usize) {
assert!(idx < 64);
if idx >= 32 {
self.two |= 1 << (idx - 32);
} else {
self.one |= 1 << idx;
}
}
fn into_inner(self) -> (u32, u32) {
(self.one, self.two)
}
}
And then you can just call this directly, everwhere that you are using code_page_ranges
:
let mut code_page_ranges = CodePageRanges::default();
if unicodes_contains('Þ') && has_ascii {
code_page_ranges.set(0); // Latin 1
}
This then does everything at once; we're building up our final representation directly, as we add each new value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! this is much cleaner.
Co-authored-by: Colin Rofls <colin@cmyr.net>
@simoncozens feel free to merge this. I will implement the unicode range bits next so I may end up refactoring this part. I'm going to keep Colin's comments visible since they may help me later. |
Thanks very much for this - looks great! |
Still mega wip. I'm basing my implementation on ufo2ft which in turn is based on FontForge.