I was planning to release 0.3 as discussed, with the change in `join` such that supplying columns to an equi-join requires an additional character to make columns vs. conditions explicit:
-join salaries [id] # implicit
+join salaries [~id] # explicit
join l=locations [e.office_address == l.address]
This is because a bare `id` is treated as a bool condition, just as `e.office_address == l.address` is. (And in theory, `id` could be a bool column.)
I'm still fine to do the release and assess feedback, as discussed. But I wanted to raise whether we should:
- consider making a change back to the previous implicit behavior before the release, given that this would be a breaking change to a fundamental part of the language. If we're 80/20 on allowing the previous implicit behavior, making the change now would avoid the breaking change without that much cost. As others pointed out yesterday, many of the immediate benefits come from better development on `main`, rather than big user-facing changes in a version number.
- consider adding something to help users in making the transition; e.g. an error on the existing approach
On the specific language change, I see it as a tradeoff between syntactic simplicity and semantic simplicity:
- Syntactic simplicity / brevity — joining on a shared column is very frequent, particularly in a well-designed schema. Things that are expressed frequently should have small[^1] representations.
- Almost never are we actually joining based on a bool column. (I hadn't even realized that the previous design was ambiguous). It would be doing a cross-join based on a bool column from one of the tables — so unlikely.
- What do folks think about the extra complexity for users? Possibly it's actually not that much of a burden to understand "Represent [`USING` / an equi-join / a join between identically named columns] with `~`"? (Maybe it feels bigger to us because it's a change and it's so prominent in our docs and examples?)
- Semantic simplicity / generality: having `[id]` mean something different from `[id==true]` in a `join` breaks the encapsulation of the expression. The compiler needs to understand what's inside the expression; uncorrelated concepts become correlated, and the language becomes less general and less orthogonal.
For example (unlikely but possible): is `bar` a condition or a column in the join? I guess it's a condition because we know it's `baz==bax`. But if `bar` were materialized as a column in the DB, then the behavior suddenly changes.
derive bar = baz==bax
join x [bar] # is `bar` an implicit column in both tables? Or `baz==bax`?
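To make the encapsulation point concrete, here is a minimal Python sketch of the two rules side by side. It is not the actual compiler; `lower_implicit`, `lower_explicit` and the string-based handling of expressions are hypothetical simplifications for illustration. Under the implicit rule, the output depends on whether the expression happens to be a bare name; under the explicit rule, only the `~` marker matters.

```python
import re

# Hypothetical, simplified notion of "a bare column name".
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lower_implicit(table: str, expr: str) -> str:
    """Previous rule as described above: a bare name is read as a shared column."""
    expr = expr.strip()
    if IDENT.fullmatch(expr):
        # The rule has to look inside the expression to decide what it means.
        return f"JOIN {table} USING({expr})"
    # Expression text is passed through untranslated in this sketch.
    return f"JOIN {table} ON {expr}"


def lower_explicit(table: str, expr: str) -> str:
    """Proposed rule: only a leading `~` is read as a shared column."""
    expr = expr.strip()
    if expr.startswith("~"):
        name = expr[1:].strip()
        if not IDENT.fullmatch(name):
            raise ValueError(f"`~` must be followed by a column name, got {name!r}")
        return f"JOIN {table} USING({name})"
    # Everything else, including a bare name, is a boolean condition.
    return f"JOIN {table} ON {expr}"


if __name__ == "__main__":
    for rule in (lower_implicit, lower_explicit):
        for expr in ("bar", "~bar", "baz == bax"):
            print(f"{rule.__name__}({expr!r}) -> {rule('x', expr)}")
```

In this sketch, `bar` lowers to `USING(bar)` under the implicit rule and to `ON bar` under the explicit one, which is the behavior change being discussed.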
I've been supportive of #919, which increases generality; this would go against that theme.
If anyone has ideas for an alternative representation to `~`, feel free to suggest them! Though I actually think that `~` is pretty good.
One alternative would be to have a different parameter, e.g. `using:[id]`; but given that the conditions parameter would still be required, we'd have an awkward `join locations using:[id]`.
If we do go the explicit route, is there something we can do to make this clearer for users? I would find this quite confusing if I weren't watching the releases and all of a sudden this compiles to something completely different:
join x [bar]
- JOIN x USING(bar)
+ JOIN x ON bar
PRQL has a higher ratio of expectations&excitement vs. users than most projects, so it's fine to make breaking changes atm. But this is potentially quite severe. Assuming we go the explicit route, should we raise an error for a bare column name for a few versions so it's at least obvious when people do this?
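As a rough illustration of what such a transitional error could look like, here is a small Python sketch. The names `check_join_filters` and `BareJoinColumnError`, and the wording of the message, are all hypothetical; nothing like this exists in the compiler today, and a real check would work on the parsed expression rather than on strings.

```python
import re
from typing import Iterable

# Hypothetical, simplified notion of "a bare column name".
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class BareJoinColumnError(ValueError):
    """Raised during the transition period when a join filter is a bare name."""


def check_join_filters(filters: Iterable[str]) -> None:
    """Reject bare names so the change in meaning is never silent."""
    for raw in filters:
        f = raw.strip()
        if IDENT.fullmatch(f):
            raise BareJoinColumnError(
                f"join condition `{f}` is a bare name; write `~{f}` to join on "
                f"a shared column, or `{f} == true` to use it as a boolean condition"
            )


if __name__ == "__main__":
    check_join_filters(["~id", "e.office_address == l.address"])  # accepted
    try:
        check_join_filters(["id"])
    except BareJoinColumnError as err:
        print(err)
```

The idea is that for a few versions neither reading of a bare name is accepted, so existing queries fail loudly instead of compiling to different SQL.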
Without wanting to zoom out too far, possibly it's worth considering this in the context of overall joins; e.g. #716 & #723
Where do folks end up? As I said before, @aljazerzen has full rights to respond with 😫, and I'll do the release. `semantic` was really Herculean, and we're still young enough that we probably underrate velocity.
For transparency, if we do decide to make the change, I'm flat-out with non-PRQL stuff until mid-week, after which I have more time and would be happy to work on this. I'm quite excited to get into working with the new compiler!
[^1]: "small" here means both in character count and syntactic complexity; in this case, `~` is small in character count but adds to syntactic complexity. For theory around compression, check out source coding; I can find better references if folks are interested.