Output JSON schema during build process #176

Closed
lylemoffitt opened this issue Dec 24, 2016 · 16 comments

@lylemoffitt

It would be great if Serde could optionally produce a JSON schema as a side-effect of the build process. AFAIK it has all the information it needs to write one. You just need to translate the structs/enums to their appropriate schema representations (read: matching JSON type).

Additional:

While the above is an awesome starting block, it would also be really nice if you could compile-time check that Serde's JSON will validate against an externally provided schema. This isn't totally necessary, as you could do this after the fact with a tool like ajv. It would just provide stronger guarantees if it was compile-time checked.

Motivation

  • Compatibility: Presently there is no way to guarantee that JSON produced by Serde is compatible with another framework. We can only write tests against JSON samples and write code to match an API spec. We have no way of knowing if either of them is up-to-date or correct.
  • Extendability: Having a portable artifact of your data-representation is an enormously useful tool. In many dynamic languages, you can auto generate data bindings and UIs provided a schema. This allows devs to quickly develop across platforms and languages while maintaining integrity of their data.

Anticipated Questions:

  • Why Serde? - Serde already has all of the user-facing machinery necessary to produce a schema. Using attributes and types already in the user's code makes adding this feature "free" for existing libraries.
  • Why at compile-time? - Validating against a schema at compile-time enables devs to "Hack without fear", because they will know that they are properly encoding their data types. It allows devs to easily update their code and immediately know if their schema/data-bindings are out of date.
@dtolnay
Member

dtolnay commented Dec 24, 2016

It would be great if Serde could optionally produce a JSON schema as a side-effect of the build process. AFAIK it has all the information it needs to write one. You just need to translate the structs/enums to their appropriate schema representations (read: matching JSON type).

This is tricky. It seems like Serde has all that information but in reality we don't. At compile time, we have type information about only one struct/enum at a time. For example we might know that your struct S has a field whose type is a::b::c::D<T> but there is no way to use that to find out more about D's type or what types T is used as. Serde's job is to stamp out some code that implements Serialize/Deserialize, and for that we blindly assume that a::b::c::D already implements Serialize/Deserialize without knowing anything else about it.

Conversely at runtime Serde deals with values only, not types. So there would be no way for Serde to come up with a JSON schema for a particular type unless we can get a concrete instance of that type from somewhere and try serializing it. And even then we would be blind to any shenanigans the type might try to do, like serializing itself as an integer if the day of the month is prime and serializing itself as a string otherwise.
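To make the "shenanigans" point concrete, here is a minimal sketch. The `Json` enum and `ToJson` trait are hypothetical stand-ins for a serialized value and for Serde's `Serialize` trait (they are not Serde's API), and `Date` is an invented type. The sketch only shows that a hand-written impl can pick its output shape from the value, so serializing one instance says little about the type as a whole.

```rust
/// Toy stand-in for a serialized JSON value (hypothetical, not Serde's API).
#[derive(Debug, PartialEq)]
enum Json {
    Number(u32),
    Text(String),
}

/// Toy stand-in for a serialization trait: it only ever sees a value.
trait ToJson {
    fn to_json(&self) -> Json;
}

/// Invented example type.
struct Date {
    day_of_month: u32,
}

/// Trial division; good enough for days 1..=31.
fn is_prime(n: u32) -> bool {
    n >= 2 && (2..n).take_while(|d| d * d <= n).all(|d| n % d != 0)
}

impl ToJson for Date {
    fn to_json(&self) -> Json {
        // The output *shape* depends on the value, so no single
        // schema can be read off from one sample instance.
        if is_prime(self.day_of_month) {
            Json::Number(self.day_of_month)
        } else {
            Json::Text(self.day_of_month.to_string())
        }
    }
}
```

Calling `to_json` on a `Date` with day 7 yields a number, while day 8 yields a string, so a schema inferred from either sample alone would be wrong for the other.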

serde-rs/serde#345 is tracking a similar request.

If this is a feature you need (rather than "wouldn't it be great if..."), I think a more promising place to start would be implementing this as a compiler plugin, similar to how Clippy works. I think they have a lot more type information at that stage than what we have in Serde.

@lylemoffitt
Author

@dtolnay

It seems like Serde has all that information but in reality we don't.

I'm not really familiar with the implementation of Serde, but from your description it sounds like you have enough information, you just don't have it all at once. There is a set of unconnected (from the point of view of Serde) points of analysis:

  1. When Serde derives struct S, it knows the names of its fields (what the schema calls properties), and their types (but not if they're templates?).
  2. When Serde derives a::b::c::D<T>, and assuming it isn't also a struct, it must determine what JSON type it serializes to, or what named values it serializes to if it's a string enum.

Even if these points aren't connected, they can still be defined individually, and combined after the fact. Even if the type information is missing for a sub-field, when the struct is added to the schema, you can reference it by name and it will work later. Since all types used in a struct must be serializable (i.e. implement or derive Serialize) in order to be able to serialize it later, we can know that at least by the end of the compilation, the schema will be complete.

If this is a feature you need...

This isn't something I need in the long run, it's more something I want for Rust in the long run. Having something like this would really help make it even more compelling for web devs and server architects (not exclusively, just especially). If this is something that's really, truly, impossible for Serde, my next recommendation (if anyone is watching) is to use external mixins to call one of the JS, Python, or Ruby libraries that could get this done. But that would at best be a hack, and not really the most appropriate thing for Rust long term. I agree Clippy would be a good choice for linting (validating) against the schema.

@lylemoffitt
Author

Referring to what @oli-obk said on serde-rs/serde#345:

Nope. Serde works with values of concrete types. You want to work with concrete types directly.

Could you elaborate?

@dtolnay
Member

dtolnay commented Dec 24, 2016

Nope. Serde works with values of concrete types. You want to work with concrete types directly.

Could you elaborate?

This is the same thing I was getting at above:

Conversely at runtime Serde deals with values only, not types. So there would be no way for Serde to come up with a JSON schema for a particular type unless we can get a concrete instance of that type from somewhere and try serializing it.

In other languages like Java or Go, data serialization is typically built on runtime reflection which lets you do things like this at runtime:

  • @lylemoffitt: Language, give me information about the type S.
  • Language: S is a struct type with 6 fields.
  • @lylemoffitt: Give me information about the first field of S.
  • Language: It is called x and the type is com.enterprise.Foo, which itself is a struct.
  • @lylemoffitt: Tell me more about the type com.enterprise.Foo.
  • Language: Sure, no problem.

Rust does not have runtime reflection.

  • @lylemoffitt: Rust, give me information about the type S.
  • Rust: lol you wish 🤓

Instead Serde serializes structs by generating code at compile time to serialize them.

  • @lylemoffitt: Please tell me any time someone writes #[derive(Serialize)]
  • Rust: bro, someone just wrote that on "pub struct S { x: Bar }"
  • @lylemoffitt: Cool so the first field is called x? What type is Bar?
  • Rust: lol I have no idea
  • @lylemoffitt: Is it in the same crate as S? Does it implement Serialize?
  • Rust: it's called Bar bro, your guess is as good as mine
  • @lylemoffitt: Is it a real type though? Like are you going to be able to compile it later?
  • Rust: maybe? chill out bro, we'll find out
  • @lylemoffitt: I guess "impl Serialize for S { /* ... */ }"
  • Rust: someone just wrote #[derive(Serialize)] on "pub enum Bar { /* ... */ }"
  • @lylemoffitt: Cool, that's the same Bar as before right?
  • Rust: I can't tell you

Hopefully that clarifies the limitations of putting together a complete JSON schema at Serde's position in the stack. I didn't know about "$ref", thanks for pointing that out. Possibly we could do something like generate an individual schema for each struct that we process that uses "$ref" to refer to all of the fields, but there is no guarantee that you would be able to wire those together later. For example if someone does use A as B we would generate a ref to B and a schema for A with no indication that those are the same type.
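The wiring problem can be sketched without any macro machinery. In the snippet below, `DeriveInput`, `fragment`, and `dangling_refs` are hypothetical names made up for illustration (this `DeriveInput` is not the one from the `syn` crate). The sketch assumes the derive step sees field types only as the text written in the source, which is the situation described above.

```rust
use std::collections::BTreeSet;

/// What a derive step sees for one struct: its name, plus each field's
/// name and the field type exactly as written in the source (just text).
struct DeriveInput {
    name: &'static str,
    fields: Vec<(&'static str, &'static str)>,
}

/// Emit one schema fragment per struct, pointing at field types by
/// their textual name only.
fn fragment(input: &DeriveInput) -> String {
    let props: Vec<String> = input
        .fields
        .iter()
        .map(|(field, ty)| format!("\"{}\":{{\"$ref\":\"#/definitions/{}\"}}", field, ty))
        .collect();
    format!("{{\"type\":\"object\",\"properties\":{{{}}}}}", props.join(","))
}

/// Once every fragment has been collected, list the referenced type
/// names that no fragment defines.
fn dangling_refs(inputs: &[DeriveInput]) -> BTreeSet<&'static str> {
    let defined: BTreeSet<&str> = inputs.iter().map(|i| i.name).collect();
    inputs
        .iter()
        .flat_map(|i| i.fields.iter().map(|(_, ty)| *ty))
        .filter(|ty| !defined.contains(ty))
        .collect()
}
```

If struct `S` has a field written as `x: B` after a `use A as B`, the fragment for `S` refers to `#/definitions/B`, a fragment exists only under the name `A`, and `dangling_refs` reports `B`. Nothing at this stage can tell that the two names are the same type.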

In very simplified form, compilation works like this:

  • Compile any other crates that are transitive dependencies of the current crate.
  • Parse the current crate. This produces very basic syntactic information. Is this Bar the same as that Bar? Don't know.
  • Do macro expansion, which includes the Serde code generation personified above.
  • Resolve what each identifier refers to, do type inference, type checking, borrow checking and a lot of other magic.
  • Run compiler plugins like Clippy. These get to take advantage of a lot more type information than Serde does, but are also much more unstable and break more often.
  • Make an executable.

That is why I suggested Clippy as a more promising starting point for this. Not that JSON schema should be built into Clippy, but that something at that stage would have all of the relevant information available and would be able to do the job better. It may make sense for the Serde team to own this functionality but basically none of what we already have is going to be helpful, so we would be starting from scratch whether it goes in Serde or into a separate library.

@lylemoffitt
Author

Excellent explanation. Really clears things up! Love your personification of Rust.

To summarize:

  1. Serde has the names of types, but not the actual types.
  2. Serde exists only during the macro expansion phase and not during the type resolution phase.

This leads me to some questions:

  1. Doesn't Serde have a compiler plugin (on nightly)? I understand that you're working towards Macros 2.0, but aren't quite there yet. Does this change what phase Serde runs in, or just add an extra phase?
  2. Would it be possible to have Serde generate a partial, intermediate, representation of a schema that could later be fixed and updated by a compiler plugin?
  3. I found an implementation of RTTI for Rust. Is this any different than what you're doing with Serde (on stable)? Looking at the source, it looks like mostly macros.
  4. Is type info for other, dependent crates available during macro expansion?

Thank you for bearing with me by the way.

@oli-obk
Member

oli-obk commented Dec 24, 2016

It would be possible to add a static method on the Serialize trait that has a Sized bound. This method would then return some enum describing the type (which might require calling the method on the field's type). You'd basically get a runtime representation of the type, barring any information only available if you have a concrete instance. At that point we're replicating the rtti crate but with macros 1.1. Thus I think this should be outside the serialize/deserialize traits, but I could see it being in serde.

Even then, you can't do compile-time checks, but you can simply write a unit test validating the format against a schema.
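A rough sketch of the static-method idea, under stated assumptions: `TypeInfo`, `Describe`, and `to_schema` are hypothetical names and are not part of Serde, and the impls for `Bar` and `S` are written by hand to show what a derive macro might plausibly generate. The macro only has to name each field's type; the compiler resolves `<Bar as Describe>::describe()` after expansion.

```rust
/// Runtime description of a type's shape (hypothetical; not part of Serde).
#[derive(Debug, PartialEq)]
enum TypeInfo {
    Integer,
    Text,
    Struct {
        name: &'static str,
        fields: Vec<(&'static str, TypeInfo)>,
    },
}

/// Hypothetical trait with a static method: no value is needed to call it.
trait Describe {
    fn describe() -> TypeInfo
    where
        Self: Sized;
}

impl Describe for u32 {
    fn describe() -> TypeInfo {
        TypeInfo::Integer
    }
}

impl Describe for String {
    fn describe() -> TypeInfo {
        TypeInfo::Text
    }
}

#[allow(dead_code)]
struct Bar {
    id: u32,
}

#[allow(dead_code)]
struct S {
    x: Bar,
    label: String,
}

// Hand-written here; a derive macro could emit the same code from syntax alone.
impl Describe for Bar {
    fn describe() -> TypeInfo {
        TypeInfo::Struct {
            name: "Bar",
            fields: vec![("id", <u32 as Describe>::describe())],
        }
    }
}

impl Describe for S {
    fn describe() -> TypeInfo {
        TypeInfo::Struct {
            name: "S",
            fields: vec![
                ("x", <Bar as Describe>::describe()),
                ("label", <String as Describe>::describe()),
            ],
        }
    }
}

/// Render a `TypeInfo` as a small JSON Schema-like string.
fn to_schema(info: &TypeInfo) -> String {
    match info {
        TypeInfo::Integer => "{\"type\":\"integer\"}".to_string(),
        TypeInfo::Text => "{\"type\":\"string\"}".to_string(),
        TypeInfo::Struct { fields, .. } => {
            let props: Vec<String> = fields
                .iter()
                .map(|(name, ty)| format!("\"{}\":{}", name, to_schema(ty)))
                .collect();
            format!("{{\"type\":\"object\",\"properties\":{{{}}}}}", props.join(","))
        }
    }
}
```

Because the description is only available at runtime, a check such as comparing `to_schema(&S::describe())` with an expected schema string would live in a unit test rather than in the build itself, which matches the limitation noted above.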

@dtolnay
Member

dtolnay commented Dec 24, 2016

  1. Doesn't Serde have a compiler plugin (on nightly)? I understand that you're working towards Macros 2.0, but aren't quite there yet. Does this change what phase Serde runs in, or just add an extra phase?

Serde has a "procedural macro" for nightly (will be stabilized in Rust 1.15 in February). This is the same phase as what used to be the serde_macros "compiler plugin," just the mechanics have changed in order to stabilize a part of it. My talk at the most recent SF meetup discusses how these work (start at 3:00).

  2. Would it be possible to have Serde generate a partial, intermediate, representation of a schema that could later be fixed and updated by a compiler plugin?

@oli-obk can speak more to how compiler plugins work since he has contributed extensively to Clippy. My current understanding is that it would be better to just do it all in a compiler plugin. It should have all the same syntactic information plus also type information.

  3. I found an implementation of RTTI for Rust. Is this any different than what you're doing with Serde (on stable)? Looking at the source, it looks like mostly macros.

As @oli-obk responded above, this is equivalent to what Serde is doing and has all the same limitations.

  4. Is type info for other, dependent crates available during macro expansion?

No, just syntax-level information about the current crate. Think of it as nothing more than the textual source code of the current crate.

@oli-obk
Member

oli-obk commented Dec 24, 2016

Using a compiler plugin to spit out a schema for a type is almost trivial once you get a handle on the type. Clippy actually has a very rudimentary inspection lint that almost does that. I'd be happy to mentor any extensions to it.

@lylemoffitt
Author

@dtolnay Thank you for referring me to that video! I was looking for something more recent than the RustConf2016 I had seen. Was this video on TWiR? I hadn't seen it there...

If it's ok, I have some questions about your talk:

  1. Do you have to use an AST in the middle, or will any old string manipulation work (assuming the result is valid Rust code)?
  2. Can I redirect the generated code to another file?
  3. Does the derived source code replace the written code, or just append it?
  4. Can quote! generate further derive attributes or other macro calls that will later be expanded?

@lylemoffitt
Author

@oli-obk Could you refer me to the specific lint you're talking about?

@lylemoffitt
Author

@dtolnay Thanks for pointing out Valico! I totally didn't know they had schema building & validation. That really helps solve a chunk of this problem. I don't know if a direct port is the best thing though. If you're adding it to Serde, it's probably best to try and generalize to try and support schema for other languages. Is Serde's existing par/gen infrastructure something that would be helpful here?

@dtolnay
Member

dtolnay commented Dec 25, 2016

If it's ok, I have some questions about your talk

I moved it to https://github.com/dtolnay/talks/issues/1 to not derail the discussion here 🎈.

If you're adding it to Serde, it's probably best to try and generalize to try and support schema for other languages. Is Serde's existing par/gen infrastructure something that would be helpful here?

Good call, but whether we generate JSON schema directly or a higher-level broadly applicable Serde schema, eventually we will need a way to get a JSON schema so I would rather reuse an existing high-quality implementation of that.

@oli-obk
Member

oli-obk commented Dec 25, 2016

dtolnay changed the title from "Feature request: Output JSON schema during build process" to "Output JSON schema during build process" on Feb 17, 2017
@dtolnay
Member

dtolnay commented May 7, 2018

I would be interested in seeing this handled by a separate crate dedicated to JSON schema.

dtolnay closed this as completed on May 7, 2018
@H2CO3
Contributor

H2CO3 commented Jul 14, 2018

@lylemoffitt @dtolnay FYI, I'm working on a crate for generating MongoDB-flavored JSON schemas: https://github.com/H2CO3/magnet. It's not generating 100% standard-compliant JSON schema because MongoDB's spec is more precise and powerful (and I need it for document validation), but it's close, and I think my approach is pretty reasonable should someone want to extend/build upon it.

@realcr

realcr commented Jan 14, 2020

For future readers that seek a solution, I found this repository:
https://github.com/GREsau/schemars
