An example project where Rust code prints the length of an uploaded file.
Run python3 -m http.server (or equivalent: https://gist.github.com/willurd/5720255), then access http://[::]:8000/
Writing a "pdf file object parser". Starting with parsing individual objects. There are 8 types of objects:
- Boolean Objects
- Numeric Objects
- String Objects
- Name Objects
- Array Objects
- Dictionary Objects
- Stream Objects
- Null Object
We will also need to parse
- Indirect Object definitions (12 0 obj)
- Indirect object references (12 0 R),
- File structure: Header, body, cross-reference table, trailer.
It can now round-trip (objects, not yet an entire PDF file) via JSON. That is, if you dump to JSON and read back, you will get the exact same bytes.
- This is not as big a deal as it sounds, because we could in principle dump the sequence of bytes into JSON as an array of numbers. However, here we're doing slightly more than that.
Assumes the input is valid, e.g. does not check in dict for unique keys, does not check for stream length, etc.
Out of 19560 PDF files I have, this works correctly for 8724 of them.
As of 2022-04-30 (970471e): Works for 19262 out of 19560 files. So fails for 298 (not all of which are actually PDF files).
As of 2022-05-01 (child of 970471e): Works for 19430 out of 19562 files. So fails for 132 files.
As of 2022-05-01 (after deleting some dupes): Works for 19382 out of 19493 files. So fails for 111 files.
As of 2022-05-01 11:52: Works for 19420 out of 19493 files. So fails for 73 files.
As of 2022-05-01 14:20 (e54b45e): Works for 19426 out of 19492 files. So "fails" for 66 files. Looked at each of them. They are all malformed in some way or the other.