Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New cache layout broke VS code plugin #4069

Closed
ry opened this issue Feb 21, 2020 · 48 comments
Closed

New cache layout broke VS code plugin #4069

ry opened this issue Feb 21, 2020 · 48 comments

Comments

@ry
Copy link
Member

ry commented Feb 21, 2020

Previously Deno cached HTTP resources in a directory structure that was similar to the URL

It looked like this

$DENO_DIR/deps/https/deno.land/std/examples/curl.ts

This was a very simple and nice system as it's very obvious how URLs map to the cache. VS code plugin made assumptions about this layout.

However we discovered there are some URLs that don't easily map to this structure - in particular URLs with query strings, or URLs that don't have a file extension but serve TS/JS. (See #4030)

So now we use a cache structure which is flat and hashes URL

$DENO_DIR/deps/https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd

The VS Code plugin is very important - so let's figure out how to resolve this problem.

cc @axetroy @bartlomieju

@kitsonk
Copy link
Contributor

kitsonk commented Feb 21, 2020

What if there was a manifest in $DENO_DIR/deps that contained information like:

{
  "https://deno.land/foo/bar.ts": "https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd"
}

Potentially more data could be cached in the key, like if there were redirects from the requested URL to the final one.

@bartlomieju
Copy link
Member

Using manifest file was my initial thought, but we decided to skip it because URLs are consistently hashed. Adding such file, should not be a big problem though.

Waiting for @axetroy to let us know what are requirements for extension to make it work

@axetroy

This comment has been minimized.

@axetroy

This comment has been minimized.

@axetroy

This comment has been minimized.

@lucacasonato
Copy link
Member

@axetroy I secondary benefit of the hashed files was that it would fix the collision between importing 'example.com/mymodule' and 'example.com/mymodule/mod.ts', because before this would not work because 'directory could not be created because a file with name mymodule already exists'.

I also think creating huge manifest files will create some perf issues, but I think currently they are the best idea due to all of the benefits that the hashed file names provide. What if there was one manifest per origin?

@kitsonk
Copy link
Contributor

kitsonk commented Feb 22, 2020

I'm not sure it would be that horrible versus the current situation. We already write out headers and other stuff, and we don't expire stuff in the cache, so if we are intelligent about it, we wouldn't even touch the file unless we are actually writing out something to the cache. Also Deno shouldn't use it internally as an index or anything, it simply should write stuff out. If someone deletes only a part of a cache but doesn't delete the manifest file, that is there problem.

@bartlomieju
Copy link
Member

eg.
https://deno.land/std/examples/curl.ts
$DENO_DIR/deps/https/deno.land/std/examples/[hash1]

https://deno.land/std/examples/curl.ts?foo=bar
$DENO_DIR/deps/https/deno.land/std/examples/[hash2]

This will not work, because of the collisions @lucacasonato mentioned.

You can find https://deno.land/std/path/mod.ts which modules it depends on, but you don't know what the cache file > $DENO_DIR/deps/https/deno.land/72ee5916977ca9d8801c801f642353d811373786e51e3d7574cca966634b4f97 depends on

I'm not sure I understand this argument - contents of the file are not changed; it's just a filename in DENO_DIR that is changed compared to original request.
Also DENO_DIR is implementation detail of Deno and shouldn't really be played with by users.

I'm not sure it would be that horrible versus the current situation. We already write out headers and other stuff, and we don't expire stuff in the cache, so if we are intelligent about it, we wouldn't even touch the file unless we are actually writing out something to the cache. Also Deno shouldn't use it internally as an index or anything, it simply should write stuff out. If someone deletes only a part of a cache but doesn't delete the manifest file, that is there problem.

Agreed; providing a "manifest" file (which is basically mapping between path part of URL and filename in DENO_DIR) for each origin/host is very simple; if that would unblock the extension I can open a PR today.

@IllusionPerdu
Copy link

I don't think they are collisions in the @axetroy proposition the hash is calculate with the part of the last path : after the last / => filename.

@lucacasonato exemple :

https://example.com/mymodule mymodule is filename and not the directory =>
$DENO_DIR/deps/https/example.com/[hash1]

https://example.com/mymodule/mod.ts mod.ts is filename =>
$DENO_DIR/deps/https/example.com/mymodule/[hash2]

https://example.com/mymodule/ mymodule is the directory the filename is ''=>
$DENO_DIR/deps/https/example.com/mymodule/[hash3]

https://example.com/?mymodule/ ?mymodule/ is the filename because they are ?=>
$DENO_DIR/deps/https/example.com/[hash4]

the part after # is ignored because is a fragment of page

Sorry for my bad english ;)

@axetroy
Copy link
Contributor

axetroy commented Feb 22, 2020

@bardiarastin

Also DENO_DIR is implementation detail of Deno and shouldn't really be played with by users.

Deno users should not care about $ DENO_DIR

However, developers of third-party tools may need, eg, development of an editor extension / cache module update tool / dependency analysis, etc.

Agreed; providing a "manifest" file (which is basically mapping between path part of URL and filename in DENO_DIR) for each origin/host is very simple; if that would unblock the extension I can open a PR today.

This is probably the simplest solution and does not break the current cache layout.

But manifest may cause performance issues because it requires frequent file reading and serialization and deserialization

Why the new cache layout broke vscode extension?

Sorry i didn't describe it clearing why the new cache layout broke vscode extension

In older cache layout

Take the module https://deno.land/std/path/mod.ts as an example

Remote File: https://deno.land/std/path/mod.ts
Cache File: $DENO_DIR/deps/https/deno.land/std/path/mod.ts

if we open the cache file in editor you will see

// Copyright the Browserify authors. MIT License.
// Ported from https://github.com/browserify/path-browserify/

import * as _win32 from "./win32.ts";
import * as _posix from "./posix.ts";

import { isWindows } from "./constants.ts";

const path = isWindows ? _win32 : _posix;

export const win32 = _win32;
export const posix = _posix;
export const resolve = path.resolve;
export const normalize = path.normalize;
export const isAbsolute = path.isAbsolute;
export const join = path.join;
export const relative = path.relative;
export const toNamespacedPath = path.toNamespacedPath;
export const dirname = path.dirname;
export const basename = path.basename;
export const extname = path.extname;
export const format = path.format;
export const parse = path.parse;
export const sep = path.sep;
export const delimiter = path.delimiter;

export { EOL, SEP, SEP_PATTERN, isWindows } from "./constants.ts";
export * from "./interface.ts";
export * from "./glob.ts";
export * from "./globrex.ts";

Q: So how does the extension find the . /win32.ts module?
A:

we can get the current cache file's url(https://deno.land/std/path/mod.ts) from filepath($DENO_DIR/deps/https/deno.land/std/path/mod.ts)

so. The module./win32.ts url will be https://deno.land/std/path/win32.ts. and cache file $DENO_DIR/deps/https/deno.land/std/path/win32.ts

Everything is alright.

In new cache layout

Remote File: https://deno.land/std/path/mod.ts
Cache File: $DENO_DIR/deps/https/deno.land/[hash4mod]

Then in file $DENO_DIR/deps/https/deno.land/[hash4mod], we cannot find module ./win32.ts

./win32.ts module's cache file should be $DENO_DIR/deps/https/deno.land/[hash4win32]

hash4win32 = SHA256('/std/path/win32.ts')

As long as we can calculate the hash4win32, then we can find the ./win32.ts module

Unfortunately, we don’t know what the current URL.

So we need a mapping to get url: $DENO_DIR/deps/https/deno.land/[hash4mod] > https://deno.land/std/path/mod.ts

Here we get /std/path/mod.ts and then get /std/path/win32.ts

Finally, we can calculate hash4win32 and resolve module.

@axetroy
Copy link
Contributor

axetroy commented Feb 22, 2020

@xpyxel
Deno may not need it because deno run xxx.ts only reads the file once

But in the Typescript plugin, it does need to be read frequently

Every time the file changes\recompilation in editor, needs to read-> serialize.

@hqnna
Copy link
Contributor

hqnna commented Feb 22, 2020

Yeah I realized that after I commented, sorry about that lol @axetroy

@bardiarastin
Copy link
Contributor

Hi, this is the third time that I'm being mentioned instead of someone else on this project but for the third time THANK YOU FOR ALL THE HARD WORK ON DENO ❤️

@bartlomieju
Copy link
Member

@IllusionPerdu thanks for the example, but the problem is you used URLs which all could be valid filenames. Take for example jsdom; the very first import statement reads:
import { dew } from "/npm:jsdom@16.2.0/lib/api.dew.js";. AFAIK colon cannot be used in filename on Windows, so this solution will not work.

@bartlomieju
Copy link
Member

@axetroy thanks for explanation, now I see what's the problem clearly. I don't have a good solution off top of my head right now besides the manifest/mapping file. I understand your concerns about serialization performance; however maybe we should start with this solution as it's pretty simple and see if it's really too slow?

@axetroy
Copy link
Contributor

axetroy commented Feb 22, 2020

@bartlomieju
I'm not sure if it's really slow, but we can try, at least it won't break the current cache layout.

And, can you print what the manifest/mapping file looks like if use this solution?

@bartlomieju
Copy link
Member

@axetroy I think we can do it per origin (so per directory in cache)

// $DENO_DIR/deps/deno.land/manifest.json

{
  "/std/testing/asserts.ts": "<SHA1 hash>",
  "/std@v0.34.0/fmt/strings.ts": "<SHA1 hash>",
  "/x/some?other=path": "<SHA1 hash>",
  ...
}

Would that work?

@axetroy
Copy link
Contributor

axetroy commented Feb 22, 2020

@bartlomieju it works and look good for me.

@bartlomieju
Copy link
Member

Good, I'll get to implementing it shortly

@axetroy
Copy link
Contributor

axetroy commented Feb 23, 2020

1

work fine with manifest.json

@IllusionPerdu
Copy link

IllusionPerdu commented Feb 23, 2020

You would have to add a command in Deno to check the consistency between the cache files and the manifest because for some reason someone can manually delete files in the cache or edit the manifest.

For security, the manifest could by the same designation define an imprint of the files.

// $DENO_DIR/deps/deno.land/manifest.json

{
"/std/testing/asserts.ts": {file:"", crc:""},
"/std@v0.34.0/fmt/strings.ts": {file:"", crc:""},
"/x/some?other=path": {file:"", crc:""},
...
}

@ry
Copy link
Member Author

ry commented Feb 24, 2020

I'm worried about the speed of manifest.json when it's filled with many entries. Reading and writing to the cache becomes O(N) where it was previously O(1).

@kitsonk
Copy link
Contributor

kitsonk commented Feb 24, 2020

Would a distributed manifest be a problem. We already write out the headers, so write out in some sort of meta data file the original URL? That would put the work on the plugin to go and collect all the meta-data files to create a map, and what the FS events on the DENO_DIR for any new files and add them to the cache. For Deno it would be still be O(1).

@ry
Copy link
Member Author

ry commented Feb 24, 2020

@kitsonk By distributed manifest you mean having a manifest per directory? I guess that would work... but I'm still wondering if we can't avoid the manifest entirely.

@axetroy I'm a unclear why the VS Code plugin cannot call "deno info" ? We can modify "deno info" to fit your needs. It would be useful if the DENO_DIR was a private structure - that would allow us to do refactors more easily.

@axetroy
Copy link
Contributor

axetroy commented Feb 24, 2020

@ry Yes, I agree to privatize $DENO_DIR

This way, even if the cache layout changes again, third-party tools will not need to change

If we no longer depend on $DENO_DIR

Then deno info needs to provide the mapping function of url <---> filepath

eg.

$ deno info $DENO_DIR/https/example.com/[hash]
# output
https://example.com/demo/mod.ts

$ deno info https://example.com/demo/mod.ts
# output
$DENO_DIR/https/example.com/[hash]

@axetroy
Copy link
Contributor

axetroy commented Feb 24, 2020

@ry

I'm a unclear why the VS Code plugin cannot call "deno info" ?

Commands can be called, but synchronously

This will cause the module parsing to be very slow, as it needs to frequently call deno info xxx

@ry
Copy link
Member Author

ry commented Feb 24, 2020

@axetroy how about we add an option to “deno info” so it will not download nor compile but only look in the cache? We can also add an option for JSON output.

Could you provide a link to where in the plugin “deno info” would be called?

@axetroy
Copy link
Contributor

axetroy commented Feb 24, 2020

@ry

@axetroy how about we add an option to “deno info” so it will not download nor compile but only look in the cache? We can also add an option for JSON output.

SGTM

In fact, I have already implemented the solution of mianfest.json

here is how extension found Deno module.

https://github.com/axetroy/vscode-deno/blob/894e9785d3ac9504bc62d8193983938365f5df7d/typescript-deno-plugin/src/plugin.ts#L259-L309

But I rethought a while andI think deno info xx is feasible

@axetroy
Copy link
Contributor

axetroy commented Feb 26, 2020

@bartlomieju

It seems that @ry is disagree with the mianfest.json solution, and wanna privatizes $DENO_DIR

So, If we use the deno info solution and make APIs similar to the following as much as possible

# locate from filepath
deno info $DENO_DIR/deps/https/deno.land/[hash] --cache-only --output=json
# locate from URL
deno info https://deno.land/std/path/mod.ts --cache-only --output=json

Stdout output:

{
  "filepath": "$DENO_DIR/deps/https/deno.land/[hash]",
  "url": "https://deno.land/std/path/mod.ts",
  "type": "TypeScript"
}

misc

If you use this approach, in TypeScript, you really need to call deno info frequently.

So I hope there are flags to make the json as short as possible

eg. module's dependency tree is optional. There are some modules(https://dev.jspm.io/jsdom) with a lot of dependencies and it is not necessary.

@bartlomieju
Copy link
Member

@axetroy yes, I discussed this issue with @ry

There's one thing I am not sure about:

# Map URL to filepath
deno info https://deno.land/std/path/mod.ts --cache-only --output=json

Above example is self-explanatory and easy to implement; what about the one below:

# Map filepath to URL
deno info $DENO_DIR/deps/https/deno.land/[hash] --cache-only --output=json

Since we want to privatize DENO_DIR then why do you need "reverse lookup"? Was that mistakenly put there, or you actually need the same functionality for "hashed" filenames?

That's also ref for #4140

@axetroy
Copy link
Contributor

axetroy commented Feb 26, 2020

Since we want to privatize DENO_DIR then why do you need "reverse lookup"? Was that mistakenly put there, or you actually need the same functionality for "hashed" filenames?

https://deno.land/std/path/mod.ts to $DENO_DIR/deps/https/deno.land/[hash] just need to calculate SHA256 of /std/path/mod.ts and it's easy.

But what I really need is map $DENO_DIR/deps/https/deno.land/[hash] to https://deno.land/std/path/mod.ts

Because in actual use case, I open the $DENO_DIR/deps/https/deno.land/[hash] file in the editor. I have to figure out what its URL is in order to find where its dependent modules are.

If without additional information, it seems that Deno cannot infer what its URL is from $DENO_DIR/deps/https/deno.land/[hash]

@lucacasonato
Copy link
Member

lucacasonato commented Feb 26, 2020

This might be stupid, but why do we actually hash the file names in the first place? To make sure we have a valid set of characters for the filesystem right? Not for security. Couldn't we just use base64 encoding or a fast terrible encryption algorithm? The name doesn't need to be one way hashed because it needs to be secure - it just needs to be rewritten to valid filesystem chars.

base64 is reversible so that would fix the issue right?

@bartlomieju
Copy link
Member

bartlomieju commented Feb 26, 2020

This might be stupid, but why do we actually hash the file names in the first place? To make sure we have a valid set of characters for the filesystem right? Not for security.

That's right

Couldn't we just use base64 encoding or a fast terrible encryption algorithm? The name doesn't need to be one way hashed because it needs to be secure - it just needs to be rewritten to valid filesystem chars.

base64 is reversible so that would fix the issue right?

Now that I think about it, it might actually work, but it would rather be Base32 - that way we'd handle both case-sensitive and case-insensitive file systems.

EDIT: There might be the issue of filename length though.

@lucacasonato
Copy link
Member

lucacasonato commented Feb 26, 2020

You're right - filename length could be an issue (base32 will increase file length by an average of 60%). A solution could be to split the base32 encoded strings into 255 char segments (this would fit 99% of filesystems), and then have a folder structure like this: $DENO_DIR/deps/https/segment1/segment2/segment3. Short file names would just be $DENO_DIR/deps/https/segment1.

Stitching these back together into a single string or splitting them apart is trivial.

@kitsonk
Copy link
Contributor

kitsonk commented Feb 26, 2020

Just to point out, base32 or base64 do seem a bit of overkill... they are designed to encode arbitrary data. URLs can only contain a certain set of characters. Though I am not aware of an encoding that is file safe that only deals with encoding valid URL characters.

The following text:

raw.githubusercontent.com/microsoft/TypeScript/v3.7.2/lib/lib.dom.d.ts

Encodes in base-32 as:

OJQXOLTHNF2GQ5LCOVZWK4TDN5XHIZLOOQXGG33NF5WWSY3SN5ZW6ZTUF5KHS4DFKNRXE2LQOQXXMMZOG4XDEL3MNFRC63DJMIXGI33NFZSC45DT

Also, from an encoding perspective, domains are case-insensitive, while that path and query are case sensitive.

@hqnna
Copy link
Contributor

hqnna commented Feb 27, 2020

Why not try Punycode?

The example string you used above:
raw.githubusercontent.com/microsoft/TypeScript/v3.7.2/lib/lib.dom.d.ts

would result in the following in punycode:
a8d609a3d662b370256d531e451b7d1781d78e7a

compared to the Base32 example you provided, it is 73 characters less.

@kitsonk
Copy link
Contributor

kitsonk commented Feb 27, 2020

@xpyxel how were you encoding it via punycode?

Punycode usually is for taking non-ASCII UNICODE chars and encoding them in a URL. When I do a punycode encoding of example https://😱.com/ would become https://xn--s38h.com/. For the example string, I get exactly what is in the example string, because they are valid encodings.

@hqnna
Copy link
Contributor

hqnna commented Feb 27, 2020

@kitsonk seems it was just a bug in the website i was using to test different encodings.

@brandonkal
Copy link
Contributor

Why not use something like https://github.com/mit-pdos/noria (Rust)
or even SQLite instead of a JSON table to perform the reverse lookup?

That should be more performant than constantly parsing, stringifying, and writing JSON to disk.

@brandonkal
Copy link
Contributor

Still, a descriptive filename should be considered a priority. Even with a language server, opening files which don't have the expected filename is not ideal, especially in other editors. At the very least, it should be:

$DENO_DIR/deps/https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd/curl.ts

instead of

$DENO_DIR/deps/https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd

There are only a few characters which are invalid in filenames:

/<>:"/\|?*`

I would think a better approach would be to replace those characters and then always add a file extension for http/https imports. So curl.ts becomes curl.ts.ts which allows imports that don't have an extension (it is added) and it doesn't conflict with a url that just ends with curl (it becomes curl.ts on disk).

@kitsonk
Copy link
Contributor

kitsonk commented Feb 28, 2020

@brandonkal I think it is unavoidable once you consider query strings would also make unique resources in the cache.

I also think you are missing & in your list of characters too.

@piscisaureus
Copy link
Member

There are only a few characters which are invalid in filenames:

You also have to consider:

  • URLs are case sensitive whereas some filesystems are not.
  • On Windows, file names can not end with a dot or a space.
  • Certain characters may be legal in file names, but are nonetheless undesirable because they can produce weird effects when used with xargs or similar tools. I don't know exactly which ones matter on unix, but I think ~ and ' are among them. On windows I would avoid % and ! and probably some others too.
  • In URLs, directory names can be "empty". That is "http://foo.land///" != "http://foo.land/", and "http://foo.land///../x" gets normalized to "http://foo.land//x".
  • The length of individual path components in an URL may exceed what the filesystem allows.

So, although I do understand where the desire to have recognizable file names comes from, it is not as straightforward as you make it sound -- there are many more gotchas.

As a compromise I could see a solution where we strip clusters of undesirable characters from the URL and replace them by an underscore, strip the ?query&etc=42 part, and then append a hash of the entire URL to it. So e.g /some...thing...//WUT* //odd.ts?blah becomes /some_thing_/wut_/_/odd.ts-1cb1ab6c0.

@brandonkal
Copy link
Contributor

I'm not sure why query strings changes things unless the old behaviour was to strip them off before creating a file path. If you replace ? with -1- for instance (numbering each invalid character) it doesn't seem to be an issue.

As far as I am aware, '&' is legal on unix and Windows but it does create issues with shell interpretation so it would be good to add that to the list.

@brandonkal
Copy link
Contributor

Good points. Something like

"$DENO_DIR/[id-in-DB or hash]/[stripped-file-basename].[appended-extension-from-content-type]"

would be a nice compromise for readable filenames in editors and search.

ry added a commit that referenced this issue Feb 28, 2020
Add original URL to metadata. This is so the VS Code Plugin can reverse
look up the URL for cache entries. Ref #4069.
@ry
Copy link
Member Author

ry commented Feb 29, 2020

@axetroy In 0099c28 I changed the cache so that in the download cache you have $HASH.metadata.json. In there you will find the original URL. This is a stop gap solution - ultimately I want to improve deno info - but maybe this will unblock you in 0.35.0.

@axetroy
Copy link
Contributor

axetroy commented Feb 29, 2020

@ry I have built Deno v0.35.0 locally and it works find.

waiting v0.35.0 release, then extensions will be released soon

@axetroy
Copy link
Contributor

axetroy commented Feb 29, 2020

vscode-deno@3.0.0 have been release and it works again with Deno v0.35.0, it is up to @ry to decide whether to close the issue

If implement the deno info API in the future or refactor a new cache layout, @ me again

1

@ry
Copy link
Member Author

ry commented Feb 29, 2020

@axetroy Awesome - glad to hear it's unbroken. I will close this issue but keep working on the "deno info" improvements.

@ry ry closed this as completed Feb 29, 2020
mhvsa pushed a commit to mhvsa/deno that referenced this issue Mar 6, 2020
Add original URL to metadata. This is so the VS Code Plugin can reverse
look up the URL for cache entries. Ref denoland#4069.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants