A guide for Mozilla's developers and data scientists to analyze and interpret the data gathered by our data collection systems.

Mozilla Data Documentation

Overview

This documentation was written to help Mozillians analyze and interpret data collected by our products, such as Firefox and Mozilla VPN.

At Mozilla, our data-gathering and data-handling practices are anchored in our Data Privacy Principles and elaborated in the Mozilla Privacy Policy.

To learn more about what data Firefox collects and the choices you can make as a user, please see the Firefox Privacy Notice.

The rendered documentation is hosted at https://docs.telemetry.mozilla.org/.

Issues for this documentation are tracked in Bugzilla (file a bug).

Building the Documentation

The documentation is rendered with mdBook. We use a fork named mdbook-dtmo that includes a number of custom additions to mdbook for our environment (for example, a plugin to automatically generate a table-of-contents).

You can download mdbook-dtmo on the GitHub releases page. Please use the latest version. Unpack it and place the binary in a directory of your $PATH.
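
For example, a minimal install sketch for a Linux release (the archive name is illustrative; use the actual file name from the releases page):

# unpack the release archive and put the binary on your $PATH
tar xf mdbook-dtmo-<version>-x86_64-unknown-linux-gnu.tar.gz
mv mdbook-dtmo ~/.local/bin/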

If you have rustc already installed, you can install a pre-compiled binary directly:

curl -LSfs https://japaric.github.io/trust/install.sh | sh -s -- --git badboy/mdbook-dtmo

Make sure the installation directory is on your $PATH, or copy the binary to a directory that is.

You can also build and install the preprocessors:

cargo install mdbook-dtmo

You can then serve the documentation locally with:

mdbook-dtmo serve
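
Since mdbook-dtmo wraps standard mdbook, the usual mdbook options should pass through. A sketch (--open and --port are standard mdbook flags; that the wrapper forwards them unchanged is an assumption):

# rebuild on change, open a browser tab, and serve on a custom port
mdbook-dtmo serve --open --port 3001

# or build the static site into book/ without serving
mdbook-dtmo build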

The complete documentation for the mdBook toolchain is available online at https://rust-lang.github.io/mdBook/. If you run into any problems, please let us know. We are happy to change the tooling to make it as much fun as possible to write.

Spell checking

Articles should use proper spelling, and pull requests will be automatically checked for spelling errors.

Technical articles often contain words that are not recognized by common dictionaries. If this happens, you may either put the specialized terms in code blocks, or add an exception to the .spelling file in the code repository.

For things like dataset names or field names, code blocks should be preferred. Things like project names or common technical terms should be added to the .spelling file.
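
For example, if the spell checker flags a project name such as Fenix, a one-line sketch for adding an exception (assuming the conventional one-term-per-line .spelling format):

# append a project-name exception to the spell-check exception list
echo "Fenix" >> .spelling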

The markdown-spell-check package checks spelling as part of the build process. To run it locally, install Node.js (if not already installed) and run npm install at the root of the repository. Then run the scripts/spell_check.sh script.

You may also remove the --report parameter to begin an interactive fixing session. In this case, it is highly recommended to also add the --no-suggestions parameter, which greatly speeds things up.
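
Putting it together, a local session might look like this (a sketch: mdspell is the command-line tool provided by the spell-check package, and the file glob is illustrative):

# one-time setup at the repository root
npm install

# non-interactive report, as in CI
scripts/spell_check.sh

# interactive fixing session, faster with suggestions disabled
npx mdspell --en-us --no-suggestions "src/**/*.md"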

Link checking

Any web links should be valid. A dead link might not be your fault, but you will earn a lot of good karma by fixing a dead link!

The markdown-link-check package checks links as part of the build process. Note that dead links do not fail the build: links often go dead for all sorts of reasons, and making it a required check constantly caused otherwise-fine pull requests to appear broken. Still, you should check the status of this check yourself when submitting a pull request: you can do this by looking at the Travis CI status after submitting it.

To run link checking locally, run the installation steps described for spell checking if you haven't already, then run the scripts/link_check.sh script.
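
To check only the page you are editing, you can also invoke the underlying tool directly (a sketch; the path is illustrative):

# check the links in a single file rather than the whole tree
npx markdown-link-check src/path/to/your_page.md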

Markdown formatting

We use prettier to ensure a consistent formatting style in our markdown. To reduce friction, this is not a required check, but running it on the files you're modifying before submission is highly appreciated! Most editors can be configured to run prettier automatically; see, for example, the Prettier Plugin for VSCode.

To run prettier locally on the entire repository, run the installation steps described for spell checking if you haven't already, then run the scripts/prettier_fix.sh script.
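
For example (a sketch; --write is prettier's standard in-place formatting flag, and the path is illustrative):

# reformat the whole repository
scripts/prettier_fix.sh

# or format just the file you touched
npx prettier --write src/path/to/your_page.md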

Contributing

See contributing for detailed information on making changes to the documentation.

Comments
  • Replace gitbook with mdbook

    This replaces gitbook with mdbook.

    It currently uses the mdbook-dtmo wrapper, which bundles plugins for mermaid and ToC. Binary releases for Linux, Mac and Windows are provided on the release page (though the Windows build is untested). It's based on the latest git version of mdbook, which includes relevant bug fixes. I intend to update it to a stable released version of mdbook as soon as one is released.

    Upstreaming/changing how we do the mermaid/toc processing is tracked in #186.

    Known remaining issues:

    • I had to add empty chapters for some parts, see commit https://github.com/mozilla/firefox-data-docs/commit/ffdc0b09fc81aa9cec0d398a37d6d20380680016
    • I had to remove the top-level "Bug template" link and move it to the "Contributing" chapter

    Live view: https://badboy.github.io/firefox-data-docs/

    Fixes #162, #149

    I can hold this back until the currently open PRs are merged and rebase it on top of that, or I can help with rebasing the PRs after the fact.

    opened by badboy 20
  • adding best practices for telemetry events design

    Added cookbook for telemetry events design best practices. RE this bug

    Not sure where in the docs this should live, so put it into cookbooks by default.

    opened by SuYoungHong 9
  • Describe Weekday regulars v1 and All-week regulars v1

    DAU shows strong weekly seasonality. Some of this is due to users who are only present Monday-Friday (in their timezone). Weekday-only clients likely have a different use case from all-week users: they may be on work or school computers. And post-COVID we saw a drop in weekday-only clients and an increase in all-week clients, so it’s worth tracking these separately.

    “Weekday regulars v1” and “All-week regulars v1” are sub-segments of regular users v3: we do not try to classify non-regular users, because we may not have enough data points to confidently classify a non-regular user as predominantly using the browser on weekdays. A broader usage threshold than “regular users v3” could have been chosen here, but some threshold would still have been needed; for the sake of simplicity, let’s just compute this for regular users v3.

    Timezones: the submission_date of a ping does not necessarily match the local day that the browser was used. Monday in Australia starts on Sunday UTC. Therefore we should allow some flexibility around what is considered a “weekend”: for an individual client the weekend could be Friday/Saturday, Saturday/Sunday, or Sunday/Monday. Looking at Australian/NZ clients, we see that there are many more weekday-only users if we allow Sunday/Monday UTC to be considered the weekend, and not many more weekday-only users if we allow Friday/Saturday UTC to be considered the weekend. For US clients, the effect is reversed and smaller.

    The upshot is that we need to include Sunday/Monday UTC as “the weekend” so that we fairly represent AU/NZ weekday-only users. Including Friday/Saturday UTC as “the weekend” is less essential, but seems to add little noise to the segment so let’s just do it.

    Out of the last 27 days, we can count the number of weekend days that a client was active, using the SQL snippet:

    BIT_COUNT(
      cls.days_seen_bits & 0x0FFFFFFE & (
        ((udf.bits28_from_string('0000011000001100000110000011') << 14)
          + udf.bits28_from_string('00000110000011')
        ) >> (8 - EXTRACT(DAYOFWEEK FROM cls.submission_date))
      )
    )
    

    where cls refers to the clients_last_seen table. (We can add OR statements with copies of this snippet that add or subtract 1 from the day of week, to handle the other timezones.)

    If we plot DAU for the weekday-only users (i.e. where the above snippet evaluates to 0), then we see a strongly-seasonal curve, albeit one that is nonzero on weekends because as usual we segment today’s data only on historical behaviour.

    If we relax our strictness and allow one historical day of weekend use in the past 27 days, then we capture a lot more weekday use and only a little more weekend use (see weekday_only_dau vs almost_weekday_only_dau on https://sql.telemetry.mozilla.org/queries/71353/source#179227). If we allow two historical days of weekend use in the past 27 days, then we capture more weekday use but start to include substantial weekend use. And the complement of this segment (all-week regular users) still has a lot of weekly seasonality - so I think two days is too much.

    Finally, we should decide whether to have segments both for “weekday regulars” and “all-week regulars”, or just define “weekday regulars” and let people compute “all-week regulars” as all clients who are Regular Users v3 but not a weekday regular. It’s probably going to be easier for people if we do both for them.

    opened by felixlawrence 5
  • Add BigQuery cookbook

    This is my initial pass on BigQuery documentation. Please r? and provide any suggestions and changes.

    I think we need more examples but also believe those should probably live in the dataset documentation for the tables.

    opened by jasonthomas 5
  • Add draft of getting review doc

    This document isn't ready to be submitted yet, but I want to get some early feedback. This doc keeps getting pre-empted by competing work and I have a suspicion that this document may trigger some larger process discussions. I'd like to get those conversations started now.

    This needs a lot of organization, but most of the content is there. High-level commentary welcome! Let's be sure to use Reviewable since I'll likely revise this document a few times.

    You can preview the rendered documentation here


    opened by harterrt 5
  • Draft documentation for DS workflow for adding to clients_last_seen

    I will be shopping around this workflow to some Data Scientists for review. It is intended to empower them to explore new feature usage definitions in a manner that has a clear path to being included in clients_last_seen if it proves useful.

    I'm coming to the conclusion that a general clients_all_time or feature_usage_all_time table is not feasible in the short term, but the technique is still useful for rapid prototyping of bit patterns that can be used to investigate new user segmentation, feature usage definitions, etc.

    Relevant to the Data Warehouse daily aggregations sub-project.

    opened by jklukas 4
  • Update for Fenix release

    This needs more work, probably, but this is the ~minimal edit to avoid pointing people at the Nightly table for now.

    aside: I love editing pipe tables by hand; it's great

    opened by tdsmith 4
  • Update the doc for Activity-Stream dataset

    @SuYoungHong r?

    Hey Su, I've updated this doc to reflect the current state of the Activity-Stream dataset.

    Change highlights:

    • Update all the document links to point to the Firefox source, as the GitHub repo has been archived
    • Update the database and table names, as we've now switched from Redshift to BigQuery
    • Some caveats were invalidated by the database migration
    • Update the sample queries

    Let me know what you think.

    Thanks!

    opened by ncloudioj 4
  • Add a page for standard metric definitions.

    Create a place to build out definitions for standard metrics, and improve discoverability by bringing it to the top of the navigation and linking from the Getting Started page.

    The definitions are based directly on GUD.

    opened by mreid-moz 4
  • Add retention cookbook

    This is a cookbook that guides the reader through retention analysis. This is the most recent draft after an informal review by the majority of the Product Data Science Team.

    opened by benmiroglio 4