[Data]nymizer

Powerful database anonymizer with flexible rules. Written in Rust.

Overview

Datanymizer is created and supported by Evrone. See what else we develop with Rust.

You can find more information in articles in English and Russian.

How it works

Database -> Dumper (+Faker) -> Dump.sql

You can import or process your dump with the standard tools for the supported database, without 3rd-party importers.

Datanymizer generates a database-native dump.
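
For example, a complete round trip can be as small as this (the paths and connection strings here are hypothetical):

# dump with anonymization rules applied on the fly
pg_datanymizer -f /tmp/fake_dump.sql -c ./config.yml postgres://user:pass@localhost/source_db

# restore with the stock client, no extra tooling
psql -U user -d target_db < /tmp/fake_dump.sql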

Installation

There are several ways to install pg_datanymizer. Choose the option most convenient for you.

Pre-compiled binary

# Linux / macOS / Windows (MINGW, etc.). Installs into ./bin/ by default
$ curl -sSfL https://raw.githubusercontent.com/datanymizer/datanymizer/main/cli/pg_datanymizer/install.sh | sh -s

# Or a shorter way
$ curl -sSfL https://git.io/pg_datanymizer | sh -s

# Specify installation directory and version
$ curl -sSfL https://git.io/pg_datanymizer | sh -s -- -b /usr/local/bin v0.1.0

# Alpine Linux (wget)
$ wget -q -O - https://git.io/pg_datanymizer | sh -s

Homebrew / Linuxbrew

# Installs the latest stable release
$ brew install datanymizer/tap/pg_datanymizer

# Builds the latest version from the repository
$ brew install --HEAD datanymizer/tap/pg_datanymizer

Docker

$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer
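
The bare command above prints usage information. Here is a sketch of an actual dump via Docker; it assumes the image's entrypoint is the pg_datanymizer binary (as the issue reports below suggest), that config.yml sits in the current directory, and that the host name is hypothetical:

$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer -f /app/dump.sql -c /app/config.yml postgres://postgres:postgres@host.docker.internal/test_database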

Getting started with CLI dumper

Inspect your database schema, choose the fields that contain sensitive data, and create a config based on them.

# config.yml
tables:
  - name: markets
    rules:
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 1
                max: 2
            - words:
                min: 1
                max: 2
  - name: franchisees
    rules:
      operator_mail:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
      operator_name:
        first_name: {}
      operator_phone:
        phone:
          format: +###########
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 2
                max: 3
            - words:
                min: 2
                max: 3
  - name: users
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
  - name: customers
    rules:
      email:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
                uniq:  
                  required: true
                  try_count: 5
      phone:
        phone:
          format: +7##########
          uniq: true
      city:
        city: {}
      age:
        random_num:
          min: 10
          max: 99
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      birth_date:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00

Then create a dump from your database instance:

pg_datanymizer -f /tmp/dump.sql -c ./config.yml postgres://postgres:postgres@localhost/test_database

This creates a new dump file /tmp/dump.sql containing a native SQL dump for the PostgreSQL database. You can import the fake data from this dump into a new PostgreSQL database with the command:

psql -Upostgres -d new_database < /tmp/dump.sql

The dumper can stream the dump to STDOUT like pg_dump, so you can use it in other pipelines:

pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database > /tmp/dump.sql
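
For instance, you can skip the intermediate file entirely and pipe the anonymized dump straight into a fresh database (the database names are hypothetical):

pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database | psql -Upostgres -d new_database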

Additional options

Tables filter

You can specify which tables to dump and which to ignore.

To dump data only from public.markets and public.users:

# config.yml
#...
filter:
  only:
    - public.markets
    - public.users

To ignore these tables and dump data from all the others:

# config.yml
#...
filter:
  except:
    - public.markets
    - public.users

You can also specify data and schema filters separately.

This is equivalent to the previous example:

# config.yml
#...
filter:
  data:
    except:
      - public.markets
      - public.users

To skip both the schema and the data of all other tables:

# config.yml
#...
filter:
  schema:
    only:
      - public.markets
      - public.users

To skip the schema for the markets table and dump data only from the users table:

# config.yml
#...
filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets

Global variables

You can specify global variables available from any template rule.

# config.yml
tables:
  - name: users
    rules:
      bio:
        template:
          format: "User bio is {{var_a}}"
      age:
        template:
          format: "{{_0 * global_multiplicator}}"
#...
globals:
  var_a: Global variable 1
  global_multiplicator: 6
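
As we read the template rule, {{_0}} refers to the original column value (while {{_1}}, {{_2}}, ... refer to the results of nested rules), so with global_multiplicator: 6 an original age of 5 would be written to the dump as 30.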

Available rules

Rule         Description
email        Emails with different options
ip           IP addresses (supports IPv4 and IPv6)
words        Lorem words of different lengths
first_name   First name generator
last_name    Last name generator
city         City name generator
phone        Random phone numbers with different formats
pipeline     Use a pipeline to generate more complicated values
capitalize   Capitalizes the input value, like the template filter of the same name
template     Template engine for generating random text with embedded rules
digit        Random digit (in the range 0..9)
random_num   Random number with min and max options
password     Password with length options (supports min and max)
datetime     DateTime strings with options (from and to)

...and more than 70 rules in total.
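
As an illustration, here is a hedged sketch using a few rules from the table that do not appear in the example config above (the table name, field names, and exact option spellings are assumptions; check them against the docs):

# config.yml
tables:
  - name: accounts
    rules:
      pin:
        digit: {}        # random digit in 0..9
      password:
        password:        # length options, per the table above
          min: 10
          max: 16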

Uniqueness

You can specify that result values must be unique (they are not unique by default). You can use short or full syntax.

Short:

uniq: true

Full:

uniq:
  required: true
  try_count: 5

Uniqueness is ensured by re-generating a value when it is the same as a previous one. You can customize the number of attempts with try_count (an optional field; the default number of tries depends on the rule).

Currently, uniqueness is supported by: email, ip, phone, random_num.
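
Attached to a rule, the full syntax sits alongside the rule's other options, as in the customers example above; for instance:

phone:
  phone:
    format: +###########
    uniq:
      required: true
      try_count: 10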

Locales

You can specify the locale for individual rules:

first_name:
  locale: RU

The default locale is EN, but you can specify a different default locale:

tables:
  # ........  
default:
  locale: RU

We also support ZH_TW (Traditional Chinese) and RU (translation in progress).
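
A combined sketch (the table and field names are illustrative): RU as the default locale, with one rule overridden back to EN:

# config.yml
tables:
  - name: users
    rules:
      first_name:
        first_name:
          locale: EN   # overrides the default below
      last_name:
        last_name: {}  # uses the RU default
default:
  locale: RU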

Supported databases

  • PostgreSQL
  • MySQL or MariaDB (TODO)

Sponsors

Sponsored by Evrone

License

MIT

Comments
  • Fetch tables metadata takes too long

    Fetch tables metadata... takes more time than a full database dump.

    I have filters in place for only 4 tables I need out of 100+ but this still takes longer than I expect.

    opened by martingehrke 22
  • Error: no matching tables were found

@evgeniy-r I am facing an issue while using the CLI with my db. I am not sure whether I am using the config correctly; can you help me rectify the issue?

Thank you in advance.

    Background

    I have a database with multiple schemas and I want to copy over the schema default$default and the tables under that schema. Of course, I also anonymize certain fields in the tables.

    config.yml

    tables:
      - name: default$default.Profile
        rules:
          firstName:
            first_name: {}
            ...
            ...
    filter:
      schema:
        only:
          - default$default
    

    Output

    $ ./bin/pg_datanymizer -f backup.sql -c anonymizer/config.yml --accept_invalid_certs $POSTGRES_CONNECTION_STRING
    Prepare data scheme...
    pg_dump error. Command:
    pg_dump --section pre-data -t default$default <POSTGRES_CONNECTION_STRING>
    Output:
    pg_dump: error: no matching tables were found
    
    opened by marmik18 18
  • Issue with reading from key value store

Hi @evgeniy-r, I am facing some issues while using the key-value store. Please correct me if I am using it wrong. User is connected to Profile with a foreign key in the User table, so I use profile_id when writing the data to the store, and when reading I just use the id from the Profile table.

    Config

    tables:
      - name: User
        rules:
          phone:
            template:
              format: "{{ _1 }}{{ store_write(key='user_phonenumbers.' ~ prev.profile_id, value=_1) }}"
              rules:
                - phone:
                    format: "+############"
                    uniq: true
      - name: Profile
        rules:
          phone:
            # reading phone numbers from `user_phonenumbers` stored in `User`
            template:
              format: "{{ store_read(key='user_phonenumbers.' ~ prev.id) }}"
    

    Output

    ...
    Prepare to dump table: default$default.Profile
    Error: Failed transform Failed to render 'TemplateTransformerTemplate'
    ERROR: Job failed: exit code 1
    FATAL: exit code 1         
    

Just to confirm whether the template is giving an error on reading, I tried removing the template under Profile and replacing it with phone to randomize it, and it works that way. So I am quite sure that something is wrong, either with how I am implementing it or with the anonymizer.

    opened by marmik18 16
  • SSL support

I am trying to use datanymizer in an environment that requires an SSL connection (Heroku Postgres). It fails with an error:

    $ docker run --rm -v /Users/f213/prj/education/dev-db:/app -w /app datanymizer/pg_datanymizer "postgres://user:pwd@<REDACTED>.eu-west-1.compute.amazonaws.com:5432/<REDACTED>?sslmode=require"
    Error: error performing TLS handshake: no TLS implementation configured
    
    Caused by:
        no TLS implementation configured
    

Any chance this great project will support SSL?

    opened by f213 16
  • Postgres Dump missing Create statements

When I run pg_datanymizer against a GCP-managed SQL instance, I only get a data dump; it does not add any CREATE statements for the relations.

    Running pg_dump directly against the same database with the same credentials gives a full dump as expected.

    The command I am running is pg_datanymizer db --config /app/config.yml --host cloud-sql-proxy --username <redacted> --password <redacted> --file /app/dump.sql --port 5432

    The config just has table rules, no filters or anything like that.

    Would appreciate any guidance on this.

    opened by inno-asiimwe 7
  • Print message if there's newer version of datanymizer

Hi 👋

    What do you think about adding the update-informer crate to the project to print a message if there is a newer version of datanymizer?

update-informer supports GitHub releases.

    opened by mgrachev 6
  • Dump includes extra value per row for COPY on table I have configured

    It appears to impact fields that I have configuration for:

    tables:
      - name: user
        rules:
          name:
            person_name: {}
          birthdate:
            datetime:
              from: 1980-01-01T00:00:00+00:00
              to: 2021-07-24T00:00:00+00:00
    

    When loading the dump I see this error:

    COPY 0
    ERROR:  invalid input syntax for type timestamp with time zone: "Vanessa Wyman"
    CONTEXT:  COPY user, line 1, column birthdate: "Vanessa Wyman"
    
    COPY public.user("id", "firebase_uid", "created_at", "updated_at", "username", "bio", "verified", "flagged", "private_account", "avatar_image_id", "user_role_type", "avatar_foreground_color", "avatar_background_light_color", "avatar_background_dark_color", "name", "birthdate", "last_seen", "google_iap_uid") FROM STDIN;
    2afa72c2-f06a-4bcd-88e8-6bac7b32fa80	ANDROID- 17	2021-06-14 19:09:17.508424+00	2021-06-14 19:09:17.508424+00	somename	\N	f	f	f	\N	USER	\N	\N	\N	\N	Vanessa Wyman	2004-01-13T00:14:00+00:00	\N
    
    opened by derekr 6
  • Possible to exclude owner/privilege information

Using pg_dump and pg_restore, you can exclude owner/privilege information with --no-owner and --no-privileges. Restoring with psql < dump, as documented, doesn't allow this sort of thing. Is this possible without having to manually text-process the dump file?

    opened by mcg 5
  • COPY handle no fields

✓ Checklist:

    • [x] This PR has been added to CHANGELOG.md (at the top of the list);
    • [x] Tests for the changes have been added (for bug fixes / features);
    • [x] Docs have been added / updated (for bug fixes / features).

    See issue #146 for more information

    opened by mbeynon 4
  • In-place data modification

Hi! Is it possible to change data in place (without a dump-restore cycle)? We already have an automated backups-to-staging restoration cycle (Postgres, base backups) and would like to use this tool for data masking in place.

    opened by andrsp 4
  • Add template references to other transformed values of the same row

    Possible syntax (look at tr):

    tables:
      - name: users
        rules:
          login:
            username: {}
          preferences:
            template:
              format: '{"some_complex_json": {"field": "{{tr.login}}"}}'
    

Tera also allows tr["login"].

The name tr means "transformed row". We could use tr_row or transformed_row instead (though that might be too long).

We could also use just row, but what if in the future we decide to expose the values of the original row too?

    opened by evgeniy-r 4
  • Error: No such file or directory (os error 2)

    I have tried running the following command

    ./bin/pg_datanymizer -f backup.sql -c ./test.yml postgresql://postgres:[email protected]:5432/postgres
    

    where the test.yml is

    tables:
      - name: Profile
        rules:
          firstName:
            first_name: {}
          lastName:
            last_name: {}
    

I have also tried running this on the following OS/environments:

    • macOS 12.4
    • Ubuntu 20.04
    • Node:16.15-alpine (docker engine, both on macOS 12.4 and Ubuntu 20.04)
    opened by marmik18 1
  • Problem with copying generated columns

The anonymizer throws an error when dealing with a table with generated columns:

    Error: db error: ERROR: column "tsv" is a generated column DETAIL: Generated columns cannot be used in COPY.

    pg_dump on its own works fine.

PostgreSQL and pg_dump version: 12.12. pg_datanymizer version: 0.6.0.

    opened by ruslan-kurbanov-jr 1
  • Support pg_restore

Hi, thanks so much for your great tool, it's very useful!

Do you plan to implement "archive" mode for pg_restore? When the database is big, it can take a lot of time to restore, and using the -j option of pg_restore (among other things) could be very beneficial for performance.

    Thanks !

    opened by fryck 1
  • #181 add tera filter sha256_hash

This PR adds a new Tera filter function for repeatable sha256 hashes with an optional salt.

This is not intended as a solution for highly confidential secrets; rather, it provides a way of having a value in the output that can be verified when the salt is known.

For highly confidential values, a more flexible hash should be used.

✓ Checklist:

    • [ ] This PR has been added to CHANGELOG.md (at the top of the list);
    • [X] Tests for the changes have been added (for bug fixes / features);
    • [ ] Docs have been added / updated (for bug fixes / features).
    opened by krysopath 4
  • "hash" transformer or "hash" tera filter

Hello, and thank you for this tool. It works pretty neatly. <3

My goal is to dump an anonymized version of sensitive data in such a way that the value can be verified, given the original value. It is a common use case for us to hmac(secret, value) or just sha(salt + value) to hide a field in a dump, or to log a pseudonymized value.

Generating random values works in many places, but we need either a hash transformer or a Tera filter capable of generating such hashes, which are 1 => 1 associations.

I went ahead and implemented a sha256 transformer, and then looked at Tera filters to do the same for comparison. (But this does not work as simply as the transformer trait.)

    I would like to open a PR and get some review done. What do you think?

    opened by krysopath 5