Commit

Merge branch 'main' into fix/random_email_parsing
wwoytenko committed Aug 17, 2024
2 parents 528b654 + b7f7cef commit 77777b5
Showing 21 changed files with 1,143 additions and 951 deletions.
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@ backward-compatible with existing PostgreSQL utilities.
functions. This ensures that the same input data will always produce the same output data. Almost every transformer
supports either the `random` or the `hash` engine, making it suitable for a wide range of use cases.
* **Dynamic parameters** - almost every transformer supports dynamic parameters, which let you parametrize the
transformer from a table column value. This is helpful for resolving functional dependencies
between columns and satisfying constraints.
* **Cross-platform** - Can be easily built and executed on any platform, thanks to its Go-based architecture,
which eliminates platform dependencies.
@@ -38,6 +38,8 @@ backward-compatible with existing PostgreSQL utilities.
to deliver results.
* **Provide variety of storages** - Greenmask offers a variety of storage options for local and remote data storage,
including directories and S3-like storage solutions.
* **Pgzip support for faster compression** - by setting `--pgzip`, Greenmask can speed up the dump and restoration
processes through parallel compression.

## Use Cases

938 changes: 0 additions & 938 deletions docs/commands.md

This file was deleted.

11 changes: 11 additions & 0 deletions docs/commands/delete.md
@@ -0,0 +1,11 @@
# delete command

Deletes the dump with the specified ID from the storage.

```text
greenmask --config=config.yml delete dumpId
```

```shell title="example"
greenmask --config config.yml delete 1723643249862
```
72 changes: 72 additions & 0 deletions docs/commands/dump.md
@@ -0,0 +1,72 @@
## dump command

The `dump` command operates in the following way:

1. Dumps the data from the source database.
2. Validates the data for potential issues.
3. Applies the defined transformations.
4. Stores the transformed data in the specified storage location.

Note that the `dump` command shares the same parameters and environment variables as `pg_dump`,
allowing you to configure the dump process as needed.

It supports most of the `pg_dump` flags, along with some extra flags for Greenmask-specific features.

```text title="Supported flags"
-b, --blobs include large objects in dump
-c, --clean clean (drop) database objects before recreating
-Z, --compress int compression level for compressed formats (default -1)
-C, --create include commands to create database in dump
-a, --data-only dump only the data, not the schema
-d, --dbname string database to dump (default "postgres")
--disable-dollar-quoting disable dollar quoting, use SQL standard quoting
--disable-triggers disable triggers during data-only restore
--enable-row-security enable row security (dump only content user has access to)
-E, --encoding string dump the data in encoding ENCODING
-N, --exclude-schema strings do NOT dump the specified schema(s)
-T, --exclude-table strings do NOT dump the specified table(s)
--exclude-table-data strings do NOT dump data for the specified table(s)
-e, --extension strings dump the specified extension(s) only
--extra-float-digits string override default setting for extra_float_digits
-f, --file string output file or directory name
-h, --host string database server host or socket directory (default "/var/run/postgres")
--if-exists use IF EXISTS when dropping objects
--include-foreign-data strings include data of foreign tables on foreign servers matching the pattern
-j, --jobs int use this many parallel jobs to dump (default 1)
--load-via-partition-root load partitions via the root table
--lock-wait-timeout int fail after waiting TIMEOUT for a table lock (default -1)
-B, --no-blobs exclude large objects in dump
--no-comments do not dump comments
-O, --no-owner string skip restoration of object ownership in plain-text format
-X, --no-privileges do not dump privileges (grant/revoke)
--no-publications do not dump publications
--no-security-labels do not dump security label assignments
--no-subscriptions do not dump subscriptions
--no-sync do not wait for changes to be written safely to disk
--no-synchronized-snapshots do not use synchronized snapshots in parallel jobs
--no-tablespaces do not dump tablespace assignments
--no-toast-compression do not dump TOAST compression methods
--no-unlogged-table-data do not dump unlogged table data
--pgzip use pgzip compression instead of gzip
-p, --port int database server port number (default 5432)
--quote-all-identifiers quote all identifiers, even if not key words
-n, --schema strings dump the specified schema(s) only
-s, --schema-only string dump only the schema, no data
--section string dump named section (pre-data, data, or post-data)
--serializable-deferrable wait until the dump can run without anomalies
--snapshot string use given snapshot for the dump
--strict-names require table and/or schema include patterns to match at least one entity each
-S, --superuser string superuser user name to use in plain-text format
-t, --table strings dump the specified table(s) only
--test string connect as specified database user (default "postgres")
--use-set-session-authorization use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
-U, --username string connect as specified database user (default "postgres")
-v, --verbose string verbose mode
```

### Pgzip compression

By default, Greenmask uses gzip compression when dumping data. In most cases this is quite slow, does not utilize all
available resources, and becomes a bottleneck for I/O operations. To speed up the dump process, you can use
the `--pgzip` flag to use pgzip compression instead of gzip. This method splits the data into blocks, which are
compressed in parallel, making it ideal for handling large volumes of data. The output remains a standard gzip file.
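
The block-parallel idea can be illustrated with a minimal Python sketch. This is not how Greenmask or the pgzip library is implemented. It assumes a simplified scheme in which every block becomes its own gzip member and the members are concatenated, which a standard gzip reader can still decode as one stream. The function name `parallel_gzip`, the block size, and the worker count are hypothetical.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, block_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress fixed-size blocks concurrently and concatenate the gzip members."""
    # Split the input into blocks; keep one empty block so empty input is still valid gzip.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the members are joined in the right sequence.
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


payload = b"greenmask " * 500_000
compressed = parallel_gzip(payload)

# A regular gzip reader decodes the concatenated members back into the original bytes.
restored = gzip.decompress(compressed)
```

Because the result is ordinary gzip data, tools that know nothing about the parallel writer can still read it, which is the property the paragraph above describes.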
36 changes: 36 additions & 0 deletions docs/commands/index.md
@@ -0,0 +1,36 @@
# Commands

## Introduction

```shell title="Greenmask available commands"
greenmask \
--log-format=[json|text] \
--log-level=[debug|info|error] \
--config=config.yml \
[dump|list-dumps|delete|list-transformers|show-transformer|restore|show-dump]
```

You can use the following commands within Greenmask:

* [list-transformers](list-transformers.md) — displays a list of available transformers along with their documentation
* [show-transformer](show-transformer.md) — displays information about the specified transformer
* [validate](validate.md) - performs a validation procedure by testing config, comparing transformed data, identifying
potential issues, and checking for schema changes.
* [dump](dump.md) — initiates the data dumping process
* [restore](restore.md) - restores data to the target database either by specifying a `dumpId` or using the latest available dump
* [list-dumps](list-dumps.md) - lists all available dumps stored in the system
* [show-dump](show-dump.md) - provides metadata information about a particular dump, offering insights into its structure and
attributes
* [delete](delete.md) — deletes a specific dump from the storage


For any of the commands mentioned above, you can include the following common flags:

* `--log-format` — specifies the desired format for log output, which can be either `json` or `text`. This parameter is
optional, with the default format set to `text`.
* `--log-level` — sets the desired level for log output, which can be one of `debug`, `info`, or `error`. This parameter
is optional, with the default log level being `info`.
* `--config` — requires the specification of a configuration file in YAML format. This configuration file is mandatory
for Greenmask to operate correctly.
* `--help` — displays comprehensive help information for Greenmask, providing guidance on its usage and available
commands.
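
When Greenmask is called from a script, the common flags can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of Greenmask) that only builds the argument list, placing the common flags before the subcommand as in the synopsis above and rejecting values outside the documented choices.

```python
ALLOWED_LOG_FORMATS = {"json", "text"}
ALLOWED_LOG_LEVELS = {"debug", "info", "error"}


def build_command(config, command, *args, log_format="text", log_level="info"):
    """Return the argument list for a greenmask call (hypothetical helper)."""
    if log_format not in ALLOWED_LOG_FORMATS:
        raise ValueError(f"unsupported log format: {log_format}")
    if log_level not in ALLOWED_LOG_LEVELS:
        raise ValueError(f"unsupported log level: {log_level}")
    # Common flags come first, then the subcommand and its own arguments.
    return [
        "greenmask",
        f"--log-format={log_format}",
        f"--log-level={log_level}",
        f"--config={config}",
        command,
        *args,
    ]


argv = build_command("config.yml", "restore", "latest", log_level="debug")
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the `greenmask` binary is on `PATH`.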
20 changes: 20 additions & 0 deletions docs/commands/list-dumps.md
@@ -0,0 +1,20 @@
## list-dumps command

The `list-dumps` command provides a list of all dumps stored in the storage. The list includes the following attributes:

* `ID` — the unique identifier of the dump, used for operations like `restore`, `delete`, and `show-dump`
* `DATE` — the date when the snapshot was created
* `DATABASE` — the name of the database associated with the dump
* `SIZE` — the original size of the dump
* `COMPRESSED SIZE` — the size of the dump after compression
* `DURATION` — the duration of the dump procedure
* `TRANSFORMED` — indicates whether the dump has been transformed
* `STATUS` — the status of the dump, which can be one of the following:
* `done` — the dump was completed successfully
* `unknown` or `failed` — the dump might be in progress or failed. Failed dumps are not deleted automatically.

Example of `list-dumps` output:
![list_dumps_screen.png](../assets/list_dumps_screen.png)



58 changes: 58 additions & 0 deletions docs/commands/list-transformers.md
@@ -0,0 +1,58 @@
## list-transformers command

The `list-transformers` command provides a list of all the allowed transformers, including both standard and advanced
transformers. This list can be helpful for searching for an appropriate transformer for your data transformation needs.

To show a list of available transformers, use the following command:

```shell
greenmask --config=config.yml list-transformers
```

Supported flags:

* `--format` - selects the output format, either `text` or `json`. The default is `text`.

Example of `list-transformers` output:

![list_transformers_screen.png](../assets/list_transformers_screen_2.png)

When using the `list-transformers` command, you receive a list of available transformers with essential information
about each of them. Below are the key parameters for each transformer:

* `NAME` — the name of the transformer
* `DESCRIPTION` — a brief description of what the transformer does
* `COLUMN PARAMETER NAME` — name of a column or columns affected by transformation
* `SUPPORTED TYPES` — list the supported value types

The JSON call `greenmask --config=config.yml list-transformers --format=json` has the same attributes:

```json title="JSON format output"
[
  {
    "name": "Cmd",
    "description": "Transform data via external program using stdin and stdout interaction",
    "parameters": [
      {
        "name": "columns",
        "supported_types": [
          "any"
        ]
      }
    ]
  },
  {
    "name": "Dict",
    "description": "Replace values matched by dictionary keys",
    "parameters": [
      {
        "name": "column",
        "supported_types": [
          "any"
        ]
      }
    ]
  }
]
```
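
The JSON output is convenient for scripting. As a minimal sketch, the Python snippet below parses a shortened copy of the sample above and maps each transformer name to its column parameter names. The helper name `summarize` is hypothetical, and the snippet assumes the real output keeps the `name` and `parameters` keys shown in the sample.

```python
import json

# Shortened copy of the sample output shown above.
sample = """
[
  {"name": "Cmd",
   "description": "Transform data via external program using stdin and stdout interaction",
   "parameters": [{"name": "columns", "supported_types": ["any"]}]},
  {"name": "Dict",
   "description": "Replace values matched by dictionary keys",
   "parameters": [{"name": "column", "supported_types": ["any"]}]}
]
"""


def summarize(raw: str) -> dict[str, list[str]]:
    """Map each transformer name to the names of its column parameters."""
    return {
        item["name"]: [param["name"] for param in item["parameters"]]
        for item in json.loads(raw)
    }


summary = summarize(sample)
```

In practice the `raw` string would come from the standard output of `greenmask --config=config.yml list-transformers --format=json`.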
116 changes: 116 additions & 0 deletions docs/commands/restore.md
@@ -0,0 +1,116 @@
## restore command

To perform a dump restoration with the provided dump ID, use the following command:

```shell
greenmask --config=config.yml restore DUMP_ID
```

Alternatively, to restore the latest completed dump, use the following command:

```shell
greenmask --config=config.yml restore latest
```

Note that the `restore` command shares the same parameters and environment variables as `pg_restore`,
allowing you to configure the restoration process as needed.

It supports most of the `pg_restore` flags, along with some extra flags for Greenmask-specific features.

```text title="Supported flags"
-c, --clean clean (drop) database objects before recreating
-C, --create create the target database
-a, --data-only restore only the data, no schema
-d, --dbname string connect to database name (default "postgres")
--disable-triggers disable triggers during data-only restore
--enable-row-security enable row security
-N, --exclude-schema strings do not restore objects in this schema
-e, --exit-on-error exit on error, default is to continue
-f, --file string output file name (- for stdout)
-P, --function strings restore named function
-h, --host string database server host or socket directory (default "/var/run/postgres")
--if-exists use IF EXISTS when dropping objects
-i, --index strings restore named index
--inserts restore data as INSERT commands, rather than COPY
-j, --jobs int use this many parallel jobs to restore (default 1)
--list-format string use table of contents in format of text, json or yaml (default "text")
--no-comments do not restore comments
--no-data-for-failed-tables do not restore data of tables that could not be created
-O, --no-owner string skip restoration of object ownership
-X, --no-privileges skip restoration of access privileges (grant/revoke)
--no-publications do not restore publications
--no-security-labels do not restore security labels
--no-subscriptions do not restore subscriptions
--no-table-access-method do not restore table access methods
--no-tablespaces do not restore tablespace assignments
--on-conflict-do-nothing add ON CONFLICT DO NOTHING to INSERT commands
--pgzip use pgzip decompression instead of gzip
-p, --port int database server port number (default 5432)
--restore-in-order restore tables in topological order, ensuring that dependent tables are not restored until the tables they depend on have been restored
-n, --schema strings restore only objects in this schema
-s, --schema-only restore only the schema, no data
--section string restore named section (pre-data, data, or post-data)
-1, --single-transaction restore as a single transaction
--strict-names require table and/or schema include patterns to match at least one entity each
-S, --superuser string superuser user name to use for disabling triggers
-t, --table strings restore named relation (table, view, etc.)
-T, --trigger strings restore named trigger
-L, --use-list string use table of contents from this file for selecting/ordering output
--use-set-session-authorization use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
-U, --username string connect as specified database user (default "postgres")
-v, --verbose string verbose mode
```

## Extra features

### Inserts and error handling

!!! warning

    Insert commands are a lot slower than `COPY` commands. Use this feature only when necessary.

By default, Greenmask restores data using the `COPY` command. If you prefer to restore data using `INSERT` commands, you can
use the `--inserts` flag. This flag allows you to manage errors that occur during the execution of INSERT commands. By
configuring an error and constraint [exclusion list in the config](../configuration.md#restoration-error-exclusion),
you can skip certain errors and continue inserting subsequent rows from the dump.

This can be useful when adding new records to an existing dump, but you don't want the process to stop if some records
already exist in the database or violate certain constraints.

Adding the `--on-conflict-do-nothing` flag makes Greenmask generate `INSERT` statements with the
`ON CONFLICT DO NOTHING` clause, similar to the original pg_dump option. However, this approach only works for unique
or exclusion constraints. If a foreign key is missing in the referenced table or any other constraint is violated, the
insertion will still fail. To handle these issues, you can define
an [exclusion list in the config](../configuration.md#restoration-error-exclusion).

```shell title="example with inserts and on conflict do nothing"
greenmask --config=config.yml restore DUMP_ID --inserts --on-conflict-do-nothing
```
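
The difference between the two failure modes can be reproduced in isolation. The sketch below uses Python's built-in SQLite driver as a stand-in for PostgreSQL, on the assumption that `ON CONFLICT DO NOTHING` behaves comparably for this purpose. The table names are made up for the illustration, and none of this is Greenmask code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE employees ("
    " id INTEGER PRIMARY KEY,"
    " department_id INTEGER NOT NULL REFERENCES departments (id))"
)
conn.execute("INSERT INTO departments (id) VALUES (1)")
conn.execute("INSERT INTO employees (id, department_id) VALUES (1, 1)")

# Duplicate primary key: the conflicting row is skipped without an error.
conn.execute(
    "INSERT INTO employees (id, department_id) VALUES (1, 1) ON CONFLICT DO NOTHING"
)
row_count = conn.execute("SELECT count(*) FROM employees").fetchone()[0]

# Missing parent row: ON CONFLICT does not cover foreign keys, so the insert still fails.
try:
    conn.execute(
        "INSERT INTO employees (id, department_id) VALUES (2, 99) ON CONFLICT DO NOTHING"
    )
    fk_error = None
except sqlite3.IntegrityError as exc:
    fk_error = str(exc)
```

The second kind of error is what the exclusion list in the config is meant to handle.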

### Restoration in topological order

By default, Greenmask restores tables in the order they are listed in the dump file. To restore tables in topological
order, use the `--restore-in-order` flag. This is particularly useful when your schema includes foreign key references and
you need to insert data in the correct order. Without this flag, you may encounter errors when inserting data into
tables with foreign key constraints.

!!! warning

    Greenmask cannot guarantee restoration in topological order when the schema contains cycles. The only way to restore
    tables with cyclic dependencies is to temporarily remove the foreign key constraint (to break the cycle), restore the
    data, and then re-add the foreign key constraint once the data restoration is complete.


If your database has cyclic dependencies, you will be notified, but the restoration will continue.

```text
2024-08-16T21:39:50+03:00 WRN cycle between tables is detected: cannot guarantee the order of restoration within cycle cycle=["public.employees","public.departments","public.projects","public.employees"]
```
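
To make the ordering and the cycle caveat concrete, here is a minimal Python sketch of a topological sort over foreign key dependencies (Kahn's algorithm). It is an illustration only and not Greenmask's implementation. The function name `restore_order` and the example tables are hypothetical, and the direction of the references in the cyclic example is assumed.

```python
from collections import deque


def restore_order(deps: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Return (tables in a safe restore order, tables that are in or behind a cycle).

    deps maps each table to the tables it references through foreign keys.
    """
    tables = set(deps) | {ref for refs in deps.values() for ref in refs}
    # Self-references are ignored: they do not affect the order between tables.
    pending = {t: set(deps.get(t, ())) - {t} for t in tables}
    dependents = {t: [] for t in tables}
    for table, refs in pending.items():
        for ref in refs:
            dependents[ref].append(table)

    # Start with tables that reference nothing; sorting keeps the result deterministic.
    ready = deque(sorted(t for t, refs in pending.items() if not refs))
    ordered = []
    while ready:
        table = ready.popleft()
        ordered.append(table)
        for child in sorted(dependents[table]):
            pending[child].discard(table)
            if not pending[child]:
                ready.append(child)

    done = set(ordered)
    unresolved = sorted(t for t in tables if t not in done)
    return ordered, unresolved


acyclic = {
    "public.customers": set(),
    "public.products": set(),
    "public.orders": {"public.customers"},
    "public.order_items": {"public.orders", "public.products"},
}
order, leftover = restore_order(acyclic)

cyclic = {
    "public.employees": {"public.departments"},
    "public.departments": {"public.projects"},
    "public.projects": {"public.employees"},
}
cyclic_order, cyclic_leftover = restore_order(cyclic)
```

In the acyclic case every referenced table precedes the tables that depend on it. In the cyclic case no table can be placed first, which is why the order inside a cycle cannot be guaranteed.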

### Pgzip decompression

By default, Greenmask uses gzip decompression to restore data. In most cases this is quite slow, does not utilize all
available resources, and becomes a bottleneck for I/O operations. To speed up the restoration process, you can use
the `--pgzip` flag to use pgzip decompression instead of gzip. This method splits the data into blocks, which are
decompressed in parallel, making it ideal for handling large volumes of data. The dump files themselves remain standard gzip files.
