From 1ed0708775d9d8d93443ce005057fc318d1114e0 Mon Sep 17 00:00:00 2001
From: Vadim Voitenko
Date: Fri, 16 Aug 2024 15:36:25 +0300
Subject: [PATCH 1/4] doc: added pgzip support description

---
 README.md     | 4 +++-
 docs/index.md | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1664e6f3..10b83eb1 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ backward-compatible with existing PostgreSQL utilities.
   functions. This ensures that the same input data will always produce the same output data. Almost each transformer
   supports either `random` or `hash` engine making it universal for any use case.
 * **Dynamic parameters** — almost each transformer supports dynamic parameters, allowing to parametrize the
-  transformer dynamically from the table column value. This is helpful for resolving the functional dependencies
+  transformer dynamically from the table column value. This is helpful for resolving the functional dependencies
   between columns and satisfying the constraints.
 * **Cross-platform** - Can be easily built and executed on any platform, thanks to its Go-based architecture,
   which eliminates platform dependencies.
@@ -38,6 +38,8 @@ backward-compatible with existing PostgreSQL utilities.
   to deliver results.
 * **Provide variety of storages** - Greenmask offers a variety of storage options for local and remote data storage,
   including directories and S3-like storage solutions.
+* **Pgzip support for faster compression** — by setting `--pgzip`, Greenmask can speed up the dump and restoration
+processes through parallel compression.

 ## Use Cases

diff --git a/docs/index.md b/docs/index.md
index db8012b4..cf88f057 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -44,6 +44,8 @@ obfuscation process remains fresh, predictable, and transparent.
   to deliver results.
 * **Provide variety of storages** — Greenmask offers a variety of storage options for local and remote data storage,
   including directories and S3-like storage solutions.
+* **Pgzip support for faster compression** — by setting `--pgzip`, Greenmask can speed up the dump and restoration
+processes through parallel compression.
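+
+A quick sketch of how the flag is used (the config path is an example, and the flag is assumed to apply to both
+`dump` and `restore`, as described above):
+
+```shell
+# Enable parallel gzip (pgzip) compression/decompression
+greenmask --config=config.yml dump --pgzip
+greenmask --config=config.yml restore latest --pgzip
+```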
 ## Use cases

From 72c064b95f90cf8b9803242fd420cb630497a047 Mon Sep 17 00:00:00 2001
From: Vadim Voitenko
Date: Fri, 16 Aug 2024 19:59:26 +0300
Subject: [PATCH 2/4] doc: added documentation for new features

* added documentation for new features
* restructured commands documentation format

---
 docs/commands.md                   |   4 +-
 docs/commands/delete.md            |  11 +
 docs/commands/dump.md              |  69 ++++++
 docs/commands/index.md             |  36 +++
 docs/commands/list-dumps.md        |  20 ++
 docs/commands/list-transformers.md |  58 +++++
 docs/commands/restore.md           | 102 +++++++++
 docs/commands/show-dump.md         | 341 +++++++++++++++++++++++++++++
 docs/commands/show-transformer.md  |  99 +++++++++
 docs/commands/validate.md          | 259 ++++++++++++++++++++++
 docs/configuration.md              |  66 +++++-
 mkdocs.yml                         |  11 +-
 12 files changed, 1073 insertions(+), 3 deletions(-)
 create mode 100644 docs/commands/delete.md
 create mode 100644 docs/commands/dump.md
 create mode 100644 docs/commands/index.md
 create mode 100644 docs/commands/list-dumps.md
 create mode 100644 docs/commands/list-transformers.md
 create mode 100644 docs/commands/restore.md
 create mode 100644 docs/commands/show-dump.md
 create mode 100644 docs/commands/show-transformer.md
 create mode 100644 docs/commands/validate.md

diff --git a/docs/commands.md b/docs/commands.md
index d5568232..a3d83b66 100644
--- a/docs/commands.md
+++ b/docs/commands.md
@@ -13,6 +13,8 @@ greenmask \
 You can use the following commands within Greenmask:

 * `dump` — initiates the data dumping process
+* `validate` — performs a validation procedure by testing the config, comparing transformed data, identifying
+potential issues, and checking for schema changes
 * `list-dumps` — lists all available dumps stored in the system
 * `delete` — deletes a specific dump from the storage
 * `list-transformers` — displays a list of available transformers along with their documentation
@@ -34,7 +36,7 @@ For any of the commands mentioned above, you can include the following common fl

 ## validate

-The `validate` command allows you to perform a validation procedure and compare data transformations.
+The `validate` command allows you to perform a validation procedure and compare transformed data.

 Below is a list of all supported flags for the `validate` command:

diff --git a/docs/commands/delete.md b/docs/commands/delete.md
new file mode 100644
index 00000000..47a39fe1
--- /dev/null
+++ b/docs/commands/delete.md
@@ -0,0 +1,11 @@
+# delete command
+
+Deletes a dump with a specific ID from the storage.
+
+```text
+greenmask --config=config.yml delete dumpId
+```
+
+```shell title="example"
+greenmask --config config.yml delete 1723643249862
+```
diff --git a/docs/commands/dump.md b/docs/commands/dump.md
new file mode 100644
index 00000000..95b8ffc1
--- /dev/null
+++ b/docs/commands/dump.md
@@ -0,0 +1,69 @@
+## dump command
+
The `dump` command operates in the following way:

1. Dumps the data from the source database.
2. Validates the data for potential issues.
3. Applies the defined transformations.
4. Stores the transformed data in the specified storage location.

Note that the `dump` command shares the same parameters and environment variables as `pg_dump`,
allowing you to configure the dump process as needed.

It supports mostly the same flags as the `pg_dump` utility, along with some extra flags for Greenmask-specific features.
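For example, a minimal invocation might look like the following sketch (the connection values are placeholders):

```shell
# Dump with 4 parallel jobs; host and database name are example values
greenmask --config=config.yml dump \
  --host=localhost \
  --dbname=source_db \
  --jobs=4
```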

```text title="Supported flags"
Usage:
  greenmask dump [flags]

Flags:
  -b, --blobs                           include large objects in dump
  -c, --clean                           clean (drop) database objects before recreating
  -Z, --compress int                    compression level for compressed formats (default -1)
  -C, --create                          include commands to create database in dump
  -a, --data-only                       dump only the data, not the schema
  -d, --dbname string                   database to dump (default "postgres")
      --disable-dollar-quoting          disable dollar quoting, use SQL standard quoting
      --disable-triggers                disable triggers during data-only restore
      --enable-row-security             enable row security (dump only content user has access to)
  -E, --encoding string                 dump the data in encoding ENCODING
  -N, --exclude-schema strings          do NOT dump the specified schema(s)
  -T, --exclude-table strings           do NOT dump the specified table(s)
      --exclude-table-data strings      do NOT dump data for the specified table(s)
  -e, --extension strings               dump the specified extension(s) only
      --extra-float-digits string       override default setting for extra_float_digits
  -f, --file string                     output file or directory name
  -h, --host string                     database server host or socket directory (default "/var/run/postgres")
      --if-exists                       use IF EXISTS when dropping objects
      --include-foreign-data strings    include data of foreign tables on foreign servers matching the specified pattern(s)
  -j, --jobs int                        use this many parallel jobs to dump (default 1)
      --load-via-partition-root         load partitions via the root table
      --lock-wait-timeout int           fail after waiting TIMEOUT for a table lock (default -1)
  -B, --no-blobs                        exclude large objects in dump
      --no-comments                     do not dump comments
  -O, --no-owner string                 skip restoration of object ownership in plain-text format
  -X, --no-privileges                   do not dump privileges (grant/revoke)
      --no-publications                 do not dump publications
      --no-security-labels              do not dump security label assignments
      --no-subscriptions                do not dump subscriptions
      --no-sync                         do not wait for changes to be written safely to disk
      --no-synchronized-snapshots       do not use synchronized snapshots in parallel jobs
      --no-tablespaces                  do not dump tablespace assignments
      --no-toast-compression            do not dump TOAST compression methods
      --no-unlogged-table-data          do not dump unlogged table data
      --on-conflict-do-nothing          add ON CONFLICT DO NOTHING to INSERT commands
  -p, --port int                        database server port number (default 5432)
      --quote-all-identifiers           quote all identifiers, even if not key words
  -n, --schema strings                  dump the specified schema(s) only
  -s, --schema-only string              dump only the schema, no data
      --section string                  dump named section (pre-data, data, or post-data)
      --serializable-deferrable         wait until the dump can run without anomalies
      --snapshot string                 use given snapshot for the dump
      --strict-names                    require table and/or schema include patterns to match at least one entity each
  -S, --superuser string                superuser user name to use in plain-text format
  -t, --table strings                   dump the specified table(s) only
      --test string                     connect as specified database user (default "postgres")
      --use-set-session-authorization   use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
  -U, --username string                 connect as specified database user (default "postgres")
  -v, --verbose string                  verbose mode
```
\ No newline at end of file
diff --git a/docs/commands/index.md b/docs/commands/index.md
new file mode 100644
index 00000000..e50a1c1d
--- /dev/null
+++ b/docs/commands/index.md
@@ -0,0 +1,36 @@
+# Commands
+
+## Introduction
+
```shell title="Greenmask available commands"
greenmask \
--log-format=[json|text] \
--log-level=[debug|info|error] \
--config=config.yml \
[dump|list-dumps|delete|list-transformers|show-transformer|restore|show-dump]
```

You can use the following commands within Greenmask:

* [list-transformers](list-transformers.md) — displays a list of available transformers along with their documentation
* [show-transformer](show-transformer.md) — displays information about the specified transformer
* [validate](validate.md) — performs a validation procedure by testing the config, comparing transformed data,
identifying potential issues, and checking for schema changes
* [dump](dump.md) — initiates the data dumping process
* [restore](restore.md) — restores data to the target database either by specifying a `dumpId` or using the latest available dump
* [list-dumps](list-dumps.md) — lists all available dumps stored in the system
* [show-dump](show-dump.md) — provides metadata information about a particular dump, offering insights into its structure and
  attributes
* [delete](delete.md) — deletes a specific dump from the storage


For any of the commands mentioned above, you can include the following common flags:

* `--log-format` — specifies the desired format for log output, which can be either `json` or `text`. This parameter is
optional, with the default format set to `text`.
* `--log-level` — sets the desired level for log output, which can be one of `debug`, `info`, or `error`. This parameter
is optional, with the default log level being `info`.
* `--config` — requires the specification of a configuration file in YAML format. This configuration file is mandatory
for Greenmask to operate correctly.
* `--help` — displays comprehensive help information for Greenmask, providing guidance on its usage and available
commands.
diff --git a/docs/commands/list-dumps.md b/docs/commands/list-dumps.md
new file mode 100644
index 00000000..7bb2b147
--- /dev/null
+++ b/docs/commands/list-dumps.md
@@ -0,0 +1,20 @@
+## list-dumps command
+
The `list-dumps` command provides a list of all dumps stored in the storage. The list includes the following attributes:

* `ID` — the unique identifier of the dump, used for operations like `restore`, `delete`, and `show-dump`
* `DATE` — the date when the snapshot was created
* `DATABASE` — the name of the database associated with the dump
* `SIZE` — the original size of the dump
* `COMPRESSED SIZE` — the size of the dump after compression
* `DURATION` — the duration of the dump procedure
* `TRANSFORMED` — indicates whether the dump has been transformed
* `STATUS` — the status of the dump, which can be one of the following:
    * `done` — the dump was completed successfully
    * `unknown` or `failed` — the dump might be in progress or failed. Failed dumps are not deleted automatically.

Example of `list-dumps` output:
![list_dumps_screen.png](../assets/list_dumps_screen.png)



diff --git a/docs/commands/list-transformers.md b/docs/commands/list-transformers.md
new file mode 100644
index 00000000..8f7ed30d
--- /dev/null
+++ b/docs/commands/list-transformers.md
@@ -0,0 +1,58 @@
+## list-transformers command
+
The `list-transformers` command provides a list of all available transformers, including both standard and advanced
transformers. This list can help you find an appropriate transformer for your data transformation needs.

To show a list of available transformers, use the following command:

```shell
greenmask --config=config.yml list-transformers
```

Supported flags:

* `--format` — allows you to select the output format. There are two options available: `text` or `json`. The
  default setting is `text`.

Example of `list-transformers` output:

![list_transformers_screen.png](../assets/list_transformers_screen_2.png)

When using the `list-transformers` command, you receive a list of available transformers with essential information
about each of them. Below are the key parameters for each transformer:

* `NAME` — the name of the transformer
* `DESCRIPTION` — a brief description of what the transformer does
* `COLUMN PARAMETER NAME` — the name of the column or columns affected by the transformation
* `SUPPORTED TYPES` — lists the supported value types

The JSON output of `greenmask --config=config.yml list-transformers --format=json` contains the same attributes:

```json title="JSON format output"
[
  {
    "name": "Cmd",
    "description": "Transform data via external program using stdin and stdout interaction",
    "parameters": [
      {
        "name": "columns",
        "supported_types": [
          "any"
        ]
      }
    ]
  },
  {
    "name": "Dict",
    "description": "Replace values matched by dictionary keys",
    "parameters": [
      {
        "name": "column",
        "supported_types": [
          "any"
        ]
      }
    ]
  }
]
```
\ No newline at end of file
diff --git a/docs/commands/restore.md b/docs/commands/restore.md
new file mode 100644
index 00000000..2937c1ab
--- /dev/null
+++ b/docs/commands/restore.md
@@ -0,0 +1,102 @@
+## restore command
+
To perform a dump restoration with the provided dump ID, use the following command:

```shell
greenmask --config=config.yml restore DUMP_ID
```

Alternatively, to restore the latest completed dump, use the following command:

```shell
greenmask --config=config.yml restore latest
```

Note that the `restore` command shares the same parameters and environment variables as `pg_restore`,
allowing you to configure the restoration process as needed.

It supports mostly the same flags as the `pg_restore` utility, along with some extra flags for Greenmask-specific features.
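For instance, the following sketch restores the latest dump with several parallel jobs (the connection string is a
placeholder):

```shell
# Restore the latest dump with 4 parallel jobs into an example target database
greenmask --config=config.yml restore latest \
  --dbname=postgresql://postgres:example@localhost:5432/transformed \
  --jobs=4
```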

```text title="Supported flags"
Flags:
  -c, --clean                           clean (drop) database objects before recreating
  -C, --create                          create the target database
  -a, --data-only                       restore only the data, no schema
  -d, --dbname string                   connect to database name (default "postgres")
      --disable-triggers                disable triggers during data-only restore
      --enable-row-security             enable row security
  -N, --exclude-schema strings          do not restore objects in this schema
  -e, --exit-on-error                   exit on error, default is to continue
  -f, --file string                     output file name (- for stdout)
  -P, --function strings                restore named function
  -h, --host string                     database server host or socket directory (default "/var/run/postgres")
      --if-exists                       use IF EXISTS when dropping objects
  -i, --index strings                   restore named index
      --inserts                         restore data as INSERT commands, rather than COPY
  -j, --jobs int                        use this many parallel jobs to restore (default 1)
      --list-format string              use table of contents in format of text, json or yaml (default "text")
      --no-comments                     do not restore comments
      --no-data-for-failed-tables       do not restore data of tables that could not be created
  -O, --no-owner string                 skip restoration of object ownership
  -X, --no-privileges                   skip restoration of access privileges (grant/revoke)
      --no-publications                 do not restore publications
      --no-security-labels              do not restore security labels
      --no-subscriptions                do not restore subscriptions
      --no-table-access-method          do not restore table access methods
      --no-tablespaces                  do not restore tablespace assignments
      --on-conflict-do-nothing          add ON CONFLICT DO NOTHING to INSERT commands
  -p, --port int                        database server port number (default 5432)
      --restore-in-order                restore tables in topological order, ensuring that dependent tables are not restored until the tables they depend on have been restored
  -n, --schema strings                  restore only objects in this schema
  -s, --schema-only                     restore only the schema, no data
      --section string                  restore named section (pre-data, data, or post-data)
  -1, --single-transaction              restore as a single transaction
      --strict-names                    require table and/or schema include patterns to match at least one entity each
  -S, --superuser string                superuser user name to use for disabling triggers
  -t, --table strings                   restore named relation (table, view, etc.)
  -T, --trigger strings                 restore named trigger
  -L, --use-list string                 use table of contents from this file for selecting/ordering output
      --use-set-session-authorization   use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
  -U, --username string                 connect as specified database user (default "postgres")
  -v, --verbose string                  verbose mode
```

## Extra features

### Inserts and error handling

!!! warning

    `INSERT` commands are significantly slower than `COPY` commands. Use this feature only when necessary.

By default, Greenmask restores data using the `COPY` command. If you prefer to restore data using `INSERT` commands, you can
use the `--inserts` flag. This flag allows you to manage errors that occur during the execution of `INSERT` commands. By
configuring an error and constraint [exclusion list in the config](../configuration.md#restoration-error-exclusion),
you can skip certain errors and continue inserting subsequent rows from the dump.

This can be useful when restoring records into a database that already contains data, where you don't want the process
to stop if some records already exist or violate certain constraints.
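As a brief sketch, an exclusion list that skips unique-violation errors globally might look like this (the error code
is an example; the full syntax is described in the configuration section linked above):

```yaml
# Skip PostgreSQL unique-violation errors (SQLSTATE 23505) during INSERT restoration
insert_error_exclusions:
  global:
    error_codes: ["23505"]
```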

Adding the `--on-conflict-do-nothing` flag generates `INSERT` statements with the `ON CONFLICT DO NOTHING`
clause, similar to the original `pg_dump` option. However, this approach only works for unique or exclusion constraints.
If a foreign key is missing in the referenced table or any other constraint is violated, the insertion will still fail.
To handle these issues, you can define
an [exclusion list in the config](../configuration.md#restoration-error-exclusion).

```shell title="example with inserts and on conflict do nothing"
greenmask --config=config.yml restore DUMP_ID --inserts --on-conflict-do-nothing
```

### Restoration in topological order

By default, Greenmask restores tables in the order they are listed in the dump file. To restore tables in topological
order, use the `--restore-in-order` flag. This is particularly useful when your schema includes foreign key references and
you need to insert data in the correct order. Without this flag, you may encounter errors when inserting data into
tables with foreign key constraints.

!!! warning

    Greenmask cannot guarantee restoration in topological order when the schema contains cycles. The only way to restore
    tables with cyclic dependencies is to temporarily remove the foreign key constraint (to break the cycle), restore the
    data, and then re-add the foreign key constraint once the data restoration is complete.

diff --git a/docs/commands/show-dump.md b/docs/commands/show-dump.md
new file mode 100644
index 00000000..4e3d466b
--- /dev/null
+++ b/docs/commands/show-dump.md
@@ -0,0 +1,341 @@
+## show-dump command
+
This command provides details about all objects and data that can be restored, similar to the `pg_restore -l` command in
PostgreSQL. It helps you inspect the contents of the dump before performing the actual restoration.

Parameters:

* `--format` — the output format. Can be `text` or `json`.
+ +To display metadata information about a dump, use the following command: + +```shell +greenmask --config=config.yml show-dump dumpID +``` + +=== "Text output example" +```text +; +; Archive created at 2023-10-30 12:52:38 UTC +; dbname: demo +; TOC Entries: 17 +; Compression: -1 +; Dump Version: 15.4 +; Format: DIRECTORY +; Integer: 4 bytes +; Offset: 8 bytes +; Dumped from database version: 15.4 +; Dumped by pg_dump version: 15.4 +; +; +; Selected TOC Entries: +; +3444; 0 0 ENCODING - ENCODING +3445; 0 0 STDSTRINGS - STDSTRINGS +3446; 0 0 SEARCHPATH - SEARCHPATH +3447; 1262 24970 DATABASE - demo postgres +3448; 0 0 DATABASE PROPERTIES - demo postgres +222; 1259 24999 TABLE bookings flights postgres +223; 1259 25005 SEQUENCE bookings flights_flight_id_seq postgres +3460; 0 0 SEQUENCE OWNED BY bookings flights_flight_id_seq postgres +3281; 2604 25030 DEFAULT bookings flights flight_id postgres +3462; 0 24999 TABLE DATA bookings flights postgres +3289; 2606 25044 CONSTRAINT bookings flights flights_flight_no_scheduled_departure_key postgres +3291; 2606 25046 CONSTRAINT bookings flights flights_pkey postgres +3287; 1259 42848 INDEX bookings flights_aircraft_code_status_idx postgres +3292; 1259 42847 INDEX bookings flights_status_aircraft_code_idx postgres +3293; 2606 25058 FK CONSTRAINT bookings flights flights_aircraft_code_fkey postgres +3294; 2606 25063 FK CONSTRAINT bookings flights flights_arrival_airport_fkey postgres +3295; 2606 25068 FK CONSTRAINT bookings flights flights_departure_airport_fkey postgres +``` +=== "JSON output example" + + ```json linenums="1" + { + "startedAt": "2023-10-29T20:50:19.948017+02:00", // (1) + "completedAt": "2023-10-29T20:50:22.19333+02:00", // (2) + "originalSize": 4053842, // (3) + "compressedSize": 686557, // (4) + "transformers": [ // (5) + { + "Schema": "bookings", // (6) + "Name": "flights", // (7) + "Query": "", // (8) + "Transformers": [ // (9) + { + "Name": "RandomDate", // (10) + "Params": { // (11) + "column": "c2NoZWR1bGVkX2RlcGFydHVyZQ==", + "max": "MjAyMy0wMS0wMiAwMDowMDowMC4wKzAz", + "min": "MjAyMy0wMS0wMSAwMDowMDowMC4wKzAz" + } + } + ], + "ColumnsTypeOverride": null // (12) + } + ], + "header": { // (13) + "creationDate": "2023-10-29T20:50:20+02:00", + "dbName": "demo", + "tocEntriesCount": 15, + "dumpVersion": "16.0 (Homebrew)", + "format": "TAR", + "integer": 4, + "offset": 8, + "dumpedFrom": "16.0 (Debian 16.0-1.pgdg120+1)", + "dumpedBy": "16.0 (Homebrew)", + "tocFileSize": 8090, + "compression": 0 + }, + "entries": [ // (14) + { + "dumpId": 3416, + "databaseOid": 0, + "objectOid": 0, + "objectType": "ENCODING", + "schema": "", + "name": "ENCODING", + "owner": "", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": null + }, + { + "dumpId": 3417, + "databaseOid": 0, + "objectOid": 0, + "objectType": "STDSTRINGS", + "schema": "", + "name": "STDSTRINGS", + "owner": "", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": null + }, + { + "dumpId": 3418, + "databaseOid": 0, + "objectOid": 0, + "objectType": "SEARCHPATH", + "schema": "", + "name": "SEARCHPATH", + "owner": "", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": null + }, + { + "dumpId": 3419, + "databaseOid": 16384, + "objectOid": 1262, + "objectType": "DATABASE", + "schema": "", + "name": "demo", + "owner": "postgres", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": 
null + }, + { + "dumpId": 3420, + "databaseOid": 0, + "objectOid": 0, + "objectType": "DATABASE PROPERTIES", + "schema": "", + "name": "demo", + "owner": "postgres", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": null + }, + { + "dumpId": 222, + "databaseOid": 16414, + "objectOid": 1259, + "objectType": "TABLE", + "schema": "bookings", + "name": "flights", + "owner": "postgres", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": null + }, + { + "dumpId": 223, + "databaseOid": 16420, + "objectOid": 1259, + "objectType": "SEQUENCE", + "schema": "bookings", + "name": "flights_flight_id_seq", + "owner": "postgres", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 222 + ] + }, + { + "dumpId": 3432, + "databaseOid": 0, + "objectOid": 0, + "objectType": "SEQUENCE OWNED BY", + "schema": "bookings", + "name": "flights_flight_id_seq", + "owner": "postgres", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 223 + ] + }, + { + "dumpId": 3254, + "databaseOid": 16445, + "objectOid": 2604, + "objectType": "DEFAULT", + "schema": "bookings", + "name": "flights flight_id", + "owner": "postgres", + "section": "PreData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 223, + 222 + ] + }, + { + "dumpId": 3434, + "databaseOid": 16414, + "objectOid": 0, + "objectType": "TABLE DATA", + "schema": "\"bookings\"", + "name": "\"flights\"", + "owner": "\"postgres\"", + "section": "Data", + "originalSize": 4045752, + "compressedSize": 678467, + "fileName": "3434.dat.gz", + "dependencies": [] + }, + { + "dumpId": 3261, + "databaseOid": 16461, + "objectOid": 2606, + "objectType": "CONSTRAINT", + "schema": "bookings", + "name": "flights flights_flight_no_scheduled_departure_key", + "owner": "postgres", + "section": "PostData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 222, + 222 + ] + }, + { + "dumpId": 3263, + "databaseOid": 16463, + "objectOid": 2606, + "objectType": "CONSTRAINT", + "schema": "bookings", + "name": "flights flights_pkey", + "owner": "postgres", + "section": "PostData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 222 + ] + }, + { + "dumpId": 3264, + "databaseOid": 16477, + "objectOid": 2606, + "objectType": "FK CONSTRAINT", + "schema": "bookings", + "name": "flights flights_aircraft_code_fkey", + "owner": "postgres", + "section": "PostData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 222 + ] + }, + { + "dumpId": 3265, + "databaseOid": 16482, + "objectOid": 2606, + "objectType": "FK CONSTRAINT", + "schema": "bookings", + "name": "flights flights_arrival_airport_fkey", + "owner": "postgres", + "section": "PostData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 222 + ] + }, + { + "dumpId": 3266, + "databaseOid": 16487, + "objectOid": 2606, + "objectType": "FK CONSTRAINT", + "schema": "bookings", + "name": "flights flights_departure_airport_fkey", + "owner": "postgres", + "section": "PostData", + "originalSize": 0, + "compressedSize": 0, + "fileName": "", + "dependencies": [ + 222 + ] + } + ] + } + ``` + { .annotate } + + 1. The date when the backup has been initiated, also indicating the snapshot date. + 2. The date when the backup process was successfully completed. + 3. 
The original size of the backup in bytes.
    4. The size of the backup after compression in bytes.
    5. A list of tables that underwent transformation during the backup.
    6. The schema name of the table.
    7. The name of the table.
    8. Custom query override, if applicable.
    9. A list of transformers that were applied during the backup.
    10. The name of the transformer.
    11. The parameters provided for the transformer.
    12. A mapping of overridden column types.
    13. The header information in the table of contents file. This provides the same details as the `--format=text` output in the previous snippet.
    14. The list of restoration entries. This offers the same information as the `--format=text` output in the previous snippet.

!!! note

    The `json` format provides more detailed information compared to the `text` format. The `text` format is primarily used for backward compatibility and for generating a restoration list that can be used with `pg_restore -L listfile`. On the other hand, the `json` format provides comprehensive metadata about the dump, including information about the applied transformers and their parameters. The `json` format is especially useful for detailed dump introspection.
diff --git a/docs/commands/show-transformer.md b/docs/commands/show-transformer.md
new file mode 100644
index 00000000..1ea7bb53
--- /dev/null
+++ b/docs/commands/show-transformer.md
@@ -0,0 +1,99 @@
+## show-transformer command
+
This command prints out detailed information about a transformer by its name, including specific attributes to
help you understand and configure the transformer effectively.

To show detailed information about a transformer, use the following command:

```shell
greenmask --config=config.yml show-transformer TRANSFORMER_NAME
```

Supported flags:

* `--format` — allows you to select the output format. There are two options available: `text` or `json`. The
  default setting is `text`.

Example of `show-transformer` output:

![show_transformer.png](../assets/show_transformer.png)

When using the `show-transformer` command, you receive detailed information about the transformer, its parameters, and
their possible attributes. Below are the key parameters for each transformer:

* `Name` — the name of the transformer
* `Description` — a brief description of what the transformer does
* `Parameters` — a list of transformer parameters, each with its own set of attributes. Possible attributes include:

    * `description` — a brief description of the parameter's purpose
    * `required` — a flag indicating whether the parameter is required when configuring the transformer
    * `link_parameter` — specifies whether the value of the parameter will be encoded using a specific parameter type
      encoder. For example, if a parameter named `column` is linked to another parameter `start`, the `start`
      parameter's value will be encoded according to the `column` type when the transformer is initialized.
    * `cast_db_type` — indicates that the value should be encoded according to the database type. For example, when
      dealing with the INTERVAL data type, you must provide the interval value in PostgreSQL format.
    * `default_value` — the default value assigned to the parameter if it's not provided during configuration.
    * `column_properties` — if a parameter represents the name of a column, it may contain additional properties,
      including:
        * `nullable` — indicates whether the transformer may produce NULL values, potentially violating the NOT NULL
          constraint
        * `unique` — specifies whether the transformer guarantees unique values for each call. If set to `true`, it
          means that the transformer cannot produce duplicate values, ensuring compliance with the UNIQUE constraint.
        * `affected` — indicates whether the column is affected during the transformation process. If not affected, the
          column's value might still be required for transforming another column.
        * `allowed_types` — a list of data types that are compatible with this parameter
        * `skip_original_data` — specifies whether the original value of the column, before transformation, is relevant
          for the transformation process
        * `skip_on_null` — indicates whether the transformer should skip the transformation when the input column value
          is NULL. If the column value is NULL, interaction with the transformer is unnecessary.

!!! warning

    The default value in JSON format is base64 encoded. This might be changed in a later version of Greenmask.

```json title="JSON output example"
[
  {
    "properties": {
      "name": "NoiseFloat",
      "description": "Make noise float for int",
      "is_custom": false
    },
    "parameters": [
      {
        "name": "column",
        "description": "column name",
        "required": true,
        "is_column": true,
        "is_column_container": false,
        "column_properties": {
          "max_length": -1,
          "affected": true,
          "allowed_types": [
            "float4",
            "float8",
            "numeric"
          ],
          "skip_on_null": true
        }
      },
      {
        "name": "ratio",
        "description": "max random percentage for noise",
        "required": false,
        "is_column": false,
        "is_column_container": false,
        "default_value": "MC4x"
      },
      {
        "name": "decimal",
        "description": "decimal of noised float value (number of digits after coma)",
        "required": false,
        "is_column": false,
        "is_column_container": false,
        "default_value": "NA=="
      }
    ]
  }
]
```
diff --git a/docs/commands/validate.md b/docs/commands/validate.md
new file mode 100644
index 00000000..26692286
--- /dev/null
+++ b/docs/commands/validate.md
@@ -0,0 +1,259 @@
+# validate command
+
The `validate` command allows you to perform a validation procedure and compare transformed data.

Below is a list of all supported flags for the `validate` command:

```text title="Supported flags"
Usage:
  greenmask validate [flags]

Flags:
      --data                  Perform test dump for --rows-limit rows and print it pretty
      --diff                  Find difference between original and transformed data
      --format string         Format of output. possible values [text|json] (default "text")
      --rows-limit uint       Check tables dump only for specific tables (default 10)
      --schema                Make a schema diff between previous dump and the current state
      --table strings         Check tables dump only for specific tables
      --table-format string   Format of table output (only for --format=text). Possible values [vertical|horizontal] (default "vertical")
      --transformed-only      Print only transformed column and primary key
      --warnings              Print warnings
```

The `validate` command can exit with a non-zero code when:

* An error occurred
* It was called with the `--warnings` flag and warnings were found
* It was called with the `--schema` flag and schema differences were found

Any of these cases can be used in CI/CD pipelines to stop the process when something goes wrong.
This is especially
useful when the `--schema` flag is used, as it helps avoid data leakage when the schema has changed.

You can use the `--table` flag multiple times to specify the tables you want to check. Tables can be written with
or without schema names (e.g., `public.table_name` or `table_name`). If you specify multiple tables from different
schemas, an error will be thrown.

To start validation, use the following command:

```shell
greenmask --config=config.yml validate \
  --warnings \
  --data \
  --diff \
  --schema \
  --format=text \
  --table-format=vertical \
  --transformed-only \
  --rows-limit=1
```

```text title="Validation output example"
2024-03-15T19:46:12+02:00 WRN ValidationWarning={"hash":"aa808fb574a1359c6606e464833feceb","meta":{"ColumnName":"birthdate","ConstraintDef":"CHECK (birthdate \u003e= '1930-01-01'::date AND birthdate \u003c= (now() - '18 years'::interval))","ConstraintName":"humanresources","ConstraintSchema":"humanresources","ConstraintType":"Check","ParameterName":"column","SchemaName":"humanresources","TableName":"employee","TransformerName":"NoiseDate"},"msg":"possible constraint violation: column has Check constraint","severity":"warning"}
```

The validation output will provide detailed information about potential constraint violations and schema issues. Each
line contains nested JSON data under the `ValidationWarning` key, offering insights into the affected part of the
configuration and potential constraint violations.

```json title="Pretty formatted validation warning"
{
  "hash": "aa808fb574a1359c6606e464833feceb", // (13)
  "meta": { // (1)
    "ColumnName": "birthdate", // (2)
    "ConstraintDef": "CHECK (birthdate >= '1930-01-01'::date AND birthdate <= (now() - '18 years'::interval))", // (3)
    "ConstraintName": "humanresources", // (4)
    "ConstraintSchema": "humanresources", // (5)
    "ConstraintType": "Check", // (6)
    "ParameterName": "column", // (7)
    "SchemaName": "humanresources", // (8)
    "TableName": "employee", // (9)
    "TransformerName": "NoiseDate" // (10)
  },
  "msg": "possible constraint violation: column has Check constraint", // (11)
  "severity": "warning" // (12)
}
```
{ .annotate }

1. **Detailed metadata**. The validation output provides comprehensive metadata to pinpoint the source of problems.
2. **Column name** indicates the name of the affected column.
3. **Constraint definition** specifies the definition of the constraint that may be violated.
4. **Constraint name** identifies the name of the constraint that is potentially violated.
5. **Constraint schema name** indicates the schema in which the constraint is defined.
6. **Type of constraint** represents the type of constraint and can be one of the following:
    ```
    * ForeignKey
    * Check
    * NotNull
    * PrimaryKey
    * PrimaryKeyReferences
    * Unique
    * Length
    * Exclusion
    * TriggerConstraint
    ```
7. **Name of affected parameter** is typically the name of the column parameter that is relevant to the
   validation warning.
8. **Table schema name** specifies the schema name of the affected table.
9. **Table name** identifies the name of the table where the problem occurs.
10. **Transformer name** indicates the name of the transformer responsible for the transformation.
11. **Validation warning description** provides a detailed description of the validation warning and the reason behind
    it.
12. **Severity of validation warning** indicates the severity level of the validation warning and can be one of the
    following:
    ```
    * error
    * warning
    * info
    * debug
    ```
13. **Hash** is a unique identifier of the validation warning. It is used to resolve the warning in the config file.

!!! note

    A validation warning with a severity level of `"error"` is considered critical and must be addressed before the dump operation can proceed. Failure to resolve such warnings will prevent the dump operation from being executed.

```text title="Schema diff changed output example"
2024-03-15T19:46:12+02:00 WRN Database schema has been changed Hint="Check schema changes before making new dump" PreviousDumpId=1710520855501
2024-03-15T19:46:12+02:00 WRN Column renamed Event=ColumnRenamed Signature={"CurrentColumnName":"id1","PreviousColumnName":"id","TableName":"test","TableSchema":"public"}
2024-03-15T19:46:12+02:00 WRN Column type changed Event=ColumnTypeChanged Signature={"ColumnName":"id","CurrentColumnType":"bigint","CurrentColumnTypeOid":"20","PreviousColumnType":"integer","PreviousColumnTypeOid":"23","TableName":"test","TableSchema":"public"}
2024-03-15T19:46:12+02:00 WRN Column created Event=ColumnCreated Signature={"ColumnName":"name","ColumnType":"text","TableName":"test","TableSchema":"public"}
2024-03-15T19:46:12+02:00 WRN Table created Event=TableCreated Signature={"SchemaName":"public","TableName":"test1","TableOid":"20563"}
```

Example of validation diff:

![img.png](../assets/validate_horizontal_diff.png)

The validation diff is presented in a neatly formatted table. In this table:

* Columns that are affected by the transformation are highlighted with a red background.
* The pre-transformation values are displayed in green.
* The post-transformation values are shown in red.
* The result in `--format=text` can be displayed in either horizontal (`--table-format=horizontal`) or
  vertical (`--table-format=vertical`) format, making it easy to visualize and understand the
  differences between the original and transformed data.

The whole `validate` command may be run in JSON format, including logging, which makes the output easy to parse.

```shell
greenmask --config=config.yml validate \
  --warnings \
  --data \
  --diff \
  --schema \
  --format=json \
  --table-format=vertical \
  --transformed-only \
  --rows-limit=1 \
  --log-format=json
```

The resulting JSON objects:

=== "The validation warning"

    ```json
    {
      "level": "warn",
      "ValidationWarning": {
        "msg": "possible constraint violation: column has Check constraint",
        "severity": "warning",
        "meta": {
          "ColumnName": "birthdate",
          "ConstraintDef": "CHECK (birthdate >= '1930-01-01'::date AND birthdate <= (now() - '18 years'::interval))",
          "ConstraintName": "humanresources",
          "ConstraintSchema": "humanresources",
          "ConstraintType": "Check",
          "ParameterName": "column",
          "SchemaName": "humanresources",
          "TableName": "employee",
          "TransformerName": "NoiseDate"
        },
        "hash": "aa808fb574a1359c6606e464833feceb"
      },
      "time": "2024-03-15T20:01:51+02:00"
    }
    ```

=== "Schema diff events"

    ```json
    {
      "level": "warn",
      "PreviousDumpId": "1710520855501",
      "Diff": [
        {
          "event": "ColumnRenamed",
          "signature": {
            "CurrentColumnName": "id1",
            "PreviousColumnName": "id",
            "TableName": "test",
            "TableSchema": "public"
          }
        },
        {
          "event": "ColumnTypeChanged",
          "signature": {
            "ColumnName": "id",
            "CurrentColumnType": "bigint",
            "CurrentColumnTypeOid": "20",
            "PreviousColumnType": "integer",
            "PreviousColumnTypeOid": "23",
            "TableName": "test",
            "TableSchema": "public"
          }
        },
        {
          "event": "ColumnCreated",
          "signature": {
            "ColumnName": "name",
            "ColumnType": "text",
            "TableName": "test",
            "TableSchema": "public"
          }
        },
        {
          "event": "TableCreated",
          "signature": {
            "SchemaName": "public",
            "TableName": "test1",
            "TableOid": "20563"
          }
        }
      ],
      "Hint": "Check schema changes before making new dump",
      "time": "2024-03-15T20:01:51+02:00",
      "message": "Database schema has been changed"
    }
    ```

=== "Transformation diff line"

    ```json
    {
      "schema": "humanresources",
      "name": "employee",
      "primary_key_columns": [
        "businessentityid"
      ],
      "with_diff": true,
      "transformed_only": true,
      "records": [
        {
          "birthdate": {
            "original": "1969-01-29",
            "transformed": "1964-10-20",
            "equal": false,
            "implicit": true
          },
          "businessentityid": {
            "original": "1",
            "transformed": "1",
            "equal": true,
            "implicit": true
          }
        }
      ]
    }
    ```
diff --git a/docs/configuration.md b/docs/configuration.md
index 9c4cfe9b..4c7d14a1 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -248,6 +248,8 @@ In the `restore` section of the configuration, you can specify parameters for th
 * `query` — an SQL query string to be executed
 * `query_file` — the path to an SQL query file to be executed
 * `command` — a command with parameters to be executed. It is provided as a list, where the first item is the command name.
+* `insert_error_exclusions` — a list of error codes that should be ignored during the restoration process. This is
+useful when you want to skip specific errors that are not critical to the restoration.

As mentioned in [the architecture](architecture.md/#backing-up), a backup contains three sections: pre-data, data, and
post-data. The custom script execution allows you to customize and control the restoration process by executing scripts
or commands at specific stages.
The available restoration stages and their corresponding execution conditions are as follows:
@@ -260,7 +262,7 @@ Each stage can have a `"when"` condition with one of the following possible valu
 * `before` — execute the script or SQL command before the mentioned restoration stage
 * `after` — execute the script or SQL command after the mentioned restoration stage

-Below you can one of the possible versions for the `scripts` part of the `restore` section:
+Below you can find one of the possible versions for the `scripts` part of the `restore` section:

 ``` yaml title="scripts definition example"
 scripts:
@@ -302,6 +304,68 @@ scripts:
3. **List of post-data stage scripts**. This section contains scripts that are executed before or after the restoration of the post-data section. The scripts include SQL queries and query files.
4. **Command in the first argument and the parameters in the rest of the list**. When specifying a command to be executed in the scripts section, you provide the command name as the first item in a list, followed by any parameters or arguments for that command. The command and its parameters are provided as a list within the script configuration.

+### Restoration error exclusion
+
You can configure which errors to ignore during the restoration process by setting the `insert_error_exclusions`
parameter. This parameter can be applied globally or per table. If both global and table-specific settings are defined,
the table-specific settings will take precedence. Below is an example of how to configure the `insert_error_exclusions`
parameter. You can specify constraint names from your database schema or the error codes returned by PostgreSQL. You
can find the full list of [error codes in the PostgreSQL documentation](https://www.postgresql.org/docs/current/errcodes-appendix.html).

```yaml title="parameter definition"
insert_error_exclusions:

  global:
    error_codes: ["23505"] # (1)
    constraints: ["PK_ProductReview_ProductReviewID"] # (2)
  tables: # (3)
    - schema: "production"
      name: "productreview"
      constraints: ["PK_ProductReview_ProductReviewID"]
      error_codes: ["23505"]

```

1. A list of strings containing PostgreSQL error codes
2. A list of strings containing constraint names (applied globally)
3. A list of tables with their schema, name, constraints, and error codes


Here is an example configuration for the `restore` section:

```yaml
restore:
  scripts:
    pre-data: # (1)
      - name: "pre-data before script [1] with query"
        when: "before"
        query: "create table script_test(stage text)"

  insert_error_exclusions:
    tables:
      - schema: "production"
        name: "productreview"
        constraints:
          - "PK_ProductReview_ProductReviewID"
        error_codes:
          - "23505"
    global:
      error_codes:
        - "23505"

  pg_restore_options:
    jobs: 10
    exit-on-error: false
    dbname: "postgresql://postgres:example@localhost:54316/transformed"
    table:
      - "productreview"
    pgzip: true
    inserts: true
    on-conflict-do-nothing: true
    restore-in-order: true

```

 ## Environment variable configuration

 It's also possible to configure Greenmask through environment variables.
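For example, since the `dump` and `restore` commands honor the same environment variables as `pg_dump` and
`pg_restore`, connection settings can be supplied through the standard libpq variables (the values below are
placeholders):

```shell
# Standard libpq environment variables picked up during dump/restore
export PGHOST=localhost
export PGPORT=5432
export PGUSER=postgres
export PGPASSWORD=example
greenmask --config=config.yml dump
```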
diff --git a/mkdocs.yml b/mkdocs.yml
index 49b2847f..152ec551 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -50,7 +50,16 @@ nav:
       - Configuration: configuration.md
       - Architecture: architecture.md
       - Playground: playground.md
-      - Commands: commands.md
+      - Commands:
+          - commands/index.md
+          - list-transformers: commands/list-transformers.md
+          - show-transformer: commands/show-transformer.md
+          - validate: commands/validate.md
+          - dump: commands/dump.md
+          - list-dumps: commands/list-dumps.md
+          - show-dump: commands/show-dump.md
+          - restore: commands/restore.md
+          - delete: commands/delete.md
       - Database subset: database_subset.md
       - Transformers:
           - built_in_transformers/index.md

From 6ef8a7f736abdbab7032a988f869173f6f2eb82c Mon Sep 17 00:00:00 2001
From: Vadim Voitenko
Date: Fri, 16 Aug 2024 21:42:58 +0300
Subject: [PATCH 3/4] feat: added cycles verbosity for --restore-in-order

fixed the case when condensed graph is not built when subset is not provided

---
 docs/commands/restore.md                      |  6 ++++
 internal/db/postgres/cmd/dump.go              |  3 +-
 internal/db/postgres/cmd/restore.go           | 13 +++++++++
 internal/db/postgres/storage/metadata_json.go |  3 ++
 internal/db/postgres/subset/graph.go          | 29 +++++++++++++++++--
 internal/db/postgres/subset/set_queries.go    |  1 -
 6 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/docs/commands/restore.md b/docs/commands/restore.md
index 2937c1ab..081094b0 100644
--- a/docs/commands/restore.md
+++ b/docs/commands/restore.md
@@ -100,3 +100,9 @@ tables with foreign key constraints.
     tables with cyclic dependencies is to temporarily remove the foreign key constraint (to break the cycle), restore the
     data, and then re-add the foreign key constraint once the data restoration is complete.

+
+If your database has cyclic dependencies, you will be notified, but the restoration will continue.
+
+```text
+2024-08-16T21:39:50+03:00 WRN cycle between tables is detected: cannot guarantee the order of restoration within cycle cycle=["public.employees","public.departments","public.projects","public.employees"]
+```
diff --git a/internal/db/postgres/cmd/dump.go b/internal/db/postgres/cmd/dump.go
index a2765f99..e88ff12b 100644
--- a/internal/db/postgres/cmd/dump.go
+++ b/internal/db/postgres/cmd/dump.go
@@ -406,9 +406,10 @@ func (d *Dump) mergeAndWriteToc(ctx context.Context) error {
 }

 func (d *Dump) writeMetaData(ctx context.Context, startedAt, completedAt time.Time) error {
+	cycles := d.context.Graph.GetCycledTables()
 	metadata, err := storageDto.NewMetadata(
 		d.resultToc, d.tocFileSize, startedAt, completedAt, d.config.Dump.Transformation, d.dumpedObjectSizes,
-		d.context.DatabaseSchema, d.dumpDependenciesGraph, d.sortedTablesDumpIds,
+		d.context.DatabaseSchema, d.dumpDependenciesGraph, d.sortedTablesDumpIds, cycles,
 	)
 	if err != nil {
 		return fmt.Errorf("unable build metadata: %w", err)
diff --git a/internal/db/postgres/cmd/restore.go b/internal/db/postgres/cmd/restore.go
index a6ec034a..3315789b 100644
--- a/internal/db/postgres/cmd/restore.go
+++ b/internal/db/postgres/cmd/restore.go
@@ -537,12 +537,25 @@ func (r *Restore) prune() {
 	}
 }

+func (r *Restore) logWarningsIfHasCycles() {
+	if len(r.metadata.Cycles) == 0 {
+		return
+	}
+	for _, cycle := range r.metadata.Cycles {
+		log.Warn().
+			Strs("cycle", cycle).
+ Msg("cycle between tables is detected: cannot guarantee the order of restoration within cycle") + } +} + func (r *Restore) sortTocEntriesInTopoOrder() []*toc.Entry { res := make([]*toc.Entry, 0, len(r.tocObj.Entries)) preDataEnd := 0 postDataStart := 0 + r.logWarningsIfHasCycles() + // Find predata last index and postdata first index for idx, item := range r.tocObj.Entries { if item.Section == toc.SectionPreData { diff --git a/internal/db/postgres/storage/metadata_json.go b/internal/db/postgres/storage/metadata_json.go index d0a51b4a..15fc49bc 100644 --- a/internal/db/postgres/storage/metadata_json.go +++ b/internal/db/postgres/storage/metadata_json.go @@ -71,6 +71,7 @@ type Metadata struct { Entries []*Entry `yaml:"entries" json:"entries"` DependenciesGraph map[int32][]int32 `yaml:"dependencies_graph" json:"dependencies_graph"` DumpIdsOrder []int32 `yaml:"dump_ids_order" json:"dump_ids_order"` + Cycles [][]string `yaml:"cycles" json:"cycles"` } func NewMetadata( @@ -78,6 +79,7 @@ func NewMetadata( completedAt time.Time, transformers []*domains.Table, stats map[int32]ObjectSizeStat, databaseSchema []*toolkit.Table, dependenciesGraph map[int32][]int32, dumpIdsOrder []int32, + cycles [][]string, ) (*Metadata, error) { var format string @@ -168,6 +170,7 @@ func NewMetadata( DatabaseSchema: databaseSchema, DependenciesGraph: dependenciesGraph, DumpIdsOrder: dumpIdsOrder, + Cycles: cycles, Header: Header{ CreationDate: tocObj.Header.CrtmDateTime.Time(), DbName: *tocObj.Header.ArchDbName, diff --git a/internal/db/postgres/subset/graph.go b/internal/db/postgres/subset/graph.go index 1b41f281..8b9faecf 100644 --- a/internal/db/postgres/subset/graph.go +++ b/internal/db/postgres/subset/graph.go @@ -108,7 +108,7 @@ func NewGraph(ctx context.Context, tx pgx.Tx, tables []*entries.Table) (*Graph, edgeIdSequence++ } } - return &Graph{ + g := &Graph{ tables: tables, graph: graph, paths: make(map[int]*Path), @@ -116,7 +116,32 @@ func NewGraph(ctx context.Context, tx pgx.Tx, tables []*entries.Table) (*Graph, visited: make([]int, len(tables)), order: make([]int, 0), reversedGraph: reversedGraph, - }, nil + } + g.buildCondensedGraph() + return g, nil +} + +func (g *Graph) GetCycles() [][]*Edge { + var cycles [][]*Edge + for _, c := range g.scc { + if c.hasCycle() { + cycles = append(cycles, c.cycles...) 
+ } + } + return cycles +} + +func (g *Graph) GetCycledTables() (res [][]string) { + cycles := g.GetCycles() + for _, c := range cycles { + var tables []string + for _, e := range c { + tables = append(tables, fmt.Sprintf(`%s.%s`, e.from.table.Schema, e.from.table.Name)) + } + tables = append(tables, fmt.Sprintf(`%s.%s`, c[len(c)-1].to.table.Schema, c[len(c)-1].to.table.Name)) + res = append(res, tables) + } + return res } // findSubsetVertexes - finds the subset vertexes in the graph diff --git a/internal/db/postgres/subset/set_queries.go b/internal/db/postgres/subset/set_queries.go index 75f969da..9e97f163 100644 --- a/internal/db/postgres/subset/set_queries.go +++ b/internal/db/postgres/subset/set_queries.go @@ -1,7 +1,6 @@ package subset func SetSubsetQueries(graph *Graph) error { - graph.buildCondensedGraph() graph.findSubsetVertexes() for _, p := range graph.paths { if isPathForScc(p, graph) { From d7e670588b354eb9dc0080df1e25188928721c5a Mon Sep 17 00:00:00 2001 From: Vadim Voitenko Date: Sat, 17 Aug 2024 10:38:18 +0300 Subject: [PATCH 4/4] doc: added info about de/compression in restore and dump fixed links --- docs/commands.md | 940 -------------------------- docs/commands/dump.md | 15 +- docs/commands/restore.md | 10 +- docs/configuration.md | 8 +- docs/playground.md | 2 +- docs/release_notes/greenmask_0_1_5.md | 2 +- 6 files changed, 24 insertions(+), 953 deletions(-) delete mode 100644 docs/commands.md diff --git a/docs/commands.md b/docs/commands.md deleted file mode 100644 index 8ffd58ac..00000000 --- a/docs/commands.md +++ /dev/null @@ -1,940 +0,0 @@ -# Commands - -## Introduction - -```shell title="Greenmask available commands" -greenmask \ - --log-format=[json|text] \ - --log-level=[debug|info|error] \ - --config=config.yml \ - [dump|list-dumps|delete|list-transformers|show-transformer|restore|show-dump]` -``` - -You can use the following commands within Greenmask: - -* `dump` — initiates the data dumping process -* `validate` - performs a validation procedure by testing config, comparing transformed data, identifying potential -issues, and checking for schema changes. -* `list-dumps` — lists all available dumps stored in the system -* `delete` — deletes a specific dump from the storage -* `list-transformers` — displays a list of available transformers along with their documentation -* `show-transformer` — displays information about the specified transformer -* `restore` — restores data to the target database either by specifying a `dumpId` or using the latest available dump -* `show-dump` — provides metadata information about a particular dump, offering insights into its structure and - attributes - -For any of the commands mentioned above, you can include the following common flags: - -* `--log-format` — specifies the desired format for log output, which can be either `json` or `text`. This parameter is - optional, with the default format set to `text`. -* `--log-level` — sets the desired level for log output, which can be one of `debug`, `info`, or `error`. This parameter - is optional, with the default log level being `info`. -* `--config` — requires the specification of a configuration file in YAML format. This configuration file is mandatory - for Greenmask to operate correctly. -* `--help` — displays comprehensive help information for Greenmask, providing guidance on its usage and available - commands. - -## validate - -The `validate` command allows you to perform a validation procedure and compare transformed data. 
- -Below is a list of all supported flags for the `validate` command: - -```text title="Supported flags" -Usage: - greenmask validate [flags] - -Flags: - --data Perform test dump for --rows-limit rows and print it pretty - --diff Find difference between original and transformed data - --format string Format of output. possible values [text|json] (default "text") - --rows-limit uint Check tables dump only for specific tables (default 10) - --schema Make a schema diff between previous dump and the current state - --table strings Check tables dump only for specific tables - --table-format string Format of table output (only for --format=text). Possible values [vertical|horizontal] (default "vertical") - --transformed-only Print only transformed column and primary key - --warnings Print warnings -``` - -Validate command can exit with non-zero code when: - -* Any error occurred -* Validate was called with `--warnings` flag and there are warnings -* Validate was called with `--schema` flag and there are schema differences - -All of those cases may be used for CI/CD pipelines to stop the process when something went wrong. This is especially -useful when `--schema` flag is used - this allows to avoid data leakage when schema changed. - -You can use the `--table` flag multiple times to specify the tables you want to check. Tables can be written with -or without schema names (e. g., `public.table_name` or `table_name`). If you specify multiple tables from different -schemas, an error will be thrown. - -To start validation, use the following command: - -```shell -greenmask --config=config.yml validate \ - --warnings \ - --data \ - --diff \ - --schema \ - --format=text \ - --table-format=vertical \ - --transformed-only \ - --rows-limit=1 -``` - -```text title="Validation output example" -2024-03-15T19:46:12+02:00 WRN ValidationWarning={"hash":"aa808fb574a1359c6606e464833feceb","meta":{"ColumnName":"birthdate","ConstraintDef":"CHECK (birthdate \u003e= '1930-01-01'::date AND birthdate \u003c= (now() - '18 years'::interval))","ConstraintName":"humanresources","ConstraintSchema":"humanresources","ConstraintType":"Check","ParameterName":"column","SchemaName":"humanresources","TableName":"employee","TransformerName":"NoiseDate"},"msg":"possible constraint violation: column has Check constraint","severity":"warning"} -``` - -The validation output will provide detailed information about potential constraint violations and schema issues. Each -line contains nested JSON data under the `ValidationWarning` key, offering insights into the affected part of the -configuration and potential constraint violations. - -```json title="Pretty formatted validation warning" -{ - "hash": "aa808fb574a1359c6606e464833feceb", // (13) - "meta": { // (1) - "ColumnName": "birthdate", // (2) - "ConstraintDef": "CHECK (birthdate >= '1930-01-01'::date AND birthdate <= (now() - '18 years'::interval))", // (3) - "ConstraintName": "humanresources", // (4) - "ConstraintSchema": "humanresources", // (5) - "ConstraintType": "Check", // (6) - "ParameterName": "column", // (7) - "SchemaName": "humanresources", // (8) - "TableName": "employee", // (9) - "TransformerName": "NoiseDate" // (10) - }, - "msg": "possible constraint violation: column has Check constraint", // (11) - "severity": "warning" // (12) -} -``` -{ .annotate } - -1. **Detailed metadata**. The validation output provides comprehensive metadata to pinpoint the source of problems. -2. **Column name** indicates the name of the affected column. -3. 
-3. **Constraint definition** specifies the definition of the constraint that may be violated.
-4. **Constraint name** identifies the name of the constraint that is potentially violated.
-5. **Constraint schema name** indicates the schema in which the constraint is defined.
-6. **Type of constraint** represents the type of constraint and can be one of the following:
-    ```
-    * ForeignKey
-    * Check
-    * NotNull
-    * PrimaryKey
-    * PrimaryKeyReferences
-    * Unique
-    * Length
-    * Exclusion
-    * TriggerConstraint
-    ```
-7. **Name of affected parameter**. Typically, this is the name of the column parameter that is relevant to the
-   validation warning.
-8. **Table schema name** specifies the schema name of the affected table.
-9. **Table name** identifies the name of the table where the problem occurs.
-10. **Transformer name** indicates the name of the transformer responsible for the transformation.
-11. **Validation warning description** provides a detailed description of the validation warning and the reason behind
-    it.
-12. **Severity of validation warning** indicates the severity level of the validation warning and can be one of the
-    following:
-    ```
-    * error
-    * warning
-    * info
-    * debug
-    ```
-13. **Hash** is a unique identifier of the validation warning. It is used to resolve the warning in the config file.
-
-!!! note
-
-    A validation warning with a severity level of `"error"` is considered critical and must be addressed before the dump operation can proceed. Failure to resolve such warnings will prevent the dump operation from being executed.
-
-```text title="Schema diff changed output example"
-2024-03-15T19:46:12+02:00 WRN Database schema has been changed Hint="Check schema changes before making new dump" PreviousDumpId=1710520855501
-2024-03-15T19:46:12+02:00 WRN Column renamed Event=ColumnRenamed Signature={"CurrentColumnName":"id1","PreviousColumnName":"id","TableName":"test","TableSchema":"public"}
-2024-03-15T19:46:12+02:00 WRN Column type changed Event=ColumnTypeChanged Signature={"ColumnName":"id","CurrentColumnType":"bigint","CurrentColumnTypeOid":"20","PreviousColumnType":"integer","PreviousColumnTypeOid":"23","TableName":"test","TableSchema":"public"}
-2024-03-15T19:46:12+02:00 WRN Column created Event=ColumnCreated Signature={"ColumnName":"name","ColumnType":"text","TableName":"test","TableSchema":"public"}
-2024-03-15T19:46:12+02:00 WRN Table created Event=TableCreated Signature={"SchemaName":"public","TableName":"test1","TableOid":"20563"}
-```
-
-Example of validation diff:
-
-![img.png](assets/validate_horizontal_diff.png)
-
-The validation diff is presented in a neatly formatted table. In this table:
-
-* Columns that are affected by the transformation are highlighted with a red background.
-* The pre-transformation values are displayed in green.
-* The post-transformation values are shown in red.
-* The result in `--format=text` can be displayed in either horizontal (`--table-format=horizontal`) or
-  vertical (`--table-format=vertical`) format, making it easy to visualize and understand the
-  differences between the original and transformed data.
-
-The whole `validate` command may also be run in JSON format, including the logs, making the structure easy to parse:
- -```shell -greenmask --config=config.yml validate \ - --warnings \ - --data \ - --diff \ - --schema \ - --format=json \ - --table-format=vertical \ - --transformed-only \ - --rows-limit=1 \ - --log-format=json -``` - -The json object result - -=== "The validation warning" - - ```json - { - "level": "warn", - "ValidationWarning": { - "msg": "possible constraint violation: column has Check constraint", - "severity": "warning", - "meta": { - "ColumnName": "birthdate", - "ConstraintDef": "CHECK (birthdate >= '1930-01-01'::date AND birthdate <= (now() - '18 years'::interval))", - "ConstraintName": "humanresources", - "ConstraintSchema": "humanresources", - "ConstraintType": "Check", - "ParameterName": "column", - "SchemaName": "humanresources", - "TableName": "employee", - "TransformerName": "NoiseDate" - }, - "hash": "aa808fb574a1359c6606e464833feceb" - }, - "time": "2024-03-15T20:01:51+02:00" - } - ``` - -=== "Schema diff events" - - ```json - { - "level": "warn", - "PreviousDumpId": "1710520855501", - "Diff": [ - { - "event": "ColumnRenamed", - "signature": { - "CurrentColumnName": "id1", - "PreviousColumnName": "id", - "TableName": "test", - "TableSchema": "public" - } - }, - { - "event": "ColumnTypeChanged", - "signature": { - "ColumnName": "id", - "CurrentColumnType": "bigint", - "CurrentColumnTypeOid": "20", - "PreviousColumnType": "integer", - "PreviousColumnTypeOid": "23", - "TableName": "test", - "TableSchema": "public" - } - }, - { - "event": "ColumnCreated", - "signature": { - "ColumnName": "name", - "ColumnType": "text", - "TableName": "test", - "TableSchema": "public" - } - }, - { - "event": "TableCreated", - "signature": { - "SchemaName": "public", - "TableName": "test1", - "TableOid": "20563" - } - } - ], - "Hint": "Check schema changes before making new dump", - "time": "2024-03-15T20:01:51+02:00", - "message": "Database schema has been changed" - } - ``` - -=== "Transformation diff line" - - ```json - { - "schema": "humanresources", - "name": "employee", - "primary_key_columns": [ - "businessentityid" - ], - "with_diff": true, - "transformed_only": true, - "records": [ - { - "birthdate": { - "original": "1969-01-29", - "transformed": "1964-10-20", - "equal": false, - "implicit": true - }, - "businessentityid": { - "original": "1", - "transformed": "1", - "equal": true, - "implicit": true - } - } - ] - } - ``` - -## dump - -The `dump` command operates in the following way: - -1. Dumps the data from the source database. -2. Validates the data for potential issues. -3. Applies the defined transformations. -4. Stores the transformed data in the specified storage location. 
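In its simplest form, the command needs nothing beyond the configuration file that defines the source database, storage, and transformations (the config path below is illustrative):

```shell
# Dump, transform, and store the data using the settings from config.yml
greenmask --config=config.yml dump
```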
-
-```text title="Supported flags"
-Usage:
-  greenmask dump [flags]
-
-Flags:
-  -b, --blobs                           include large objects in dump
-  -c, --clean                           clean (drop) database objects before recreating
-  -Z, --compress int                    compression level for compressed formats (default -1)
-  -C, --create                          include commands to create database in dump
-  -a, --data-only                       dump only the data, not the schema
-  -d, --dbname string                   database to dump (default "postgres")
-      --disable-dollar-quoting          disable dollar quoting, use SQL standard quoting
-      --disable-triggers                disable triggers during data-only restore
-      --enable-row-security             enable row security (dump only content user has access to)
-  -E, --encoding string                 dump the data in encoding ENCODING
-  -N, --exclude-schema strings          do NOT dump the specified schema(s)
-  -T, --exclude-table strings           do NOT dump the specified table(s)
-      --exclude-table-data strings      do NOT dump data for the specified table(s)
-  -e, --extension strings               dump the specified extension(s) only
-      --extra-float-digits string       override default setting for extra_float_digits
-  -f, --file string                     output file or directory name
-  -h, --host string                     database server host or socket directory (default "/var/run/postgres")
-      --if-exists                       use IF EXISTS when dropping objects
-      --include-foreign-data strings    include data of foreign tables on foreign servers matching the given pattern(s)
-  -j, --jobs int                        use this many parallel jobs to dump (default 1)
-      --load-via-partition-root         load partitions via the root table
-      --lock-wait-timeout int           fail after waiting TIMEOUT for a table lock (default -1)
-  -B, --no-blobs                        exclude large objects in dump
-      --no-comments                     do not dump comments
-  -O, --no-owner string                 skip restoration of object ownership in plain-text format
-  -X, --no-privileges                   do not dump privileges (grant/revoke)
-      --no-publications                 do not dump publications
-      --no-security-labels              do not dump security label assignments
-      --no-subscriptions                do not dump subscriptions
-      --no-sync                         do not wait for changes to be written safely to disk
-      --no-synchronized-snapshots       do not use synchronized snapshots in parallel jobs
-      --no-tablespaces                  do not dump tablespace assignments
-      --no-toast-compression            do not dump TOAST compression methods
-      --no-unlogged-table-data          do not dump unlogged table data
-      --on-conflict-do-nothing          add ON CONFLICT DO NOTHING to INSERT commands
-  -p, --port int                        database server port number (default 5432)
-      --quote-all-identifiers           quote all identifiers, even if not key words
-  -n, --schema strings                  dump the specified schema(s) only
-  -s, --schema-only string              dump only the schema, no data
-      --section string                  dump named section (pre-data, data, or post-data)
-      --serializable-deferrable         wait until the dump can run without anomalies
-      --snapshot string                 use given snapshot for the dump
-      --strict-names                    require table and/or schema include patterns to match at least one entity each
-  -S, --superuser string                superuser user name to use in plain-text format
-  -t, --table strings                   dump the specified table(s) only
-      --test string                     connect as specified database user (default "postgres")
-      --use-set-session-authorization   use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
-  -U, --username string                 connect as specified database user (default "postgres")
-  -v, --verbose string                  verbose mode
-      --pgzip                           use pgzip compression instead of gzip
-```
-
-## list-dumps
-
-The `list-dumps` command provides a list of all dumps stored in the storage.
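To print the list, run the command with your configuration file (the path is illustrative); the columns it prints are described below:

```shell
greenmask --config=config.yml list-dumps
```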
The list includes the following attributes:
-
-* `ID` — the unique identifier of the dump, used for operations like `restore`, `delete`, and `show-dump`
-* `DATE` — the date when the snapshot was created
-* `DATABASE` — the name of the database associated with the dump
-* `SIZE` — the original size of the dump
-* `COMPRESSED SIZE` — the size of the dump after compression
-* `DURATION` — the duration of the dump procedure
-* `TRANSFORMED` — indicates whether the dump has been transformed
-* `STATUS` — the status of the dump, which can be one of the following:
-    * `done` — the dump was completed successfully
-    * `unknown` or `failed` — the dump might be in progress or failed. Failed dumps are not deleted automatically.
-
-Example of `list-dumps` output:
-![list_dumps_screen.png](assets/list_dumps_screen.png)
-
-## list-transformers
-
-The `list-transformers` command provides a list of all the allowed transformers, including both standard and advanced
-transformers. This list can be helpful for searching for an appropriate transformer for your data transformation needs.
-
-To show a list of available transformers, use the following command:
-
-```shell
-greenmask --config=config.yml list-transformers
-```
-
-Supported flags:
-
-* `--format` — allows you to select the output format. There are two options available: `text` or `json`. The
-  default setting is `text`.
-
-Example of `list-transformers` output:
-
-![list_transformers_screen.png](assets/list_transformers_screen_2.png)
-
-When using the `list-transformers` command, you receive a list of available transformers with essential information
-about each of them. Below are the key parameters for each transformer:
-
-* `NAME` — the name of the transformer
-* `DESCRIPTION` — a brief description of what the transformer does
-* `COLUMN PARAMETER NAME` — the name of a column or columns affected by the transformation
-* `SUPPORTED TYPES` — a list of the supported value types
-
-The JSON call `greenmask --config=config.yml list-transformers --format=json` returns the same attributes:
-
-```json title="JSON format output"
-[
-  {
-    "name": "Cmd",
-    "description": "Transform data via external program using stdin and stdout interaction",
-    "parameters": [
-      {
-        "name": "columns",
-        "supported_types": [
-          "any"
-        ]
-      }
-    ]
-  },
-  {
-    "name": "Dict",
-    "description": "Replace values matched by dictionary keys",
-    "parameters": [
-      {
-        "name": "column",
-        "supported_types": [
-          "any"
-        ]
-      }
-    ]
-  }
-]
-```
-
-## show-transformer
-
-This command prints out detailed information about a transformer by a provided name, including specific attributes to
-help you understand and configure the transformer effectively.
-
-To show detailed information about a transformer, use the following command:
-
-```shell
-greenmask --config=config.yml show-transformer TRANSFORMER_NAME
-```
-
-Supported flags:
-
-* `--format` — allows you to select the output format. There are two options available: `text` or `json`. The
-  default setting is `text`.
-
-Example of `show-transformer` output:
-
-![show_transformer.png](assets/show_transformer.png)
-
-When using the `show-transformer` command, you receive detailed information about the transformer, its parameters, and
-their possible attributes. Below are the key parameters for each transformer:
-
-* `Name` — the name of the transformer
-* `Description` — a brief description of what the transformer does
-* `Parameters` — a list of transformer parameters, each with its own set of attributes.
Possible attributes include: - - * `description` — a brief description of the parameter's purpose - * `required` — a flag indicating whether the parameter is required when configuring the transformer - * `link_parameter` — specifies whether the value of the parameter will be encoded using a specific parameter type - encoder. For example, if a parameter named `column` is linked to another parameter `start`, the `start` - parameter's value will be encoded according to the `column` type when the transformer is initialized. - * `cast_db_type` — indicates that the value should be encoded according to the database type. For example, when - dealing with the INTERVAL data type, you must provide the interval value in PostgreSQL format. - * `default_value` — the default value assigned to the parameter if it's not provided during configuration. - * `column_properties` — if a parameter represents the name of a column, it may contain additional properties, - including: - * `nullable` — indicates whether the transformer may produce NULL values, potentially violating the NOT NULL - constraint - * `unique` — specifies whether the transformer guarantees unique values for each call. If set to `true`, it - means that the transformer cannot produce duplicate values, ensuring compliance with the UNIQUE constraint. - * `affected` — indicates whether the column is affected during the transformation process. If not affected, the - column's value might still be required for transforming another column. - * `allowed_types` — a list of data types that are compatible with this parameter - * `skip_original_data` — specifies whether the original value of the column, before transformation, is relevant - for the transformation process - * `skip_on_null` — indicates whether the transformer should skip the transformation when the input column value - is NULL. If the column value is NULL, interaction with the transformer is unnecessary. - -!!! warning - - The default value in JSON format is base64 encoded. This might be changed in later version of Greenmask. - -```json title="JSON output example" -[ - { - "properties": { - "name": "NoiseFloat", - "description": "Make noise float for int", - "is_custom": false - }, - "parameters": [ - { - "name": "column", - "description": "column name", - "required": true, - "is_column": true, - "is_column_container": false, - "column_properties": { - "max_length": -1, - "affected": true, - "allowed_types": [ - "float4", - "float8", - "numeric" - ], - "skip_on_null": true - } - }, - { - "name": "ratio", - "description": "max random percentage for noise", - "required": false, - "is_column": false, - "is_column_container": false, - "default_value": "MC4x" - }, - { - "name": "decimal", - "description": "decimal of noised float value (number of digits after coma)", - "required": false, - "is_column": false, - "is_column_container": false, - "default_value": "NA==" - } - ] - } -] -``` - -## restore - -To perform a dump restoration with the provided dump ID, use the following command: - -```shell -greenmask --config=config.yml restore DUMP_ID -``` - -Alternatively, to restore the latest completed dump, use the following command: - -```shell -greenmask --config=config.yml restore latest -``` - -Note that the `restore` command shares the same parameters and environment variables as `pg_restore`, -allowing you to configure the restoration process as needed. 
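For instance, because the standard libpq environment variables are recognized, connection settings can be supplied through the environment instead of flags (the host and user below are placeholders):

```shell
# PGHOST and PGUSER are standard libpq variables; the values are examples.
PGHOST=replica.example.com PGUSER=postgres \
  greenmask --config=config.yml restore latest
```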
-
-```text title="Supported flags"
-Flags:
-  -c, --clean                        clean (drop) database objects before recreating
-  -C, --create                       create the target database
-  -a, --data-only                    restore only the data, no schema
-  -d, --dbname string                connect to database name (default "postgres")
-      --disable-triggers             disable triggers during data-only restore
-      --enable-row-security          enable row security
-  -N, --exclude-schema strings       do not restore objects in this schema
-  -e, --exit-on-error                exit on error, default is to continue
-  -f, --file string                  output file name (- for stdout)
-  -P, --function strings             restore named function
-  -h, --host string                  database server host or socket directory (default "/var/run/postgres")
-      --if-exists                    use IF EXISTS when dropping objects
-  -i, --index strings                restore named index
-  -j, --jobs int                     use this many parallel jobs to restore (default 1)
-      --list-format string           use table of contents in format of text, json or yaml (default "text")
-      --no-comments                  do not restore comments
-      --no-data-for-failed-tables    do not restore data of tables that could not be created
-  -O, --no-owner string              skip restoration of object ownership
-  -X, --no-privileges                skip restoration of access privileges (grant/revoke)
-      --no-publications              do not restore publications
-      --no-security-labels           do not restore security labels
-      --no-subscriptions             do not restore subscriptions
-      --no-table-access-method       do not restore table access methods
-      --no-tablespaces               do not restore tablespace assignments
-  -p, --port int                     database server port number (default 5432)
-  -n, --schema strings               restore only objects in this schema
-  -s, --schema-only string           restore only the schema, no data
-      --section string               restore named section (pre-data, data, or post-data)
-  -1, --single-transaction           restore as a single transaction
-      --strict-names                 require table and/or schema include patterns to match at least one entity each
-  -S, --superuser string             superuser user name to use for disabling triggers
-  -t, --table strings                restore named relation (table, view, etc.)
-  -T, --trigger strings              restore named trigger
-  -L, --use-list string              use table of contents from this file for selecting/ordering output
-      --use-set-session-authorization   use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
-  -U, --username string              connect as specified database user (default "postgres")
-  -v, --verbose string               verbose mode
-      --pgzip                        use pgzip decompression instead of gzip
-
-```
-
-## show-dump
-
-This command provides details about all objects and data that can be restored, similar to the `pg_restore -l` command in
-PostgreSQL. It helps you inspect the contents of the dump before performing the actual restoration.
-
-Parameters:
-
-* `--format` — format of printing. Can be `text` or `json`.
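Since the `text` format mirrors `pg_restore -l` output (see the note at the end of this section), it can be redirected into a list file and fed back to `restore` for selective restoration. A sketch, with `DUMP_ID` as a placeholder:

```shell
# Save the table of contents, edit it, then restore only the listed entries.
greenmask --config=config.yml show-dump DUMP_ID --format=text > listfile
greenmask --config=config.yml restore DUMP_ID --use-list=listfile
```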
- -To display metadata information about a dump, use the following command: - -```shell -greenmask --config=config.yml show-dump dumpID -``` - -=== "Text output example" -```text -; -; Archive created at 2023-10-30 12:52:38 UTC -; dbname: demo -; TOC Entries: 17 -; Compression: -1 -; Dump Version: 15.4 -; Format: DIRECTORY -; Integer: 4 bytes -; Offset: 8 bytes -; Dumped from database version: 15.4 -; Dumped by pg_dump version: 15.4 -; -; -; Selected TOC Entries: -; -3444; 0 0 ENCODING - ENCODING -3445; 0 0 STDSTRINGS - STDSTRINGS -3446; 0 0 SEARCHPATH - SEARCHPATH -3447; 1262 24970 DATABASE - demo postgres -3448; 0 0 DATABASE PROPERTIES - demo postgres -222; 1259 24999 TABLE bookings flights postgres -223; 1259 25005 SEQUENCE bookings flights_flight_id_seq postgres -3460; 0 0 SEQUENCE OWNED BY bookings flights_flight_id_seq postgres -3281; 2604 25030 DEFAULT bookings flights flight_id postgres -3462; 0 24999 TABLE DATA bookings flights postgres -3289; 2606 25044 CONSTRAINT bookings flights flights_flight_no_scheduled_departure_key postgres -3291; 2606 25046 CONSTRAINT bookings flights flights_pkey postgres -3287; 1259 42848 INDEX bookings flights_aircraft_code_status_idx postgres -3292; 1259 42847 INDEX bookings flights_status_aircraft_code_idx postgres -3293; 2606 25058 FK CONSTRAINT bookings flights flights_aircraft_code_fkey postgres -3294; 2606 25063 FK CONSTRAINT bookings flights flights_arrival_airport_fkey postgres -3295; 2606 25068 FK CONSTRAINT bookings flights flights_departure_airport_fkey postgres -``` -=== "JSON output example" - - ```json linenums="1" - { - "startedAt": "2023-10-29T20:50:19.948017+02:00", // (1) - "completedAt": "2023-10-29T20:50:22.19333+02:00", // (2) - "originalSize": 4053842, // (3) - "compressedSize": 686557, // (4) - "transformers": [ // (5) - { - "Schema": "bookings", // (6) - "Name": "flights", // (7) - "Query": "", // (8) - "Transformers": [ // (9) - { - "Name": "RandomDate", // (10) - "Params": { // (11) - "column": "c2NoZWR1bGVkX2RlcGFydHVyZQ==", - "max": "MjAyMy0wMS0wMiAwMDowMDowMC4wKzAz", - "min": "MjAyMy0wMS0wMSAwMDowMDowMC4wKzAz" - } - } - ], - "ColumnsTypeOverride": null // (12) - } - ], - "header": { // (13) - "creationDate": "2023-10-29T20:50:20+02:00", - "dbName": "demo", - "tocEntriesCount": 15, - "dumpVersion": "16.0 (Homebrew)", - "format": "TAR", - "integer": 4, - "offset": 8, - "dumpedFrom": "16.0 (Debian 16.0-1.pgdg120+1)", - "dumpedBy": "16.0 (Homebrew)", - "tocFileSize": 8090, - "compression": 0 - }, - "entries": [ // (14) - { - "dumpId": 3416, - "databaseOid": 0, - "objectOid": 0, - "objectType": "ENCODING", - "schema": "", - "name": "ENCODING", - "owner": "", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": null - }, - { - "dumpId": 3417, - "databaseOid": 0, - "objectOid": 0, - "objectType": "STDSTRINGS", - "schema": "", - "name": "STDSTRINGS", - "owner": "", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": null - }, - { - "dumpId": 3418, - "databaseOid": 0, - "objectOid": 0, - "objectType": "SEARCHPATH", - "schema": "", - "name": "SEARCHPATH", - "owner": "", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": null - }, - { - "dumpId": 3419, - "databaseOid": 16384, - "objectOid": 1262, - "objectType": "DATABASE", - "schema": "", - "name": "demo", - "owner": "postgres", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": 
null - }, - { - "dumpId": 3420, - "databaseOid": 0, - "objectOid": 0, - "objectType": "DATABASE PROPERTIES", - "schema": "", - "name": "demo", - "owner": "postgres", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": null - }, - { - "dumpId": 222, - "databaseOid": 16414, - "objectOid": 1259, - "objectType": "TABLE", - "schema": "bookings", - "name": "flights", - "owner": "postgres", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": null - }, - { - "dumpId": 223, - "databaseOid": 16420, - "objectOid": 1259, - "objectType": "SEQUENCE", - "schema": "bookings", - "name": "flights_flight_id_seq", - "owner": "postgres", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 222 - ] - }, - { - "dumpId": 3432, - "databaseOid": 0, - "objectOid": 0, - "objectType": "SEQUENCE OWNED BY", - "schema": "bookings", - "name": "flights_flight_id_seq", - "owner": "postgres", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 223 - ] - }, - { - "dumpId": 3254, - "databaseOid": 16445, - "objectOid": 2604, - "objectType": "DEFAULT", - "schema": "bookings", - "name": "flights flight_id", - "owner": "postgres", - "section": "PreData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 223, - 222 - ] - }, - { - "dumpId": 3434, - "databaseOid": 16414, - "objectOid": 0, - "objectType": "TABLE DATA", - "schema": "\"bookings\"", - "name": "\"flights\"", - "owner": "\"postgres\"", - "section": "Data", - "originalSize": 4045752, - "compressedSize": 678467, - "fileName": "3434.dat.gz", - "dependencies": [] - }, - { - "dumpId": 3261, - "databaseOid": 16461, - "objectOid": 2606, - "objectType": "CONSTRAINT", - "schema": "bookings", - "name": "flights flights_flight_no_scheduled_departure_key", - "owner": "postgres", - "section": "PostData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 222, - 222 - ] - }, - { - "dumpId": 3263, - "databaseOid": 16463, - "objectOid": 2606, - "objectType": "CONSTRAINT", - "schema": "bookings", - "name": "flights flights_pkey", - "owner": "postgres", - "section": "PostData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 222 - ] - }, - { - "dumpId": 3264, - "databaseOid": 16477, - "objectOid": 2606, - "objectType": "FK CONSTRAINT", - "schema": "bookings", - "name": "flights flights_aircraft_code_fkey", - "owner": "postgres", - "section": "PostData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 222 - ] - }, - { - "dumpId": 3265, - "databaseOid": 16482, - "objectOid": 2606, - "objectType": "FK CONSTRAINT", - "schema": "bookings", - "name": "flights flights_arrival_airport_fkey", - "owner": "postgres", - "section": "PostData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 222 - ] - }, - { - "dumpId": 3266, - "databaseOid": 16487, - "objectOid": 2606, - "objectType": "FK CONSTRAINT", - "schema": "bookings", - "name": "flights flights_departure_airport_fkey", - "owner": "postgres", - "section": "PostData", - "originalSize": 0, - "compressedSize": 0, - "fileName": "", - "dependencies": [ - 222 - ] - } - ] - } - ``` - { .annotate } - - 1. The date when the backup has been initiated, also indicating the snapshot date. - 2. The date when the backup process was successfully completed. - 3. 
The original size of the backup in bytes.
-    4. The size of the backup after compression in bytes.
-    5. A list of tables that underwent transformation during the backup.
-    6. The schema name of the table.
-    7. The name of the table.
-    8. Custom query override, if applicable.
-    9. A list of transformers that were applied during the backup.
-    10. The name of the transformer.
-    11. The parameters provided for the transformer.
-    12. A mapping of overridden column types.
-    13. The header information in the table of contents file. This provides the same details as the `--format=text` output in the previous snippet.
-    14. The list of restoration entries. This offers the same information as the `--format=text` output in the previous snippet.
-
-!!! note
-
-    The `json` format provides more detailed information compared to the `text` format. The `text` format is primarily used for backward compatibility and for generating a restoration list that can be used with `pg_restore -L listfile`. On the other hand, the `json` format provides comprehensive metadata about the dump, including information about the applied transformers and their parameters. The `json` format is especially useful for detailed dump introspection.
diff --git a/docs/commands/dump.md b/docs/commands/dump.md
index 95b8ffc1..9be7786e 100644
--- a/docs/commands/dump.md
+++ b/docs/commands/dump.md
@@ -13,10 +13,6 @@ allowing you to configure the restoration process as needed.
 Mostly it supports the same flags as the `pg_dump` utility, with some extra flags for Greenmask-specific features.
 
 ```text title="Supported flags"
-Usage:
-  greenmask dump [flags]
-
-Flags:
   -b, --blobs                           include large objects in dump
   -c, --clean                           clean (drop) database objects before recreating
   -Z, --compress int                    compression level for compressed formats (default -1)
@@ -51,7 +47,7 @@ Flags:
       --no-tablespaces                  do not dump tablespace assignments
       --no-toast-compression            do not dump TOAST compression methods
       --no-unlogged-table-data          do not dump unlogged table data
-      --on-conflict-do-nothing          add ON CONFLICT DO NOTHING to INSERT commands
+      --pgzip                           use pgzip compression instead of gzip
   -p, --port int                        database server port number (default 5432)
       --quote-all-identifiers           quote all identifiers, even if not key words
   -n, --schema strings                  dump the specified schema(s) only
@@ -66,4 +62,11 @@ Flags:
       --use-set-session-authorization   use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
   -U, --username string                 connect as specified database user (default "postgres")
   -v, --verbose string                  verbose mode
-```
\ No newline at end of file
+```
+
+### Pgzip compression
+
+By default, Greenmask uses gzip compression when dumping data. In most cases this is quite slow, does not utilize all
+available resources, and becomes a bottleneck for IO operations. To speed up the dump process, you can use
+the `--pgzip` flag to switch to pgzip compression instead. This method splits the data into blocks, which are
+compressed in parallel, making it ideal for handling large volumes of data. The output remains a standard gzip file.
diff --git a/docs/commands/restore.md b/docs/commands/restore.md
index 081094b0..6db4a469 100644
--- a/docs/commands/restore.md
+++ b/docs/commands/restore.md
@@ -18,7 +18,6 @@ allowing you to configure the restoration process as needed.
 Mostly it supports the same flags as the `pg_restore` utility, with some extra flags for Greenmask-specific features.
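For instance, a restore that combines parallel jobs with the pgzip decompression described below could look like this (the job count is illustrative):

```shell
# Restore the latest dump with 4 parallel jobs and parallel gzip decompression.
greenmask --config=config.yml restore latest --jobs=4 --pgzip
```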
 ```text title="Supported flags"
-Flags:
  -c, --clean                        clean (drop) database objects before recreating
  -C, --create                       create the target database
  -a, --data-only                    restore only the data, no schema
  -d, --dbname string                connect to database name (default "postgres")
      --disable-triggers             disable triggers during data-only restore
      --enable-row-security          enable row security
  -N, --exclude-schema strings       do not restore objects in this schema
  -e, --exit-on-error                exit on error, default is to continue
  -f, --file string                  output file name (- for stdout)
  -P, --function strings             restore named function
  -h, --host string                  database server host or socket directory (default "/var/run/postgres")
      --if-exists                    use IF EXISTS when dropping objects
  -i, --index strings                restore named index
  -j, --jobs int                     use this many parallel jobs to restore (default 1)
      --list-format string           use table of contents in format of text, json or yaml (default "text")
      --no-comments                  do not restore comments
      --no-data-for-failed-tables    do not restore data of tables that could not be created
  -O, --no-owner string              skip restoration of object ownership
  -X, --no-privileges                skip restoration of access privileges (grant/revoke)
      --no-publications              do not restore publications
      --no-security-labels           do not restore security labels
      --no-subscriptions             do not restore subscriptions
      --no-table-access-method       do not restore table access methods
      --no-tablespaces               do not restore tablespace assignments
      --on-conflict-do-nothing       add ON CONFLICT DO NOTHING to INSERT commands
+     --pgzip                        use pgzip decompression instead of gzip
  -p, --port int                     database server port number (default 5432)
      --restore-in-order             restore tables in topological order, ensuring that dependent tables are not restored until the tables they depend on have been restored
  -n, --schema strings               restore only objects in this schema
  -s, --schema-only string           restore only the schema, no data
      --section string               restore named section (pre-data, data, or post-data)
  -1, --single-transaction           restore as a single transaction
      --strict-names                 require table and/or schema include patterns to match at least one entity each
  -S, --superuser string             superuser user name to use for disabling triggers
  -t, --table strings                restore named relation (table, view, etc.)
  -T, --trigger strings              restore named trigger
  -L, --use-list string              use table of contents from this file for selecting/ordering output
      --use-set-session-authorization   use SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to set ownership
  -U, --username string              connect as specified database user (default "postgres")
  -v, --verbose string               verbose mode
@@ -106,3 +106,11 @@ If your database has cyclic dependencies you will be notified about it but the r
 ```text
 2024-08-16T21:39:50+03:00 WRN cycle between tables is detected: cannot guarantee the order of restoration within cycle cycle=["public.employees","public.departments","public.projects","public.employees"]
 ```
+
+### Pgzip decompression
+
+By default, Greenmask uses gzip decompression to restore data. In most cases this is quite slow, does not utilize all
+available resources, and becomes a bottleneck for IO operations. To speed up the restoration process, you can use
+the `--pgzip` flag to switch to pgzip decompression instead. This method splits the data into blocks, which are
+decompressed in parallel, making it ideal for handling large volumes of data. The input remains a standard gzip file.
+
diff --git a/docs/configuration.md b/docs/configuration.md
index 4c7d14a1..6215c71b 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -85,7 +85,7 @@ two storage `type` options are supported: `directory` and `s3`.
 In the `dump` section of the configuration, you configure the `greenmask dump` command. It includes the following
 parameters:
 
-* `pg_dump_options` — a map of `pg_dump` options to configure the behavior of the command itself. You can refer to the list of supported `pg_dump` options in the [Greenmask dump command documentation](commands.md#dump).
+* `pg_dump_options` — a map of `pg_dump` options to configure the behavior of the command itself. You can refer to the list of supported `pg_dump` options in the [Greenmask dump command documentation](commands/dump.md).
 * `transformation` — this section contains configuration for applying transformations to table columns during the dump
   operation. It includes the following sub-parameters:
 
   * `schema` — the schema name of the table
@@ -225,10 +225,10 @@ validate:
 1. A list of tables to validate. If this list is not empty, the validation operation will only be performed for the specified tables. Tables can be written with or without the schema name (e.g., `"public.cart"` or `"orders"`).
 2. Specifies whether to perform data transformation for a limited set of rows. If set to `true`, data transformation will be performed, and the number of rows transformed will be limited to the value specified in the `rows_limit` parameter (default is `10`).
-3. Specifies whether to perform diff operations for the transformed data. If set to `true`, the validation process will **find the differences between the original and transformed data**. See more details in the [validate command documentation](commands.md/#validate).
+3. Specifies whether to perform diff operations for the transformed data. If set to `true`, the validation process will **find the differences between the original and transformed data**.
See more details in the [validate command documentation](commands/validate.md).
 4. Limits the number of rows to be transformed during validation. The default limit is `10` rows, but you can change it by modifying this parameter.
 5. A hash list of resolved warnings. These warnings have been addressed and resolved in a previous validation run.
-6. Specifies the format of the transformation output. Possible values are `[horizontal|vertical]`. The default format is `horizontal`. You can choose the format that suits your needs. See more details in the [validate command documentation](commands.md/#validate).
+6. Specifies the format of the transformation output. Possible values are `[horizontal|vertical]`. The default format is `horizontal`. You can choose the format that suits your needs. See more details in the [validate command documentation](commands/validate.md).
 7. The output format (`json` or `text`).
 8. Specifies whether to compare the current schema with the previous one and print the differences, if any.
 9. If set to `true`, the transformation output will contain only the transformed columns and primary keys.
@@ -239,7 +239,7 @@ validate:
 In the `restore` section of the configuration, you can specify parameters for the `greenmask restore` command. It contains `pg_restore` settings and custom script execution settings. Below you can find the available parameters:
 
 * `pg_restore_options` — a map of `pg_restore` options that are used to configure the behavior of
-  the `pg_restore` utility during the restoration process. You can refer to the list of supported `pg_restore` options in the [Greenmask restore command documentation](commands.md#restore).
+  the `pg_restore` utility during the restoration process. You can refer to the list of supported `pg_restore` options in the [Greenmask restore command documentation](commands/restore.md).
 * `scripts` — a map of custom scripts to be executed during different restoration stages. Each script is associated with a specific restoration stage and includes the following attributes:
   * `[pre-data|data|post-data]` — the name of the restoration stage when the script should be executed; has the following parameters:
     * `name` — the name of the script
diff --git a/docs/playground.md b/docs/playground.md
index 23f51a53..b367f679 100644
--- a/docs/playground.md
+++ b/docs/playground.md
@@ -54,7 +54,7 @@ Below you can see Greenmask commands:
 
 * `completion` — generates the autocompletion script for the specified shell.
 
-To learn more about them, see [Commands](commands.md).
+To learn more about them, see [Commands](commands/index.md).
 
 ## Transformers
diff --git a/docs/release_notes/greenmask_0_1_5.md b/docs/release_notes/greenmask_0_1_5.md
index bf94a6e1..f9c04cc3 100644
--- a/docs/release_notes/greenmask_0_1_5.md
+++ b/docs/release_notes/greenmask_0_1_5.md
@@ -4,7 +4,7 @@ This release introduces a new Greenmask command, improvements, bug fixes, and nu
 
 ## New features
 
-Added a new Greenmask CLI command—[show-transformer](../commands.md#show-transformer) that shows detailed information about a specified transformer.
+Added a new Greenmask CLI command, [show-transformer](../commands/show-transformer.md), that shows detailed information about a specified transformer.
 
 ## Improvements