
First-class COPY support #62

Open
jasonmp85 opened this issue Jan 26, 2015 · 8 comments

@jasonmp85 (Collaborator)

This ticket is to track full support for the COPY command. Unlike the trigger implementation in #61, this would mean supporting a bulk method for data ingestion. Issues like consistency and isolation will show up, as well as failure modes.

@ozgune (Contributor) commented Jan 26, 2015

Hey, I'm adding my notes here on COPY support.

@jasonmp85, I think some of the points here could relate to #61; if you see items here that are relevant there, could you copy and paste them over?

  1. How does one invoke the COPY operation? In cstore_fdw, we intercept the utility hook. If we have a COPY command, we then route that logic to our insert function.
  2. How do we process options (format, delimiter, null character, etc.) to copy? If we intercept the utility hook, this happens automatically.
  3. What happens when we observe a failure? I imagine two types of failures: (a) bad data, and (b) can't send request to any of the replicas. The first error happens much more frequently.

On the last item, this has been heavily discussed in the context of PostgreSQL too: https://wiki.postgresql.org/wiki/Error_logging_in_COPY

Proprietary databases that extend PostgreSQL usually set a threshold for COPY errors. For example, if a COPY observes 5 errors in one file (or errors in 1% of its rows), it stops altogether; otherwise, it continues loading data.
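The threshold behavior described above could be sketched roughly like this (a hypothetical illustration, not pg_shard code; the function name, limits, and the minimum-sample rule are assumptions):

```python
# Hypothetical sketch of a COPY error threshold: skip bad rows, but abort
# the load once errors hit an absolute cap or exceed a fraction of the
# rows seen so far. The specific values and names are assumed.
MAX_ERRORS = 5          # absolute error cap per file (assumed)
MAX_ERROR_RATE = 0.01   # 1% of rows (assumed)
MIN_SAMPLE = 100        # only apply the rate check after this many rows (assumed)

def copy_rows(rows, insert_row):
    """Try to ingest each row; return the number successfully loaded."""
    errors = 0
    loaded = 0
    for total, row in enumerate(rows, start=1):
        try:
            insert_row(row)
            loaded += 1
        except ValueError:  # stand-in for a "bad data" parse error
            errors += 1
            if errors >= MAX_ERRORS or (
                total >= MIN_SAMPLE and errors / total > MAX_ERROR_RATE
            ):
                raise RuntimeError(
                    f"aborting COPY: {errors} errors in {total} rows")
    return loaded
```

Without a minimum sample, the rate check would trip on the very first bad row (1 error out of 1 row is 100%), which is why real systems tend to gate it on an absolute count or a warm-up window.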

@jasonmp85 (Collaborator, Author)

I think the most difficult problem to overcome will be "what happens when a replica fails partway through", not "what happens when you can't send data to any replica". Do we roll back entirely, or do we support partial data loads (a feature not directly supported by the existing COPY interface within PostgreSQL)?

We can mark a shard as bad if it has a failure, but what about the other shards? Do we finish ingesting the data to them all? If so, how does the user fill in the missing data while omitting the shards that have already been processed? These are questions we'll need to answer for any usable implementation.
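The bookkeeping implied by these questions could look something like the following (a minimal sketch; `load_into_shards`, `send_batch`, and the use of sets are all hypothetical, not pg_shard APIs):

```python
# Hypothetical bookkeeping for a sharded bulk load: try every shard,
# mark failed ones, and report which shards finished so a retry could
# skip them. None of these names come from pg_shard itself.

def load_into_shards(batches_by_shard, send_batch):
    """Best-effort load; return (finished, failed) sets of shard ids."""
    finished, failed = set(), set()
    for shard_id, batch in batches_by_shard.items():
        try:
            send_batch(shard_id, batch)
            finished.add(shard_id)
        except ConnectionError:  # stand-in for a placement failure
            failed.add(shard_id)  # mark the shard bad; keep loading the rest
    return finished, failed
```

Reporting the `finished` set back to the user is one possible answer to "how does the user fill in the missing data while omitting shards that were already processed" - a retry would target only the `failed` set.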

jasonmp85 changed the title from "COPY support" to "First-class COPY support" on Feb 2, 2015
@jasonmp85 (Collaborator, Author)

@marcocitus mentioned pgloader the other day… maybe we can look at it for inspiration re: partial failures or ignore-and-continue semantics.

@rsolari commented Mar 5, 2015

Hi,
I'm following this issue and #61 with interest. Have you decided on the failure modes?

With the current version of pg_shard, we're planning to INSERT one row at a time instead of using a COPY trigger, because it's hard to recover from a failed COPY trigger. I'd like to mirror your COPY failure modes with our INSERT failure modes, so that it'll be easier to migrate our INSERTs to a COPY when the time comes.

@jasonmp85 (Collaborator, Author)

What makes recovering from a failed COPY trigger difficult? I believe we were careful to output the number of rows copied, which should allow a caller to resume at a certain row number to continue the operation. That's the short-term plan at the moment (output total number of rows copied, to reflect the contiguous successes from the beginning of the input file).
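The resume-by-row-count idea above can be sketched in a few lines (an illustrative example only; `resume_copy` is a hypothetical name, and this assumes the reported count reflects contiguous successes from the start of the file):

```python
# Hypothetical resume logic: given the number of rows already copied
# (contiguous successes from the start of the input), skip that many
# rows and continue ingesting from there.
import itertools

def resume_copy(rows, rows_already_copied, insert_row):
    """Skip already-loaded rows, ingest the rest; return the new total."""
    copied = rows_already_copied
    for row in itertools.islice(rows, rows_already_copied, None):
        insert_row(row)
        copied += 1
    return copied
```

This only works if the "rows copied" count is a prefix length; once partial per-shard failures allow gaps in the middle of the file, a single counter no longer identifies which rows are missing.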

@rsolari commented Mar 5, 2015

We couldn't safely parallelize different instances of the copy_to_insert function, so we couldn't keep track of the count of rows copied. If we create the function only once, we get safe parallelism, but we lose the counts.

This answer is related to my comments on #61 last week. Maybe it'd be more on-topic there?

@jasonmp85 (Collaborator, Author)

Yeah let's move there.

@mvanderlee
+1. I wanted to combine BDR with pg_shard to get a multi-cluster setup, but BDR uses COPY for at least the initial data dump, which prevents me from setting this up.

4 participants