Document/Improve: Continuous batching/chunking of a stream #118

Open
jmealo opened this issue Mar 21, 2021 · 4 comments

jmealo commented Mar 21, 2021

I'm creating this issue to discuss solutions (documentation or otherwise) to help folks avoid some of the non-memory-related pitfalls from #116 when continuously batching COPY commands.

See:
#116 (comment)

@jeromew: I found out what happens if you always have a COPY in progress. It works great until something finally manages to get a lock on the table, and then you have a bunch of COPY operations pending. From a locking perspective, it's really not ideal to start a COPY, stream data into it, close that stream, start a new COPY stream, and immediately pipe into it: you're effectively never yielding the table for an exclusive lock.

🔥 If you do not give PostgreSQL some breathing room between COPY operations, you're going to run into 🔒 locking issues 🔒.

I'm interested in hearing ideas for the best way to handle this. In the meantime, I'm going to dump some tips for anyone working on this in the future.

Pro-tip: Safely create async batches

Under high load, a new event may arrive before the previous event's call to getBatch() resolves. This can lead to orphaned batches (memory leak) and lost messages.

// Shared state: the promise for the batch that is currently being created (if any)
let nextBatch

const getBatch = async () => {
  // If another batch is already pending, return the promise that will resolve to that batch.
  // You can .then() or await the same promise multiple times; you cannot resolve()/reject() multiple values.
  if (nextBatch) {
    return nextBatch
  }

  try {
    nextBatch = producer.createBatch({ maxSizeInBytes })
    return await nextBatch
  } finally {
    // Clear the pending promise so the next event triggers a fresh batch
    nextBatch = undefined
  }
}
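
For illustration, a usage sketch (the events emitter and the add() call on the batch are hypothetical stand-ins, not part of the code above): two events arriving back-to-back both resolve to the same batch instead of each creating one.

// Hypothetical event handler: if a second 'log-line' fires before the first
// createBatch() resolves, both handlers await the same pending promise.
events.on('log-line', async (line) => {
  const batch = await getBatch()
  batch.add(line) // hypothetical method on the batch object
})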

Anti-pattern: Immediately unpiping and re-piping into rotating COPY streams

// Oops! If you immediately rotate streams, you will always hold a lock on the table
// (until you don't, and then COPY operations pile up in a `WAITING` state, especially
// if you have control/metadata queries interleaved between them)
const rotateCopyStream = async (logStream) => {
  // disconnecting the csv stream from our log stream causes the log stream to pause
  if (csvStream) logStream.unpipe(csvStream)

  if (pgStream) {
    // disconnecting the pg stream from the csv stream causes the csv stream to end
    csvStream.unpipe(pgStream)
    pgStream.end() // calling this ends the COPY command
  }

  // Get connection and execute our copy command
  const client = await pool.connect()
  pgStream = client.query(copyFrom(COPY_QUERY))

  // Debugging helps
  pgStream.on('error', (error) => {
    if (client) client.release()
    console.log('pgStream -> stream errored')
    console.log(error)
  })

  pgStream.on('finish', () => {
    if (client) client.release()
    console.log('pgStream -> stream complete')
  })

  csvStream = format({headers: false, includeEndRowDelimiter: true, delimiter: '\t'})
  logStream.pipe(csvStream, {end: false}).pipe(pgStream)
}

@jbtobar: I'm using 100% streams in my workflow, but I've found a Node.js foot cannon when batching things coming from an event emitter/event handler/subscription. If batch creation is async, you need to make sure that if a second event comes in while the batches are being rotated, you don't end up creating two or more batches. The getBatch() example above shows one way to accomplish this.

I considered putting an advisory lock at the beginning of the COPY command. This would give us a chance to keep our input stream paused and resume flow once the last COPY is complete, with room for application-level logic. I haven't tried it yet, but that's my next plan of attack.
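
For reference, a rough sketch of that advisory-lock idea (untested; the lock key, target table, and pipeline-based plumbing are assumptions for illustration, not something from this issue):

const { pipeline } = require('node:stream/promises')
const { from: copyFrom } = require('pg-copy-streams')

const LOCK_KEY = 424242 // hypothetical application-chosen advisory lock key

const copyOneBatch = async (pool, rowStream) => {
  const client = await pool.connect()
  try {
    // Block here while another job holds the same advisory lock
    await client.query('SELECT pg_advisory_lock($1)', [LOCK_KEY])
    const pgStream = client.query(copyFrom('COPY logs FROM STDIN')) // hypothetical target table
    await pipeline(rowStream, pgStream) // resolves once this COPY has finished
  } finally {
    // Release between batches so other work gets a chance at the table
    await client.query('SELECT pg_advisory_unlock_all()')
    client.release()
  }
}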

jeromew commented Aug 20, 2021

Regarding the "antipattern" that is mentioned in #118 (comment)

pgStream.end() // calling this ends the COPY command

Calling end() only starts the process of ending the COPY operation (it sends CopyDone). You probably should not call

pgStream = client.query(copyFrom(COPY_QUERY))

and pipe into the new rotated stream until the previous stream emits 'finish'.

In the code, what happens is:

COPY -> COMMITTING
     -> COPY

when what you want, in order to leave more breathing room to the backend, may be:

COPY -> COMMITTING -> COPY
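
In other words, the rotation helper should wait for the previous stream's 'finish' event before opening the next COPY. A minimal sketch of that ordering, reusing the identifiers from the snippet above (simplified, with no error handling around the rotation itself):

const { once } = require('node:events')

const rotateCopyStream = async (logStream) => {
  if (pgStream) {
    logStream.unpipe(csvStream)
    csvStream.unpipe(pgStream)
    pgStream.end()                 // sends CopyDone ...
    await once(pgStream, 'finish') // ... then wait until the COPY has actually completed
  }

  // Only now open the next COPY, so the backend sees COPY -> COMMITTING -> COPY
  const client = await pool.connect()
  pgStream = client.query(copyFrom(COPY_QUERY))
  pgStream.on('error', () => client.release())
  pgStream.on('finish', () => client.release())

  csvStream = format({ headers: false, includeEndRowDelimiter: true, delimiter: '\t' })
  logStream.pipe(csvStream, { end: false }).pipe(pgStream)
}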

jeromew commented Aug 20, 2021

Just to make it clear: this issue is for discussing a pattern that attracts some people because of the possibilities offered by pg-copy-streams.

Some people want to create a long-lived ingestion mechanism to ingest data into postgresql. For example, you could imagine that you have a never-ending stream of logs that you want to ingest into postgresql.

This could be done by:

  1. inserting log lines one by one via INSERT commands
  2. inserting batched volumes of log lines (let's say 100s) via MULTI-INSERT commands, cf. "insert multiple rows" on https://www.postgresql.org/docs/14/dml-insert.html

There are probably other ways to do that with postgresql, including mechanisms linked to replication.

But it can still be interesting to see if and how COPY can help with this kind of scenario.

First, it is necessary to understand that a COPY FROM operation ingests rows. During the operation, the rows are not visible. They will become visible only once the COPY operation is terminated.

So pg-copy-streams cannot be used to open a long-lived stream, push data into it, and expect that this data will be queryable. For the data to be visible and queryable, you need to finish the COPY operation.

Once this is said, the idea of rotated COPY operations appears: ingest N rows via COPY, finish that COPY, and then create a new COPY operation.
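
For the sake of discussion, a sketch of that rotation driven by a row counter (the table name, batch size, and the assumption that each row already ends with a newline are illustrative only):

const { pipeline } = require('node:stream/promises')
const { Readable } = require('node:stream')
const { from: copyFrom } = require('pg-copy-streams')

// Ingest an async iterable of newline-terminated rows, finishing a COPY
// (and thus making its rows visible) every `batchSize` rows
const ingestInBatches = async (pool, rows, batchSize) => {
  let batch = []
  for await (const row of rows) {
    batch.push(row)
    if (batch.length >= batchSize) {
      await copyBatch(pool, batch) // the previous COPY fully finishes before the next one starts
      batch = []
    }
  }
  if (batch.length > 0) await copyBatch(pool, batch)
}

const copyBatch = async (pool, batch) => {
  const client = await pool.connect()
  try {
    const pgStream = client.query(copyFrom('COPY logs FROM STDIN')) // hypothetical table
    await pipeline(Readable.from([batch.join('')]), pgStream)
  } finally {
    client.release() // no COPY in progress between batches, so the table is yielded
  }
}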

It could be interesting to agree on the advantages and disadvantages of the different solutions that can be used to do that, and to see how each impacts the performance of postgresql depending on the throughput of the ingested data and on the throughput of the out-of-band queries that need to be executed on the data.

I will leave this issue open for several months to see if people want to share their experience, have ideas around this, and discuss this further.

jmealo commented Apr 26, 2024

@jeromew: Howdy, here I am bulk ingesting data into PostgreSQL again 😃 ... Multi-INSERT commands seem slow compared to COPY, but I haven't benchmarked it yet.

I'm realizing that a good compromise could be a library that wraps pg-copy-streams, adds the logic for safely handling batches of bulk inserts that flush for visibility, and documents everything this issue covers.

If there was any interest in linking to it from this library, I'd gladly consider implementing it. Thoughts? @jeromew @brianc

jeromew commented May 3, 2024

Hello,

I understand the need to find a nice streaming ingestion mechanism for long-lived streams and mega-huge ingested files, with regular visibility on the data being ingested (using one COPY operation only gives visibility after the COPY is committed).

It would indeed be interesting to benchmark COPY vs Multi-Insert.

By Multi-Insert for N rows, I do not mean N INSERT operations ("1 insert per row"), but rather the ability postgres has to insert several rows as a batch in one operation (cf. https://www.postgresql.org/docs/current/dml-insert.html) and thus send 1 INSERT with N rows.

You can insert multiple rows in a single command:

INSERT INTO products (product_no, name, price) VALUES
    (1, 'Cheese', 9.99),
    (2, 'Bread', 1.99),
    (3, 'Milk', 2.99);

The TIP on this page says

When inserting a lot of data at the same time, consider using the COPY command. It is not as flexible as the INSERT command, but is more efficient.

I fully agree for 1 COPY versus many single-row INSERTs, but if regular visibility is important to you, I would be surprised if the time spent repeatedly setting up a COPY operation for N rows outperformed a regular INSERT with N rows, and even more so if the N-row INSERT is a prepared statement.
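
To make that comparison point concrete, a hedged sketch of an N-row prepared INSERT with node-postgres (the batch size and helper are illustrative; a fixed batch size keeps the statement text stable so it can be prepared once via `name`):

const BATCH_SIZE = 100 // illustrative: the statement below is only valid for exactly this many rows

// Build "($1, $2, $3), ($4, $5, $6), ..." once for the fixed batch size
const placeholders = Array.from({ length: BATCH_SIZE }, (_, i) =>
  `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3})`
).join(', ')

// rows: an array of exactly BATCH_SIZE [product_no, name, price] tuples
const insertBatch = (client, rows) =>
  client.query({
    name: 'insert-products-batch', // node-postgres prepares this statement once per connection
    text: `INSERT INTO products (product_no, name, price) VALUES ${placeholders}`,
    values: rows.flat(),
  })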

Another specificity when using COPY is that copy-from does not require the incoming stream to respect row boundaries (Postgres has everything it needs to detect row boundaries and reconstruct rows from streamed network chunks), but in your solution you would need to either add a row-boundary detection mechanism or require incoming streams to respect row boundaries.
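
As an illustration of the first option, a minimal row-boundary transform (a sketch, not part of pg-copy-streams) that re-chunks arbitrary text input so each emitted chunk ends on a newline:

const { Transform } = require('node:stream')

// Buffers any trailing partial row and only emits data ending on '\n',
// so downstream batching logic always sees complete rows
const rowBoundary = () => {
  let tail = ''
  return new Transform({
    transform(chunk, _encoding, callback) {
      const data = tail + chunk.toString('utf8')
      const cut = data.lastIndexOf('\n')
      if (cut === -1) {          // no complete row yet, keep buffering
        tail = data
        return callback()
      }
      tail = data.slice(cut + 1) // partial row after the last newline
      callback(null, data.slice(0, cut + 1))
    },
    flush(callback) {
      if (tail) this.push(tail)  // emit any final partial row
      callback()
    },
  })
}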

Note also that if the stream is an object stream with 1 row per chunk, it is important performance-wise to recombine the rows into bigger network chunks, as per #127.
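
A hedged sketch of that recombination step (the 64 KiB threshold is arbitrary and the chunks are assumed to be strings or buffers; see #127 for the underlying rationale):

const { Transform } = require('node:stream')

// Coalesce many small row chunks into larger buffers before they reach the
// COPY stream, to reduce per-chunk overhead
const coalesce = (thresholdBytes = 64 * 1024) => {
  let pending = []
  let size = 0
  return new Transform({
    transform(chunk, _encoding, callback) {
      const buf = Buffer.from(chunk)
      pending.push(buf)
      size += buf.length
      if (size >= thresholdBytes) {
        callback(null, Buffer.concat(pending))
        pending = []
        size = 0
      } else {
        callback()
      }
    },
    flush(callback) {
      if (pending.length) this.push(Buffer.concat(pending))
      callback()
    },
  })
}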
