Document/Improve: Continuous batching/chunking of a stream #118

Open
jmealo opened this issue Mar 21, 2021 · 4 comments

jmealo commented Mar 21, 2021

I'm creating this issue to discuss solutions (documentation or otherwise) to help folks avoid some of the non-memory-related pitfalls from #116 when continuously batching COPY commands.

See:
#116 (comment)

@jeromew: I found out what happens if you always have a COPY in progress. It works great until something finally manages to get a lock on the table, and then you have a bunch of COPY operations pending. From a locking perspective, it's really not ideal to start a COPY, stream data into it, close that stream, start a new COPY stream, and immediately pipe into it: you're effectively never yielding the table for an exclusive lock.

🔥 If you do not give PostgreSQL some breathing room between COPY operations, you're going to run into 🔒 locking issues 🔒.

I'm interested in hearing ideas for the best way to handle this. In the meantime, I'm going to dump some tips for anyone working on this in the future.

Pro-tip: Safely create async batches

Under high load, a new event may arrive before the previous event's call to getBatch() resolves. This can lead to orphaned batches (memory leak) and lost messages.

// Shared state: the promise for the batch that is currently being created (if any)
let nextBatch

const getBatch = async () => {
  // If another batch is already pending, return the promise that will resolve to that batch.
  // You can .then() or await the same promise multiple times; you cannot resolve()/reject() multiple values.
  if (nextBatch) {
    return nextBatch
  }

  try {
    nextBatch = producer.createBatch({ maxSizeInBytes })
    return await nextBatch
  } finally {
    // Clear the pending promise so the next event triggers a fresh batch
    nextBatch = undefined
  }
}
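
For illustration, a usage sketch (the events emitter and the add() call on the batch are hypothetical stand-ins, not part of the code above): two events arriving back-to-back both resolve to the same batch instead of each creating one.

// Hypothetical event handler: if a second 'log-line' fires before the first
// createBatch() resolves, both handlers await the same pending promise.
events.on('log-line', async (line) => {
  const batch = await getBatch()
  batch.add(line) // hypothetical method on the batch object
})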

Anti-pattern: Immediately unpiping and re-piping into rotating COPY streams

// Oops! If you immediately rotate streams, you will always hold a lock on the table
// (until you don't, and then COPY operations pile up in a `WAITING` state, especially
// if you have control/metadata queries interleaved between them)
const rotateCopyStream = async (logStream) => {
  // disconnecting the csv stream from our log stream causes the log stream to pause
  if (csvStream) logStream.unpipe(csvStream)

  if (pgStream) {
    // disconnecting the pg stream from the csv stream causes the csv stream to end
    csvStream.unpipe(pgStream)
    pgStream.end() // calling this ends the COPY command
  }

  // Get connection and execute our copy command
  const client = await pool.connect()
  pgStream = client.query(copyFrom(COPY_QUERY))

  // Debugging helps
  pgStream.on('error', (error) => {
    if (client) client.release()
    console.log('pgStream -> stream errored')
    console.log(error)
  })

  pgStream.on('finish', () => {
    if (client) client.release()
    console.log('pgStream -> stream complete')
  })

  csvStream = format({headers: false, includeEndRowDelimiter: true, delimiter: '\t'})
  logStream.pipe(csvStream, {end: false}).pipe(pgStream)
}

@jbtobar: I'm using 100% streams in my workflow, but I've found a Node.js foot cannon when batching things coming from an event emitter/event handler/subscription. If batch creation is async, you need to make sure that if a second event comes in while the batches are being rotated, you don't end up creating two or more batches. The getBatch() example above shows one way to accomplish this.

I considered putting an advisory lock at the beginning of the COPY command. This would give us a chance to keep our input stream paused and resume flow once the last COPY is complete, with room for application-level logic. I haven't tried it yet, but that's my next plan of attack.
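
For reference, a rough sketch of that advisory-lock idea (untested; the lock key, target table, and pipeline-based plumbing are assumptions for illustration, not something from this issue):

const { pipeline } = require('node:stream/promises')
const { from: copyFrom } = require('pg-copy-streams')

const LOCK_KEY = 424242 // hypothetical application-chosen advisory lock key

const copyOneBatch = async (pool, rowStream) => {
  const client = await pool.connect()
  try {
    // Block here while another job holds the same advisory lock
    await client.query('SELECT pg_advisory_lock($1)', [LOCK_KEY])
    const pgStream = client.query(copyFrom('COPY logs FROM STDIN')) // hypothetical target table
    await pipeline(rowStream, pgStream) // resolves once this COPY has finished
  } finally {
    // Release between batches so other work gets a chance at the table
    await client.query('SELECT pg_advisory_unlock_all()')
    client.release()
  }
}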

jeromew commented Aug 20, 2021

Regarding the "antipattern" that is mentioned in #118 (comment)

pgStream.end() // calling this ends the COPY command

Calling end() only starts the process of ending the COPY operation (it sends CopyDone). You probably should not call

pgStream = client.query(copyFrom(COPY_QUERY))

and pipe into the new rotated stream until the previous stream emits 'finish'.

In the code, what happens is:

COPY -> COMMITTING
     -> COPY

when what you want, in order to leave more breathing room to the backend, may be:

COPY -> COMMITTING -> COPY
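
In other words, the rotation helper should wait for the previous stream's 'finish' event before opening the next COPY. A minimal sketch of that ordering, reusing the identifiers from the snippet above (simplified, with no error handling around the rotation itself):

const { once } = require('node:events')

const rotateCopyStream = async (logStream) => {
  if (pgStream) {
    logStream.unpipe(csvStream)
    csvStream.unpipe(pgStream)
    pgStream.end()                 // sends CopyDone ...
    await once(pgStream, 'finish') // ... then wait until the COPY has actually completed
  }

  // Only now open the next COPY, so the backend sees COPY -> COMMITTING -> COPY
  const client = await pool.connect()
  pgStream = client.query(copyFrom(COPY_QUERY))
  pgStream.on('error', () => client.release())
  pgStream.on('finish', () => client.release())

  csvStream = format({ headers: false, includeEndRowDelimiter: true, delimiter: '\t' })
  logStream.pipe(csvStream, { end: false }).pipe(pgStream)
}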

jeromew commented Aug 20, 2021

Just to make it clear: this issue is for discussing a pattern that attracts some people because of the possibilities offered by pg-copy-streams.

Some people want to create a long-lived ingestion mechanism to ingest data into postgresql. For example, you could imagine that you have a never-ending stream of logs that you want to ingest into postgresql.

This could be done by:

  1. inserting log lines one by one via INSERT commands
  2. inserting batched volumes of log lines (let's say 100s) via MULTI-INSERT commands, cf. "insert multiple rows" on https://www.postgresql.org/docs/14/dml-insert.html

There are probably other ways to do that with postgresql, including mechanisms linked to replication.

But it can still be interesting to see if and how COPY can help with this kind of scenario.

First, it is necessary to understand that a COPY FROM operation ingests rows. During the operation, the rows are not visible. They will become visible only once the COPY operation is terminated.

So pg-copy-streams cannot be used to open a long-lived stream, push data into it, and expect that this data will be queryable. For the data to be visible and queryable, you need to finish the COPY operation.

Once this is said, the idea of rotated COPY operations appears: ingest N rows via COPY, finish that COPY, and then create a new COPY operation.
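
For the sake of discussion, a sketch of that rotation driven by a row counter (the table name, batch size, and the assumption that each row already ends with a newline are illustrative only):

const { pipeline } = require('node:stream/promises')
const { Readable } = require('node:stream')
const { from: copyFrom } = require('pg-copy-streams')

// Ingest an async iterable of newline-terminated rows, finishing a COPY
// (and thus making its rows visible) every `batchSize` rows
const ingestInBatches = async (pool, rows, batchSize) => {
  let batch = []
  for await (const row of rows) {
    batch.push(row)
    if (batch.length >= batchSize) {
      await copyBatch(pool, batch) // the previous COPY fully finishes before the next one starts
      batch = []
    }
  }
  if (batch.length > 0) await copyBatch(pool, batch)
}

const copyBatch = async (pool, batch) => {
  const client = await pool.connect()
  try {
    const pgStream = client.query(copyFrom('COPY logs FROM STDIN')) // hypothetical table
    await pipeline(Readable.from([batch.join('')]), pgStream)
  } finally {
    client.release() // no COPY in progress between batches, so the table is yielded
  }
}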

It could be interesting to agree on the advantages and disadvantages of the different solutions that can be used to do that, and to see how each impacts the performance of postgresql depending on the throughput of the ingested data and on the throughput of the out-of-band queries that need to be executed on the data.

I will leave this issue open for several months to see if people want to share their experience, have ideas around this, and discuss this further.

jmealo commented Apr 26, 2024

@jeromew: Howdy, here I am bulk ingesting data into PostgreSQL again 😃 ... Multi-INSERT commands seem slow compared to COPY, but I haven't benchmarked it yet.

I'm realizing that a good compromise could be a library that wraps pg-copy-streams, adds the logic for safely handling batches of bulk inserts that flush for visibility, and documents everything this issue covers.

If there was any interest in linking to it from this library, I'd gladly consider implementing it. Thoughts? @jeromew @brianc

jeromew commented May 3, 2024

Hello,

I understand the need to find a nice streaming ingestion mechanism for long-lived streams and mega-huge ingested files, with regular visibility on the data being ingested (using one COPY operation only gives visibility after the COPY is committed).

It would indeed be interesting to benchmark COPY vs Multi-Insert.

By Multi-Insert for N rows, I do not mean N INSERT operations ("1 insert per row"), but rather the ability postgres has to insert several rows as a batch in one operation (cf. https://www.postgresql.org/docs/current/dml-insert.html) and thus send 1 INSERT with N rows.

You can insert multiple rows in a single command:

INSERT INTO products (product_no, name, price) VALUES
    (1, 'Cheese', 9.99),
    (2, 'Bread', 1.99),
    (3, 'Milk', 2.99);

The TIP on this page says

When inserting a lot of data at the same time, consider using the COPY command. It is not as flexible as the INSERT command, but is more efficient.

I fully agree for 1 COPY versus many single-row INSERTs, but if regular visibility is important to you, I would be surprised if the time spent repeatedly setting up a COPY operation for N rows outperformed a regular INSERT with N rows, and even more so if the N-row INSERT is a prepared statement.
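
To make that comparison point concrete, a hedged sketch of an N-row prepared INSERT with node-postgres (the batch size and helper are illustrative; a fixed batch size keeps the statement text stable so it can be prepared once via `name`):

const BATCH_SIZE = 100 // illustrative: the statement below is only valid for exactly this many rows

// Build "($1, $2, $3), ($4, $5, $6), ..." once for the fixed batch size
const placeholders = Array.from({ length: BATCH_SIZE }, (_, i) =>
  `($${i * 3 + 1}, $${i * 3 + 2}, $${i * 3 + 3})`
).join(', ')

// rows: an array of exactly BATCH_SIZE [product_no, name, price] tuples
const insertBatch = (client, rows) =>
  client.query({
    name: 'insert-products-batch', // node-postgres prepares this statement once per connection
    text: `INSERT INTO products (product_no, name, price) VALUES ${placeholders}`,
    values: rows.flat(),
  })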

Another specificity when using COPY is that copy-from does not require the incoming stream to respect row boundaries (Postgres has everything it needs to detect row boundaries and reconstruct rows from streamed network chunks), but in your solution you would need to either add a row-boundary detection mechanism or require incoming streams to respect row boundaries.
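
As an illustration of the first option, a minimal row-boundary transform (a sketch, not part of pg-copy-streams) that re-chunks arbitrary text input so each emitted chunk ends on a newline:

const { Transform } = require('node:stream')

// Buffers any trailing partial row and only emits data ending on '\n',
// so downstream batching logic always sees complete rows
const rowBoundary = () => {
  let tail = ''
  return new Transform({
    transform(chunk, _encoding, callback) {
      const data = tail + chunk.toString('utf8')
      const cut = data.lastIndexOf('\n')
      if (cut === -1) {          // no complete row yet, keep buffering
        tail = data
        return callback()
      }
      tail = data.slice(cut + 1) // partial row after the last newline
      callback(null, data.slice(0, cut + 1))
    },
    flush(callback) {
      if (tail) this.push(tail)  // emit any final partial row
      callback()
    },
  })
}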

Note also that if the stream is an object stream with 1 row per chunk, it is important performance-wise to recombine the rows into bigger network chunks, as per #127.
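
A hedged sketch of that recombination step (the 64 KiB threshold is arbitrary and the chunks are assumed to be strings or buffers; see #127 for the underlying rationale):

const { Transform } = require('node:stream')

// Coalesce many small row chunks into larger buffers before they reach the
// COPY stream, to reduce per-chunk overhead
const coalesce = (thresholdBytes = 64 * 1024) => {
  let pending = []
  let size = 0
  return new Transform({
    transform(chunk, _encoding, callback) {
      const buf = Buffer.from(chunk)
      pending.push(buf)
      size += buf.length
      if (size >= thresholdBytes) {
        callback(null, Buffer.concat(pending))
        pending = []
        size = 0
      } else {
        callback()
      }
    },
    flush(callback) {
      if (pending.length) this.push(Buffer.concat(pending))
      callback()
    },
  })
}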
