
feat: add partial param v1 #25126

Open · JeanArhancet wants to merge 6 commits into main from feat/add-partial-param-v1

Conversation

@JeanArhancet (Contributor) commented Jul 7, 2024

Closes #25014

This pull request includes the following changes:

  • Adds the "Transfer-Encoding: chunked" header if the response is chunked.
  • Adds support for the partial parameter in the JSON response.

Check-list:

  • I've read the contributing section of the project README.
  • Signed CLA (if not already signed).

@hiltontj (Contributor) left a comment

I think this is not quite right. I have tried out the OSS version to see the behaviour of the partial parameter, and here is what I get:

curl -G 'http://localhost:8086/query?db=mydb&chunked=true&chunk_size=2' --data-urlencode 'q=SELECT * FROM "mymeas"'

{"results":[{"statement_id":0,"series":[{"name":"mymeas","columns":["time","myfield","mytag"],"values":[["2016-05-19T18:37:55Z",90,"1"],["2016-05-19T18:37:56Z",90,"1"]],"partial":true}],"partial":true}]}
{"results":[{"statement_id":0,"series":[{"name":"mymeas","columns":["time","myfield","mytag"],"values":[["2016-05-19T18:37:57Z",90,"1"],["2016-05-19T18:37:58Z",90,"1"]]}]}]}

You can see "partial": true only appears on the first line. It indicates whether there are more data points for that series or statement, respectively (but since we will only support single statements, I think we would only ever see it appear in both places, or in neither).

It does not appear in the second line because there are no more data points after that line.

In some of your tests, it is working as intended, but in others, you can see that it is appearing on lines where it should not, i.e., lines after which there are no more data points for that series. This might be because you are using the can_flush function to determine the value. I believe that function just indicates if it is ready to flush data to the output stream, based on data in the buffer, the chunk size specified, etc.


This requirement may be difficult to satisfy. The closest approach I can think of is using a subset of the check that can_flush does: just check whether the ChunkBuffer's series has a single entry, which would indicate that it is still working on the current series. But I am not sure that would produce the desired result in all cases.

@JeanArhancet (Contributor, Author)

@hiltontj Thanks for your feedback.

Do you have an example in the tests where the behavior is not as expected? From what I see, the partial flag is set to true for the first line but not for the second.

@hiltontj (Contributor) left a comment

@JeanArhancet - I should have done this in my original review, but here are in-line comments for each test, based on my understanding of how the partial response parameter should work.

Comment on lines +1006 to +1011
      ],
      "partial": true
    }
  ],
- "statement_id": 0
+ "statement_id": 0,
+ "partial": true

In this test case, it worked correctly

Comment on lines 1049 to 1054
      ],
      "partial": true
    }
  ],
- "statement_id": 0
+ "statement_id": 0,
+ "partial": true

In this test case, it did not work correctly, because there are no more results for the cpu measurement. I think in this case the second partial: true should be there, i.e., for the statement, but not the first, i.e., for the measurement.

Comment on lines +1093 to +1098
      ],
      "partial": true
    }
  ],
- "statement_id": 0
+ "statement_id": 0,
+ "partial": true

This one is correct ✅

Comment on lines 1111 to 1112
],
"partial": true

This one is not correct, as there are no more records for cpu.

Comment on lines +1115 to +1116
"statement_id": 0,
"partial": true

This one is correct, as there are more records for the statement.

Comment on lines +1130 to +1135
      ],
      "partial": true
    }
  ],
- "statement_id": 0
+ "statement_id": 0,
+ "partial": true

Both of these are incorrect, as there are no more records for the measurement or statement.

@JeanArhancet (Contributor, Author)

@hiltontj I believe I've fixed the issue. Could you please review the changes and let me know your thoughts?

@hiltontj (Contributor)

@JeanArhancet - it looks like you've made it so the existing test exhibits the correct behaviour, but I am not convinced that this will work properly for all cases.

The reason for my concern is that in this PR, the partial parameter's value is always determined by what is currently in the buffer, but it should also factor in what has yet to be buffered, i.e., what the input RecordBatchStream will yield on the next call to poll_next.

I wonder if using a peekable Stream would give a solution.
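The idea can be sketched with a std-iterator analogy (futures' StreamExt::peekable works the same way on an async Stream; this standalone sketch is illustrative, not the PR's code):

```rust
// Iterator analogy of a peekable Stream: peek() reveals whether more
// input remains without consuming it, which is exactly what `partial`
// needs to know before emitting each chunk.
fn label_partial<T>(items: Vec<T>) -> Vec<(T, bool)> {
    let mut it = items.into_iter().peekable();
    let mut out = Vec::new();
    while let Some(item) = it.next() {
        // `partial` is true only if something is still waiting behind us.
        let partial = it.peek().is_some();
        out.push((item, partial));
    }
    out
}
```

With three batches in, only the first two would be labeled partial; the last one, with nothing left to peek at, would not.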

@JeanArhancet (Contributor, Author)

> @JeanArhancet - it looks like you've made it so the existing test exhibits the correct behaviour, but I am not convinced that this will work properly for all cases.
>
> The reason for my concern is that in this PR, the partial parameter's value is always being determined by what is currently in the buffer, but should also be factoring in what has yet to be buffered. That is, what the input RecordBatchStream will yield on the next call to poll_next.
>
> I wonder if using a peekable Stream would give a solution.

Thanks for the clarification. I'm not entirely sure I understand correctly. Are both partial_results and partial_series incorrect in their current logic? Should I convert the QueryResponseStream input from Fuse<SendableRecordBatchStream> to Peekable<Fuse<SendableRecordBatchStream>> to address this issue?

@hiltontj (Contributor)

> Thanks for the clarification. I'm not entirely sure I understand correctly. Are both partial_results and partial_series incorrect in their current logic?

I think they are correct with respect to the data that has been buffered, but because they don't factor in what is still in the Stream, they may not be correct when a query produces a stream of many RecordBatches.

The current tests are at the integration level and don't really give you a lot of control over the underlying RecordBatch stream. You may be better off adding one or more unit tests where you can structure the size and number of batches in the stream.

> Should I convert the QueryResponseStream input from Fuse<SendableRecordBatchStream> to Peekable<Fuse<SendableRecordBatchStream>> to address this issue?

Yes, what I was hinting at would involve doing this. I think trying to set up a unit test where you can control the RecordBatches in the stream and better confirm that it is working as intended (or not) would be a more useful first step before trying to implement this though.

@hiltontj (Contributor) commented Jul 12, 2024

Re: the unit test suggestion, the iter function would be an easy way to manually set up a stream of RecordBatches from a Vec<RecordBatch>.

Edit: then you could adapt that into a SendableRecordBatchStream to create a QueryResponseStream for your tests.

@JeanArhancet (Contributor, Author)

@hiltontj I've added a unit test in the latest commit. The test fails because the latest response contains the partial flag set to true, but it should be set to none since it's the last response.

@hiltontj (Contributor) left a comment

@JeanArhancet - this is a great start, I left a comment in-line.

async fn test_partial_flag() {
    let batch = create_test_record_batch();
    let schema = batch.schema();
    let input_stream = stream::iter(vec![Ok(batch.clone())]);

Your input stream should have more than one record batch, as that will test the behaviour of the partial parameter when there are batches remaining in the stream, yet to be seen with poll_next.

If you can augment your create_test_record_batch method to generate batches for different measurements and with different sizes, then you could come up with and test different kinds of input streams, e.g.,

┌──────────────┐  ┌──────────────┐  ┌──────────────┐ 
│ RecordBatch0 │  │ RecordBatch1 │  │ RecordBatch2 │
│ meas: 'cpu'  │  │ meas: 'cpu'  │  │ meas: 'mem'  │
│ rows: 2      │  │ rows: 3      │  │ rows: 2      │
└──────────────┘, └──────────────┘, └──────────────┘

With the chunk size of 2, this should give the following chunks:

┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
│ Chunk0               │  │ Chunk1               │  │ Chunk2               │  │ Chunk3               │
│ meas: 'cpu'          │  │ meas: 'cpu'          │  │ meas: 'cpu'          │  │ meas: 'mem'          │
│ rows: 2              │  │ rows: 2              │  │ rows: 1              │  │ rows: 2              │  
│ partial(meas): true  │  │ partial(meas): true  │  │ partial(meas): false │  │ partial(meas): false │
│ partial(stmt): true  │  │ partial(stmt): true  │  │ partial(stmt): true  │  │ partial(stmt): false │
└──────────────────────┘, └──────────────────────┘, └──────────────────────┘, └──────────────────────┘
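That expected chunking can be modeled with a small, self-contained sketch (the Chunk struct and chunk_batches function are illustrative names, not the PR's actual types):

```rust
// Self-contained model of the expected chunking behaviour. A "batch" is
// (measurement, row count); a chunk never spans measurements and holds
// at most `chunk_size` rows. `partial_meas` / `partial_stmt` say whether
// more rows remain for that measurement / for the whole statement.
#[derive(Debug, PartialEq)]
struct Chunk {
    meas: &'static str,
    rows: usize,
    partial_meas: bool,
    partial_stmt: bool,
}

fn chunk_batches(batches: &[(&'static str, usize)], chunk_size: usize) -> Vec<Chunk> {
    // Flatten the batches into one measurement label per row.
    let rows: Vec<&'static str> = batches
        .iter()
        .flat_map(|&(m, n)| std::iter::repeat(m).take(n))
        .collect();
    let mut chunks = Vec::new();
    let mut i = 0;
    while i < rows.len() {
        let meas = rows[i];
        // Take up to chunk_size rows of the current measurement.
        let mut n = 0;
        while n < chunk_size && i + n < rows.len() && rows[i + n] == meas {
            n += 1;
        }
        i += n;
        chunks.push(Chunk {
            meas,
            rows: n,
            partial_meas: rows[i..].contains(&meas),
            partial_stmt: i < rows.len(),
        });
    }
    chunks
}
```

Feeding it the three batches above with a chunk size of 2 reproduces the four chunks in the table, including partial(meas) = false but partial(stmt) = true on the third chunk.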

@JeanArhancet (Contributor, Author)

Thanks for your suggestion. I have updated the test to align with that. With this new test, we can see that my implementation is not correct.

@hiltontj (Contributor)

Yes, I think that in order to make this work, the streaming/buffering logic needs to be refactored/rethought a bit. It would be great if you are willing to take a stab at it!

I updated the original issue description to make a note of this.

@JeanArhancet (Contributor, Author) commented Jul 25, 2024

@hiltontj I tried to solve this issue (see commit 25f05a6) but was not successful.

@hiltontj (Contributor)

Hmm, perhaps peekable is not the way to go. I wonder if the streaming logic, i.e., in the impl Stream for QueryResponseStream just needs to be re-ordered.

Right now, it always checks if it can flush before polling the input stream, which means that it could flush the entire buffer before polling the input stream again, and which seems to be the cause of the problem we are up against here.

If, instead, we just polled the input before flushing and then, when flushing, always left at most one chunk in the buffer, there would always be enough data in the buffer, when streaming chunks, to decide what partial should be.

The exception would be, when the input stream is done, and there are no more record batches, then you would want to flush the entire buffer.
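As a rough illustration of that ordering, here is a self-contained sketch with a deliberately simplified buffer (the real ChunkBuffer groups rows by series; names here are illustrative):

```rust
// Simplified model of the proposed flush ordering: after buffering new
// input, flush complete chunks but hold back up to one chunk's worth of
// rows, so `partial` can be decided from data actually in the buffer.
// Once the input stream is exhausted, flush everything that remains.
struct ChunkBuffer {
    rows: Vec<u32>,
    chunk_size: usize,
}

impl ChunkBuffer {
    fn flush(&mut self, input_done: bool) -> Vec<Vec<u32>> {
        // While input may still arrive, keep one chunk's worth back.
        let keep = if input_done { 0 } else { self.chunk_size };
        let mut out = Vec::new();
        while self.rows.len() > keep {
            let n = self.chunk_size.min(self.rows.len());
            out.push(self.rows.drain(..n).collect());
        }
        out
    }
}
```

With five buffered rows and a chunk size of 2, a flush while input remains emits two chunks and holds the last row back; a final flush after the stream ends emits it.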

@JeanArhancet (Contributor, Author)

@hiltontj Thanks for your feedback and description, but I'm a bit lost concerning the correct implementation and what I need to refactor. I don't think I will find the right approach 😅

@JeanArhancet (Contributor, Author)

@hiltontj I've just pushed another unit test to verify the presence of the partial flag when the stream size is 1.

@JeanArhancet force-pushed the feat/add-partial-param-v1 branch 2 times, most recently from c30d160 to ab971ce on August 1, 2024 at 07:44.
Development

Successfully merging this pull request may close these issues.

Support partial parameter in the chunked v1 /query API response