From 754b13189f314ca65ee649588c94ed32c17aeb39 Mon Sep 17 00:00:00 2001 From: Antoine Beyeler Date: Mon, 14 Oct 2024 17:43:42 +0200 Subject: [PATCH 1/4] Add a How-to guide for the dataframe API --- docs/content/howto/dataframe-api.md | 214 ++++++++++++++++++++++++++++ 1 file changed, 214 insertions(+) create mode 100644 docs/content/howto/dataframe-api.md diff --git a/docs/content/howto/dataframe-api.md b/docs/content/howto/dataframe-api.md new file mode 100644 index 000000000000..9d8770520d95 --- /dev/null +++ b/docs/content/howto/dataframe-api.md @@ -0,0 +1,214 @@ +--- +title: Get data out from Rerun with code +order: 1600 +--- + +Rerun 0.19 added the Dataframe API to its SDK, which enables getting data out of Rerun from code. This page provides an overview of the API, as well as recipes to load the data in popular packages such as [Pandas](https://pandas.pydata.org), [Polars](https://pola.rs), and [DuckDB](https://duckdb.org). + + + +## The Dataframe API + +### Loading a recording + +A recording can be loaded from an RRD using the `load_recording()` function: + +```python +import rerun as rr + +recording = rr.dataframe.load_recording("/path/to/file.rrd") +``` + +Although RRD files generally contain a single recording, they may occasionally contain 2 or more. This can happen for example if the RRD includes a blueprint, which is stored as a recording that is separate from the data. + +For such RRDs, the `load_archive()` function can be used: + + + +```python +import rerun as rr + +archive = rr.dataframe.load_archive("/path/to/file.rrd") + +print(f"The archive contains {archive.num_recordings()} recordings.") + +for recording in archive.all_recordings(): + ... 
+``` + + The overall content of the recording can be inspected using the `schema()` method: + +```python +schema = recording.schema() +schema.index_columns() # list of all index columns (timelines) +schema.component_columns() # list of all component columns +``` + + +### Creating a view + +The first step for getting data out of a recording is to create a view, which requires specifying an index column and some content to include. + +As of Rerun 0.19, views must have exactly one index column, which can be any of the recording timelines. Each row of the view will correspond to a unique value of the index column. A `null` value is possible, and corresponds to data logged as static. In the future, it will be possible to have other kinds of columns as the index, and more than a single index column. + +The content defines which columns are included in the view and can be flexibly specified as entity filters, optionally providing a corresponding list of components. + +These are all valid ways to specify view content: + +```python +# everything in the recording +view = recording.view(index="frame_nr", contents="/**") + +# all `Scalar` components in the recording +view = recording.view(index="frame_nr", contents={"/**": ["Scalar"]}) + +# some components in an entity subtree and a specific component +# of a specific entity +view = recording.view(index="frame_nr", contents={ + "/world/robot/**": ["Position3D", "Color"], + "/world/scene": ["Text"], +}) +``` + +### Filtering rows in a view + +A view has several APIs to further filter the rows it will return. 
+ + + +**Filtering by time range** + +Rows may be filtered to keep only a given range of values from the index column: + +```python +# only keep rows for frames 0 to 10 +view = view.filter_range_sequence(0, 10) +``` + +This API exists for both temporal and sequence timelines, and for various units: +- `view.filter_range_sequence(start_frame, end_frame)` (takes `int` arguments) +- `view.filter_range_seconds(start_second, end_second)` (takes `float` arguments) +- `view.fiter_range_nanos(start_nano, end_nano)` (takes `int` arguments) + +**Filtering by index value** + +Rows may be filtered to keep only those whose index corresponds to a specific set of values: + +```python +view = view.filter_index_values([0, 5, 10]) +``` + +Note that a precise match is required. Since Rerun internally stores times as `int64`, this API is only available for integer arguments (nanos or sequence number). Floating point seconds would risk false mismatches due to numerical conversion. + + +**Filtering by column not null** + +Rows where a specific column has null values may be filtered out using the `filter_is_not_null()` method. When using this method, only rows for which a logging event exists for the provided column are returned. + +```python +# only keep rows where a position is available for the robot +view = view.filter_is_not_null(rr.dataframe.ComponentColumnSelector("/world/robot", "Position3D")) +``` + +### Specifying rows + +Instead of filtering rows based on the existing data, it is possible to specify exactly which rows must be returned by the view using the `using_index_values()` method: + +```python +# resample the first second of data at every millisecond +view = view.using_index_values(range(0, 1_000_000_000, 1_000_000)) +``` + +In this case, the view will return rows in multiples of 1e6 nanoseconds (i.e. for each millisecond) over a period of one second. A precise match on the index value is required for data to be produced on the row. 
For this reason, a floating point second API is again not provided for this feature. + +Note that this feature is typically used in conjunction with `fill_latest_at()` (see next paragraph) to enable arbitrary resampling of the original data. + + +### Filling empty values with latest-at data + +By default, the rows returned by the view may be sparse and contain values only for the columns where a logging event actually occurred at the corresponding index value. The view can optionally replace these empty cells using a latest-at query. This means that, for each such empty cell, the view traces back to find the last logged value and uses it instead. This is enabled by calling the `fill_latest_at()` method: + +```python +view = view.fill_latest_at() +``` + +### Reading the data + +Once the view is fully set up (possibly using the filtering features previously described), its content can be read using the `select()` method. This method optionally allows specifying which subset of columns should be produced: + + +```python +# select all columns +record_batches = view.select() + +# select only the specified columns +record_batches = view.select( + [ + rr.dataframe.IndexColumnSelector("frame_nr"), + rr.dataframe.ComponentColumnSelector("/world/robot", "Position3D"), + ], +) +``` + +The `select()` method returns a [`pyarrow.RecordBatchReader`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), which is essentially an iterator over a stream of [`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow-recordbatch)es containing the actual data. See the [PyArrow documentation](https://arrow.apache.org/docs/python/index.html) for more information. + +In the rest of this page, we explore how these `RecordBatch`es can be ingested in some of the popular data science packages. 
+ + +## Load data to a PyArrow `Table` + +The `RecordBatchReader` provides a [`read_all()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_all) method which directly produces a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table): + +```python +import rerun as rr + +recording = rr.dataframe.load_recording("/path/to/file.rrd") +view = recording.view(index="frame_nr", contents="/**") + +table = view.select().read_all() +``` + + +## Load data to a Pandas dataframe + +The `RecordBatchReader` provides a [`read_pandas()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_pandas) method which returns a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html): + + +```python +import rerun as rr + +recording = rr.dataframe.load_recording("/path/to/file.rrd") +view = recording.view(index="frame_nr", contents="/**") + +df = view.select().read_pandas() +``` + +## Load data to a Polars dataframe + +A [Polars dataframe](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html) can be created from a PyArrow table: + +```python +import rerun as rr +import polars as pl + +recording = rr.dataframe.load_recording("/path/to/file.rrd") +view = recording.view(index="frame_nr", contents="/**") + +df = pl.from_arrow(view.select().read_all()) +``` + + +## Load data to a DuckDB relation + +A [DuckDB](https://duckdb.org) relation can be created directly using the `pyarrow.RecordBatchReader` returned by `select()`: + +```python +import rerun as rr +import duckdb + +recording = rr.dataframe.load_recording("/path/to/file.rrd") +view = recording.view(index="frame_nr", contents="/**") + +rel = duckdb.arrow(view.select()) +``` \ No newline at end of file From 33d48f8687bb25aa477e5ffabcdf1ef5ffcee8c4 Mon Sep 17 00:00:00 2001 From: Antoine Beyeler Date: Mon, 14 Oct 2024 19:10:00 
+0200 Subject: [PATCH 2/4] Lint + Katya's comments --- docs/content/howto.md | 2 ++ docs/content/howto/dataframe-api.md | 17 ++++++++++------- 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/docs/content/howto.md b/docs/content/howto.md index 9b7e8da9890d..a06c0386fa8d 100644 --- a/docs/content/howto.md +++ b/docs/content/howto.md @@ -16,3 +16,5 @@ Guides for using Rerun in more advanced ways. - [By logging custom data](howto/extend/custom-data.md) - [By implementing custom visualizations (Rust only)](howto/extend/extend-ui.md) - [Efficiently log time series data using `send_columns`](howto/send_columns.md) + - [Get data out from Rerun with code](howto/dataframe-api.md) + \ No newline at end of file diff --git a/docs/content/howto/dataframe-api.md b/docs/content/howto/dataframe-api.md index 9d8770520d95..442256a5a96d 100644 --- a/docs/content/howto/dataframe-api.md +++ b/docs/content/howto/dataframe-api.md @@ -7,7 +7,7 @@ Rerun 0.19 added the Dataframe API to its SDK, which enables getting data out of -## The Dataframe API +## The dataframe API ### Loading a recording @@ -19,7 +19,7 @@ import rerun as rr recording = rr.dataframe.load_recording("/path/to/file.rrd") ``` -Although RRD files generally contain a single recording, they may occasionally contain 2 or more. This can happen for example if the RRD includes a blueprint, which is stored as a recording that is separate from the data. +Although RRD files generally contain a single recording, they may occasionally contain 2 or more. This can happen, for example, if the RRD includes a blueprint, which is stored as a recording that is separate from the data. For such RRDs, the `load_archive()` function can be used: @@ -52,7 +52,7 @@ The first step for getting data out of a recording is to create a view, which re As of Rerun 0.19, views must have exactly one index column, which can be any of the recording timelines. Each row of the view will correspond to a unique value of the index column. 
A `null` value is possible, and corresponds to data logged as static. In the future, it will be possible to have other kinds of columns as the index, and more than a single index column. -The content defines which columns are included in the view and can be flexibly specified as entity filters, optionally providing a corresponding list of components. +The content defines which columns are included in the view and can be flexibly specified as entity expressions, optionally providing a corresponding list of components. These are all valid ways to specify view content: @@ -60,6 +60,9 @@ These are all valid ways to specify view content: # everything in the recording view = recording.view(index="frame_nr", contents="/**") +# everything in the recording, except the /world/robot subtree +view = recording.view(index="frame_nr", contents="/**\n- /world/robot/**") + # all `Scalar` components in the recording view = recording.view(index="frame_nr", contents={"/**": ["Scalar"]}) @@ -75,7 +78,7 @@ view = recording.view(index="frame_nr", contents={ A view has several APIs to further filter the rows it will return. 
- + @@ -170,7 +173,7 @@ table = view.select().read_all() ``` -## Load data to a Pandas dataframe +## Load data to a Pandas dataframe The `RecordBatchReader` provides a [`read_pandas()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_pandas) method which returns a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html): @@ -184,7 +187,7 @@ view = recording.view(index="frame_nr", contents="/**") df = view.select().read_pandas() ``` -## Load data to a Polars dataframe +## Load data to a Polars dataframe A [Polars dataframe](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html) can be created from a PyArrow table: @@ -211,4 +214,4 @@ recording = rr.dataframe.load_recording("/path/to/file.rrd") view = recording.view(index="frame_nr", contents="/**") rel = duckdb.arrow(view.select()) -``` \ No newline at end of file +``` From 7b858da9d2463a87714f1d7c04d70bf9bee156c9 Mon Sep 17 00:00:00 2001 From: Antoine Beyeler Date: Mon, 14 Oct 2024 20:43:15 +0200 Subject: [PATCH 3/4] Typo --- docs/content/howto/dataframe-api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/content/howto/dataframe-api.md b/docs/content/howto/dataframe-api.md index 442256a5a96d..9e6b5efb4057 100644 --- a/docs/content/howto/dataframe-api.md +++ b/docs/content/howto/dataframe-api.md @@ -92,7 +92,7 @@ view = view.filter_range_sequence(0, 10) ``` This API exists for both temporal and sequence timelines, and for various units: - `view.filter_range_sequence(start_frame, end_frame)` (takes `int` arguments) - `view.filter_range_seconds(start_second, end_second)` (takes `float` arguments) -- `view.fiter_range_nanos(start_nano, end_nano)` (takes `int` arguments) +- `view.filter_range_nanos(start_nano, end_nano)` (takes `int` arguments) From 5a4c1d44e2d238a03ec7f4b134ae430d541349c9 Mon Sep 17 00:00:00 2001 From: 
gavrelina Date: Tue, 15 Oct 2024 15:08:49 +0200 Subject: [PATCH 4/4] testing small headers fix --- docs/content/howto/dataframe-api.md | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/docs/content/howto/dataframe-api.md b/docs/content/howto/dataframe-api.md index 9e6b5efb4057..a98458c532de 100644 --- a/docs/content/howto/dataframe-api.md +++ b/docs/content/howto/dataframe-api.md @@ -23,8 +23,8 @@ Although RRD files generally contain a single recording, they may occasionally c For such RRDs, the `load_archive()` function can be used: - + ```python import rerun as rr archive = rr.dataframe.load_archive("/path/to/file.rrd") print(f"The archive contains {archive.num_recordings()} recordings.") for recording in archive.all_recordings(): ... ``` + The overall content of the recording can be inspected using the `schema()` method: @@ -45,7 +46,6 @@ schema.index_columns() # list of all index columns (timelines) schema.component_columns() # list of all component columns ``` - ### Creating a view The first step for getting data out of a recording is to create a view, which requires specifying an index column and some content to include. @@ -80,7 +80,7 @@ A view has several APIs to further filter the rows it will return. 
-**Filtering by time range** +#### Filtering by time range Rows may be filtered to keep only a given range of values from the index column: @@ -90,11 +90,12 @@ view = view.filter_range_sequence(0, 10) ``` This API exists for both temporal and sequence timelines, and for various units: -- `view.filter_range_sequence(start_frame, end_frame)` (takes `int` arguments) -- `view.filter_range_seconds(start_second, end_second)` (takes `float` arguments) -- `view.filter_range_nanos(start_nano, end_nano)` (takes `int` arguments) -**Filtering by index value** +- `view.filter_range_sequence(start_frame, end_frame)` (takes `int` arguments) +- `view.filter_range_seconds(start_second, end_second)` (takes `float` arguments) +- `view.filter_range_nanos(start_nano, end_nano)` (takes `int` arguments) + +#### Filtering by index value Rows may be filtered to keep only those whose index corresponds to a specific set of values: @@ -104,8 +105,7 @@ view = view.filter_index_values([0, 5, 10]) ``` Note that a precise match is required. Since Rerun internally stores times as `int64`, this API is only available for integer arguments (nanos or sequence number). Floating point seconds would risk false mismatches due to numerical conversion. - -**Filtering by column not null** +#### Filtering by column not null Rows where a specific column has null values may be filtered out using the `filter_is_not_null()` method. When using this method, only rows for which a logging event exists for the provided column are returned. @@ -127,7 +127,6 @@ In this case, the view will return rows in multiples of 1e6 nanoseconds (i.e. fo Note that this feature is typically used in conjunction with `fill_latest_at()` (see next paragraph) to enable arbitrary resampling of the original data. - ### Filling empty values with latest-at data By default, the rows returned by the view may be sparse and contain values only for the columns where a logging event actually occurred at the corresponding index value. 
The view can optionally replace these empty cells using a latest-at query. This means that, for each such empty cell, the view traces back to find the last logged value and uses it instead. This is enabled by calling the `fill_latest_at()` method: @@ -140,7 +139,6 @@ view = view.fill_latest_at() Once the view is fully set up (possibly using the filtering features previously described), its content can be read using the `select()` method. This method optionally allows specifying which subset of columns should be produced: - ```python # select all columns record_batches = view.select() @@ -158,7 +156,6 @@ The `select()` method returns a [`pyarrow.RecordBatchReader`](https://arrow.apac In the rest of this page, we explore how these `RecordBatch`es can be ingested in some of the popular data science packages. - ## Load data to a PyArrow `Table` The `RecordBatchReader` provides a [`read_all()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_all) method which directly produces a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table): @@ -172,12 +169,10 @@ view = recording.view(index="frame_nr", contents="/**") table = view.select().read_all() ``` - ## Load data to a Pandas dataframe The `RecordBatchReader` provides a [`read_pandas()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_pandas) method which returns a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html): - ```python import rerun as rr @@ -201,7 +196,6 @@ view = recording.view(index="frame_nr", contents="/**") df = pl.from_arrow(view.select().read_all()) ``` - ## Load data to a DuckDB relation A [DuckDB](https://duckdb.org) relation can be created directly using the `pyarrow.RecordBatchReader` returned by `select()`: