Add a "Getting started" guide for the dataframe API #7643

Merged: 24 commits, Oct 16, 2024

Commits:
53c9586  WIP code + doc skeleton (abey79, Oct 8, 2024)
fc2df91  WIP split in three parts (abey79, Oct 9, 2024)
70a689a  WIP (abey79, Oct 11, 2024)
9593d71  First complete draft of section 2 and 3 (abey79, Oct 11, 2024)
5c68542  First complete draft of part 1 as well (abey79, Oct 13, 2024)
969119d  Finalized all the text (abey79, Oct 14, 2024)
89a504d  Lints (abey79, Oct 14, 2024)
8b3d579  Move the script to the right place and opt out of running it (abey79, Oct 14, 2024)
d62048c  Actually correct location for the script (abey79, Oct 14, 2024)
603e4c6  Use vp8 version of the videos instead (abey79, Oct 14, 2024)
1c4fb44  testing small headers fix (gavrelina, Oct 15, 2024)
05c7805  super nits for data-out.md (zehiko, Oct 16, 2024)
d038133  missing comma (zehiko, Oct 16, 2024)
5ac3158  Update docs/content/getting-started/data-out/analyze-and-log.md (zehiko, Oct 16, 2024)
d1e1777  Update docs/content/getting-started/data-out.md (zehiko, Oct 16, 2024)
9af8e45  Update docs/content/getting-started/data-out/analyze-and-log.md (zehiko, Oct 16, 2024)
beeeb52  better framing for pandas popularity (zehiko, Oct 16, 2024)
be1fc83  nit wording update (zehiko, Oct 16, 2024)
07ccd02  Merge remote-tracking branch 'origin/main' into antoine/data-out-tuto… (Wumpf, Oct 16, 2024)
a774e4a  improve `Log the result back to the viewer` wording (Wumpf, Oct 16, 2024)
d6192d3  small fix on `Load the recording` (Wumpf, Oct 16, 2024)
3efb486  import np at the beginning of the data-out series (Wumpf, Oct 16, 2024)
0f0868f  put getting data out under data in (Wumpf, Oct 16, 2024)
8fc1153  pandas capitalization (Wumpf, Oct 16, 2024)
16 changes: 16 additions & 0 deletions docs/content/getting-started/data-out.md
@@ -0,0 +1,16 @@
---
title: Get data out of Rerun
order: 450
---

At its core, Rerun is a database. The viewer includes the [dataframe view](../reference/types/views/dataframe_view) to explore data in tabular form, and the SDK includes an API to export the data as dataframes from the recording. These features can be used, for example, to perform analysis on the data and log back the results to the original recording.

In this three-part guide, we explore such a workflow by implementing an "open jaw detector" on top of our [face tracking example](https://rerun.io/examples/video-image/face_tracking). This process is split into three steps:

1. [Explore a recording with the dataframe view](data-out/explore-as-dataframe)
2. [Export the dataframe](data-out/export-dataframe)
3. [Analyze the data and log the results](data-out/analyze-and-log)

Note: this guide uses the popular [Pandas](https://pandas.pydata.org) dataframe package. The same concepts apply to alternative dataframe packages such as [Polars](https://pola.rs).

If you just want to see the final result, jump to the [complete script](data-out/analyze-and-log.md#complete-script) at the end of the third section.
89 changes: 89 additions & 0 deletions docs/content/getting-started/data-out/analyze-and-log.md
@@ -0,0 +1,89 @@
---
title: Analyze the data and log the results
order: 3
---



In the previous sections, we explored our data and exported it to a Pandas dataframe. In this section, we will analyze the data to extract a "jaw open state" signal and log it back to the viewer.



## Analyze the data

We already identified that thresholding the `jawOpen` signal at 0.15 is all we need to produce a binary "jaw open state" signal.

In the [previous section](export-dataframe.md#inspect-the-dataframe), we prepared a flat, floating point column with the signal of interest called `"jawOpen"`. Let's add a boolean column to our Pandas dataframe to hold our jaw open state:

```python
df["jawOpenState"] = df["jawOpen"] > 0.15
```
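
As a quick sanity check, we can count how many samples fall in each state (a minimal sketch using standard Pandas):

```python
# Count open vs. closed samples; NaN rows (no face detected) compare
# as False against the threshold, so they land in the closed bucket.
print(df["jawOpenState"].value_counts())
```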


## Log the result back to the viewer

The first step is to initialize the logging SDK targeting the same recording we just analyzed.
This requires matching both the application ID and recording ID precisely.
By using the same identifiers, we're appending new data to an existing recording.
If the recording is currently open in the viewer (and it's listening for new connections), this approach enables us to seamlessly add the new data to the ongoing session.

```python
rr.init(
recording.application_id(),
recording_id=recording.recording_id(),
)
rr.connect()
```

_Note_: When automating data analysis, it is typically preferable to log the results to a distinct RRD file next to the source RRD (using `rr.save()`). In such a situation, it is also valid to use the same app ID and recording ID. This allows opening both the source and result RRDs in the viewer, which will display data from both files under the same recording.
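
As a sketch of that automated variant (the output file name is illustrative):

```python
# Sketch: write the analysis results to a distinct RRD file instead of
# connecting to a live viewer, reusing the source app and recording IDs.
rr.init(
    recording.application_id(),
    recording_id=recording.recording_id(),
)
rr.save("face_tracking_analysis.rrd")  # hypothetical file name
```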

We will log our jaw open state data in two forms:
1. As a standalone `Scalar` component, to hold the raw data.
2. As a `Text` component on the existing bounding box entity, such that we obtain a textual representation of the state in the visualization.

Here is how to log the data as a scalar:

```python
rr.send_columns(
"/jaw_open_state",
times=[rr.TimeSequenceColumn("frame_nr", df["frame_nr"])],
components=[
rr.components.ScalarBatch(df["jawOpenState"]),
],
)
```

We use the [`rr.send_columns()`](../../howto/send_columns.md) API to efficiently send the entire column of data in a single batch.

Next, let's log the same data as a `Text` component:

```python
target_entity = "/video/detector/faces/0/bbox"
rr.log_components(target_entity, [rr.components.ShowLabels(True)], static=True)
rr.send_columns(
target_entity,
times=[rr.TimeSequenceColumn("frame_nr", df["frame_nr"])],
components=[
rr.components.TextBatch(np.where(df["jawOpenState"], "OPEN", "CLOSE")),
],
)
```

Here we first log the [`ShowLabels`](../../reference/types/components/show_labels.md) component as static to enable the display of the label. Then, we use `rr.send_columns()` again to send an entire batch of text labels. We use [`np.where()`](https://numpy.org/doc/stable/reference/generated/numpy.where.html) to produce a label matching the state for each timestamp.

### Final result

With some adjustments to the viewer blueprint, we obtain the following result:

<video width="100%" autoplay loop muted controls>
<source src="https://static.rerun.io/getting-started-data-out/data-out-final-vp8.webm" type="video/webm" />
</video>

The OPEN/CLOSE label is displayed alongside the bounding box in the 2D view, and the `/jaw_open_state` signal is visible in both the timeseries and dataframe views.


### Complete script

Here is the complete script used by this guide to load data, analyze it, and log the result back:

snippet: tutorials/data_out
72 changes: 72 additions & 0 deletions docs/content/getting-started/data-out/explore-as-dataframe.md
@@ -0,0 +1,72 @@
---
title: Explore a recording with the dataframe view
order: 1
---




In this first part of the guide, we run the [face tracking example](https://rerun.io/examples/video-image/face_tracking) and explore the data in the viewer.

## Create a recording

The first step is to create a recording in the viewer using the face tracking example. Check the [face tracking installation instructions](https://rerun.io/examples/video-image/face_tracking#run-the-code) for more information on how to run this example.

Here is such a recording:

<video width="100%" autoplay loop muted controls>
<source src="https://static.rerun.io/getting-started-data-out/data-out-first-look-vp8.webm" type="video/webm" />
</video>

A person's face is visible and being tracked. Their jaws occasionally open and close. In the middle of the recording, the face is also temporarily hidden and no longer tracked.


## Explore the data

Amongst other things, the [MediaPipe Face Landmark](https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker) package used by the face tracking example outputs so-called blendshapes signals, which provide information on various aspects of the facial expression. These signals are logged under the `/blendshapes` root entity by the face tracking example.

One signal, `jawOpen` (logged under the `/blendshapes/0/jawOpen` entity as a [`Scalar`](../../reference/types/components/scalar.md) component), is of particular interest for our purpose. Let's inspect it further using a timeseries view:


<picture>
<img src="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/full.png" alt="">
<source media="(max-width: 480px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/480w.png">
<source media="(max-width: 768px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/768w.png">
<source media="(max-width: 1024px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/1024w.png">
<source media="(max-width: 1200px)" srcset="https://static.rerun.io/data-out-jaw-open-signal/258f5ffe043b8affcc54d5ea1bc864efe7403f2c/1200w.png">
</picture>

This signal indeed seems to jump from approximately 0.0 to 0.5 whenever the jaws are open. We also notice a discontinuity in the middle of the recording. This is due to the blendshapes being [`Clear`](../../reference/types/archetypes/clear.md)ed when no face is detected.

Let's create a dataframe view to further inspect the data:

<picture>
<img src="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/full.png" alt="">
<source media="(max-width: 480px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/480w.png">
<source media="(max-width: 768px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/768w.png">
<source media="(max-width: 1024px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/1024w.png">
<source media="(max-width: 1200px)" srcset="https://static.rerun.io/data-out-jaw-open-dataframe/bde18eb7b159e3ea1166a61e4a334eaedf2e04f8/1200w.png">
</picture>

Here is how this view is configured:
- Its content is set to `/blendshapes/0/jawOpen`. As a result, the table only contains columns pertaining to that entity (along with any timeline(s)). For this entity, a single column exists in the table, corresponding to the entity's single component (a `Scalar`).
  _Reviewer comment: we call it entity path filter right now. Saying content here makes it hard to find in the UI._

- The `frame_nr` timeline is used as index for the table. This means that the table will contain one row for each distinct value of `frame_nr` for which data is available.
- The rows can further be filtered by time range. In this case, we keep the default "infinite" boundaries, so no filtering is applied.
- The dataframe view has other advanced features which we are not using here, including filtering rows based on the existence of data for a given column, or filling empty cells with latest-at data.

_Reviewer comment: embedding code here for the blueprint would be amazing 💭_

<!-- TODO(#7499): add link to more information on filter-is-not-null and fill with latest-at -->
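
Following up on the comment above, here is a hypothetical sketch of what configuring such a view from code might look like. The `DataframeView`/`DataframeQuery` names and parameters are assumptions, not confirmed by this guide; check the blueprint reference for the exact API:

```python
import rerun as rr
import rerun.blueprint as rrb

# Hypothetical sketch: a dataframe view over the jawOpen entity,
# indexed on the `frame_nr` timeline. Names and parameters are assumptions.
blueprint = rrb.Blueprint(
    rrb.DataframeView(
        origin="/blendshapes/0/jawOpen",
        query=rrb.archetypes.DataframeQuery(timeline="frame_nr"),
    )
)
rr.send_blueprint(blueprint)
```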

Now, let's look at the actual data as represented in the above screenshot. At around frame #140, the jaws are open, and, accordingly, the `jawOpen` signal has values around 0.55. Shortly after, they close again and the signal decreases to below 0.1. Then, the signal becomes empty. This happens in rows corresponding to the period of time when the face cannot be tracked and all the signals are cleared.


## Next steps

Our exploration of the data in the viewer so far provided us with two important pieces of information for implementing the jaw open detector.

First, we identified that the `Scalar` value contained in `/blendshapes/0/jawOpen` contains relevant data. In particular, thresholding this signal with a value of 0.15 should provide us with a binary open/closed jaw state indicator.

Then, we explored the numerical data in a dataframe view. Importantly, the way we configured this view for our needs informs us how to query the recording from code so as to obtain the correct output.

<!-- TODO(#7462): improve the previous paragraph to mention copy-as-code instead -->

From there, our next step is to query the recording and extract the data as a Pandas dataframe in Python. This is covered in the [next section](export-dataframe.md) of this guide.
204 changes: 204 additions & 0 deletions docs/content/getting-started/data-out/export-dataframe.md
@@ -0,0 +1,204 @@
---
title: Export the dataframe
order: 2
---


In the [previous section](explore-as-dataframe.md), we explored some face tracking data using the dataframe view. In this section, we will see how we can use the dataframe API of the Rerun SDK to export the same data into a [Pandas](https://pandas.pydata.org) dataframe to further inspect and process it.

## Load the recording

The dataframe SDK loads data from an RRD file.
The first step is thus to save the recording as an RRD file, which can be done from the Rerun menu:

<picture style="zoom: 0.5">
<img src="https://static.rerun.io/save_recording/ece0f887428b1800a305a3e30faeb57fa3d77cd8/full.png" alt="">
<source media="(max-width: 480px)" srcset="https://static.rerun.io/save_recording/ece0f887428b1800a305a3e30faeb57fa3d77cd8/480w.png">
</picture>

We can then load the recording in a Python script as follows:

```python
import rerun as rr
import numpy as np # We'll need this later.

# load the recording
recording = rr.dataframe.load_recording("face_tracking.rrd")
```


## Query the data

Once we have loaded a recording, we can query it to extract some data. Here is how it is done:

```python
# query the recording into a pandas dataframe
view = recording.view(
index="frame_nr",
contents="/blendshapes/0/jawOpen"
)
table = view.select().read_all()
```

A lot is happening here, let's go step by step:
1. We first create a _view_ into the recording. The view specifies which index column we want to use (in this case the `"frame_nr"` timeline), and which other content we want to consider (here, only the `/blendshapes/0/jawOpen` entity). The view defines a subset of all the data contained in the recording, where each row has a unique value for the index and columns are filtered based on the value(s) provided as the `contents` argument.
2. A view can then be queried. Here we use the simplest possible form of querying by calling `select()`. No filtering is applied, and all view columns are selected. The result thus corresponds to the entire view.
3. The object returned by `select()` is a [`pyarrow.RecordBatchReader`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html). This is essentially an iterator that returns the stream of [`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow-recordbatch)es containing the query data (see the sketch after this list).
4. Finally, we use the [`pyarrow.RecordBatchReader.read_all()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_all) function to read all record batches as a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).
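
For very large query results, the record batches can also be consumed one at a time instead of being read all at once. A minimal sketch of that streaming variant:

```python
# Stream the query results batch by batch instead of materializing
# the whole table with read_all().
for batch in view.select():  # each item is a pyarrow.RecordBatch
    print(batch.num_rows)
```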

**Note**: queries can be further narrowed by filtering rows and/or selecting a subset of the view columns. See the reference documentation for more information.

<!-- TODO(#7499): add a link to the reference documentation -->

Let's have a look at the resulting table:

```python
print(table)
```

Here is the result:
```
pyarrow.Table
frame_nr: int64
frame_time: timestamp[ns]
log_tick: int64
log_time: timestamp[ns]
/blendshapes/0/jawOpen:Scalar: list<item: double>
child 0, item: double
----
frame_nr: [[0],[1],...,[412],[413]]
frame_time: [[1970-01-01 00:00:00.000000000],[1970-01-01 00:00:00.040000000],...,[1970-01-01 00:00:16.480000000],[1970-01-01 00:00:16.520000000]]
log_tick: [[34],[92],...,[22077],[22135]]
log_time: [[2024-10-13 08:26:46.819571000],[2024-10-13 08:26:46.866358000],...,[2024-10-13 08:27:01.722971000],[2024-10-13 08:27:01.757358000]]
/blendshapes/0/jawOpen:Scalar: [[[0.03306490555405617]],[[0.03812221810221672]],...,[[0.06996039301156998]],[[0.07366073131561279]]]
```

Again, this is a [PyArrow](https://arrow.apache.org/docs/python/index.html) table which contains the result of our query. Further exploring Arrow structures is beyond the scope of this guide. Yet, it is a reminder that Rerun natively stores (and returns) data in Arrow format. As such, it efficiently interoperates with other Arrow-native and/or compatible tools such as [Polars](https://pola.rs) or [DuckDB](https://duckdb.org).
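
For example, the same table can be handed directly to Polars (a minimal sketch, assuming the `polars` package is installed):

```python
import polars as pl

# Convert the Arrow table to a Polars dataframe (zero-copy where possible).
df_polars = pl.from_arrow(table)
print(df_polars.schema)
```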


## Create a Pandas dataframe

Before exploring the data further, let's convert the table to a Pandas dataframe:

```python
df = table.to_pandas()
```

Alternatively, the dataframe can be created directly, without using the intermediate PyArrow table:

```python
df = view.select().read_pandas()
```


## Inspect the dataframe

Let's have a first look at this dataframe:

```python
print(df)
```

Here is the result:

<!-- NOLINT_START -->

```
frame_nr frame_time log_tick log_time /blendshapes/0/jawOpen:Scalar
0 0 1970-01-01 00:00:00.000 34 2024-10-13 08:26:46.819571 [0.03306490555405617]
1 1 1970-01-01 00:00:00.040 92 2024-10-13 08:26:46.866358 [0.03812221810221672]
2 2 1970-01-01 00:00:00.080 150 2024-10-13 08:26:46.899699 [0.027743922546505928]
3 3 1970-01-01 00:00:00.120 208 2024-10-13 08:26:46.934704 [0.024137917906045914]
4 4 1970-01-01 00:00:00.160 266 2024-10-13 08:26:46.967762 [0.022867577150464058]
.. ... ... ... ... ...
409 409 1970-01-01 00:00:16.360 21903 2024-10-13 08:27:01.619732 [0.07283800840377808]
410 410 1970-01-01 00:00:16.400 21961 2024-10-13 08:27:01.656455 [0.07037288695573807]
411 411 1970-01-01 00:00:16.440 22019 2024-10-13 08:27:01.689784 [0.07556036114692688]
412 412 1970-01-01 00:00:16.480 22077 2024-10-13 08:27:01.722971 [0.06996039301156998]
413 413 1970-01-01 00:00:16.520 22135 2024-10-13 08:27:01.757358 [0.07366073131561279]

[414 rows x 5 columns]
```

<!-- NOLINT_END -->

We can make several observations from this output.

- The first four columns are timeline columns. These are the various timelines the data is logged to in this recording.
- The last column is named `/blendshapes/0/jawOpen:Scalar`. This is what we call a _component column_, and it corresponds to the [Scalar](../../reference/types/components/scalar.md) component logged to the `/blendshapes/0/jawOpen` entity.
- Each row in the `/blendshapes/0/jawOpen:Scalar` column consists of a _list_ of (typically one) scalar.

This last point may come as a surprise but is a consequence of Rerun's data model, where components are always stored as arrays. This makes it possible, for example, to log an entire point cloud using the [`Points3D`](../../reference/types/archetypes/points3d.md) archetype under a single entity and at a single timestamp.
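
To illustrate, here is a hypothetical snippet logging an entire point cloud under a single entity and timestamp (the entity path and data are made up):

```python
import numpy as np
import rerun as rr

# One thousand 3D points stored under a single entity at a single timestamp.
positions = np.random.rand(1000, 3)
rr.log("point_cloud", rr.Points3D(positions))  # assumes rr.init() was called
```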
_Reviewer comment: excellent clarification at just the right moment._


Let's explore this further, recalling that, in our recording, no face was detected at around frame #170:

```python
print(df["/blendshapes/0/jawOpen:Scalar"][160:180])
```

Here is the result:

```
160 [0.0397215373814106]
161 [0.037685077637434006]
162 [0.0402931347489357]
163 [0.04329492896795273]
164 [0.0394592322409153]
165 [0.020853394642472267]
166 []
167 []
168 []
169 []
170 []
171 []
172 []
173 []
174 []
175 []
176 []
177 []
178 []
179 []
Name: /blendshapes/0/jawOpen:Scalar, dtype: object
```

We note that the data contains empty lists when no face is detected: when the blendshapes entities are [`Clear`](../../reference/types/archetypes/clear.md)ed, the corresponding timestamps, and all further timestamps until a new value is logged, contain no data.

While this data representation is in general useful, a flat floating point representation with NaN for missing values is typically more convenient for scalar data. This is achieved using the [`explode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) method:

```python
df["jawOpen"] = df["/blendshapes/0/jawOpen:Scalar"].explode().astype(float)
print(df["jawOpen"][160:180])
```
Here is the result:
```
160 0.039722
161 0.037685
162 0.040293
163 0.043295
164 0.039459
165 0.020853
166 NaN
167 NaN
168 NaN
169 NaN
170 NaN
171 NaN
172 NaN
173 NaN
174 NaN
175 NaN
176 NaN
177 NaN
178 NaN
179 NaN
Name: jawOpen, dtype: float64
```

This confirms that the newly created `"jawOpen"` column now contains regular, 64-bit float numbers, and missing values are represented by NaNs.

_Note_: should you want to filter out the NaNs, you may use the [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method.
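
For example (a minimal sketch):

```python
# Keep only the rows where the jaw open signal is present.
df_valid = df.dropna(subset=["jawOpen"])
```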

## Next steps

With this, we are ready to analyze the data and log back the result to the Rerun viewer, which is covered in the [next section](analyze-and-log.md) of this guide.
2 changes: 1 addition & 1 deletion docs/content/getting-started/troubleshooting.md
@@ -1,6 +1,6 @@
---
title: Troubleshooting
order: 600
order: 800
---

You can set `RUST_LOG=debug` before running to get some verbose logging output.