Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement appendable tabular data #679

Merged
merged 18 commits into from
Mar 15, 2024
Merged

Conversation

skarakuzu
Copy link
Contributor

@skarakuzu skarakuzu commented Mar 4, 2024

work in progress implementing append data to file

@skarakuzu skarakuzu marked this pull request as draft March 4, 2024 17:56
@danielballan danielballan changed the title Write as csv Implement appendable tabular data Mar 5, 2024
@danielballan
Copy link
Member

How can we make the CSVAdapter work with the register_single_item code?

adapter_factory = settings.adapters_by_mimetype[mimetype]
logger.info(" Resolved mimetype '%s' with adapter for '%s'", mimetype, item)
try:
adapter = await anyio.to_thread.run_sync(adapter_factory, ensure_uri(item))
except Exception:
logger.exception(" SKIPPED: Error constructing adapter for '%s':", item)
return
key = settings.key_from_filename(item.name)
return await create_node_or_drop_collision(
node,
key=key,
structure_family=adapter.structure_family,
metadata=dict(adapter.metadata()),
specs=adapter.specs,
data_sources=[
DataSource(
structure_family=adapter.structure_family,
mimetype=mimetype,
structure=dict_or_none(adapter.structure()),
parameters={},
management=Management.external,
assets=[
Asset(
data_uri=ensure_uri(item),
is_directory=is_directory,
parameter="data_uri",
)
],
)
],
)

This code typically operates with adapters that expect a single parameter data_uri that takes one path. But we want CSVAdapter to work, in the general case, for data_uris, a list of partitioned files. How can we manage that?

Notice that in the code above, we construct an instance of the adapter

adapter = await anyio.to_thread.run_sync(adapter_factory, ensure_uri(item))

and then we ask adapter to describe aspects of this data.... adapter.structure_family, adapter.metadata(), adapter.structure(), and so on. I think we should extend this same pattern: we should ask the adapter to write the whole data_sources list for us, since it knows the specific parameters it expects. Thus, we should change:

    return await create_node_or_drop_collision(
        ...,
        data_sources=[
            DataSource(
                structure_family=adapter.structure_family,
                mimetype=mimetype,
                structure=dict_or_none(adapter.structure()),
                parameters={},
                management=Management.external,
                assets=[
                    Asset(
                        data_uri=ensure_uri(item),
                        is_directory=is_directory,
                        parameter="data_uri",
                    )
                ],
            )
        ],
    )

to

    if hasattr(adapter, "generate_data_sources"):
        # Let the Adapter describe the DataSouce(s).
        data_sources = adapter.generate_data_sources()
    else:
        # Back-compat: Assume one Asset passed as a
        # parameter named 'data_uri'.
        data_source = [
            DataSource(
                structure_family=adapter.structure_family,
                mimetype=mimetype,
                structure=dict_or_none(adapter.structure()),
                parameters={},
                management=Management.external,
                assets=[
                    Asset(
                        data_uri=ensure_uri(item),
                        is_directory=is_directory,
                        parameter="data_uri",
                    )
                ],
            )
        ],
    return await create_node_or_drop_collision(
        ...,
        data_sources=data_sources,
    )

Then, a new method CSVAdapter.generate_data_sources() can be defined, which returns:

[
            DataSource(
                structure_family=adapter.structure_family,
                mimetype=mimetype,
                structure=dict_or_none(adapter.structure()),
                parameters={},
                management=Management.external,
                assets=[
                    Asset(
                        data_uri=ensure_uri(item),
                        is_directory=is_directory,
                        parameter="data_uris",  # <-- PLURAL!
                        num=0  # <-- denoting that the Adapter expects a list, and this is the first element
                    )
                ],
            )
        ]

@padraic-shafer
Copy link
Contributor

Naive question: Why does CsvAdapter use multiple data sources?

Does the strategy include writing appended rows to a new file? Or is this "chunking" of columns? or...?

@danielballan
Copy link
Member

This is chunking of partitions:

# partition-0.csv
1, 10.0, a
2, 20.0 b
3, 30.0, c
# partition-1.csv
4, 4.0, d
5, 5.0, e
6, 6.0, f

When Tiled wrties CSVs, it adopts this strategy for scability. But when CSV registers CSVs may encounter partitioned or non-partitioned files and should be able to manage both, potentially.

Appending extends a specific partition, i.e. adds rows for an existing file. New partitions may also be created as needed.

@danielballan
Copy link
Member

Rebased on #686. The second force-push was to retain @skarakuzu's authorship is retained in the commit history.

@danielballan danielballan marked this pull request as ready for review March 15, 2024 16:30
@danielballan danielballan merged commit 428834d into bluesky:main Mar 15, 2024
6 checks passed
hyperrealist pushed a commit to hyperrealist/tiled that referenced this pull request Mar 18, 2024
* append method in the client side

* Support writing tables as CSV.

* implemented appending to files

* fix bug in a test

* Add ability to parse file structure in .csvAdapter 📂

* fix some more bugs

* fix some more tests

* cleaned the files

* fixed black and flake errors

* implemented append method in the client side

* implemented append method in the client side

* added an integration test for the new append method

* Added a CHANGELOG file

* changed the append test to parametrized

* fixed failing string test

* fix append strng test

* changed the type check in assert_frame_equal

* Elaborate on changelog.

---------

Co-authored-by: Seher Karakuzu <[email protected]>
Co-authored-by: Dan Allan <[email protected]>
Co-authored-by: Seher Karakuzu <[email protected]>
Co-authored-by: kari Barry <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants