Releases: fako/datagrowth
Workshop version
A version that enables a playground for ideas originating from an AI workshop.
This update is the first Datagrowth version that includes the `DatasetVersion` model.
Adopting that model can be a steep change compared to a current implementation.
However, it's not required to implement Datagrowth's `DatasetVersion` to update to v0.20.
Instead you can run your own `DatasetVersion`, which should implement `influence`,
or set the `dataset_version` attribute to `None` for `Collection` and `Document`
if you don't want to use any `DatasetVersion`.
Other important changes:
- The minimum Celery version is now 5.x.
- The minimum jsonschema version is now 4.20.0, but the jsonschema draft version remains 4.
- The `global_pipeline_app_label` and `global_pipeline_models` configurations have been renamed to `global_datatypes_app_label` and `global_datatype_models`.
- The `extractor`, `depends_on`, `to_property` and `apply_to_resource` configurations are now part of the `growth_processor` namespace.
- The `batch_size` setting is now part of the default global configuration namespace.
- The `async` configuration will no longer get patched to `asynchronous` to be compatible with Python >= 3.7. Instead supply `asynchronous` directly and replace all `async` occurrences.
- The `load_config` decorator no longer accepts default values. Use `register_defaults` instead.
- When using `ConfigurationType.supplement`, default values are now ignored when determining if values exist.
- The `pipeline` attribute gets replaced by the `task_results` attribute for `Document`, `Collection` and `DatasetVersion`.
- When writing contributions to `Documents` the default field is now `derivatives`. Furthermore a key equal to the `growth_phase` is automatically added to the `derivatives` dictionary. The value for this key is initially an empty dictionary. Any `to_property` configuration will write to this dictionary. Otherwise contributions get merged into the dictionary. It's still possible to write to `properties` without adding special `growth_phase` keys for backward compatibility.
- Contributions to `Documents` gathered through `ExtractProcessor.pass_resource_through` may consist of simple values. If `to_property` is set these values will be available under that property. Otherwise the simple values get added to a dictionary with one "value" key and this dictionary gets merged like normal.
- If `ResourceGrowthProcessor` encounters multiple `Resources` per `Document`, or if a single `Resource` yields multiple results, then the `reduce_contributions` method will be called to determine how contribution data from `Resources` should complement `Document` data. The default is to only use the first result that comes from `Resources`, in order to be backward compatible.
- The `Resource` class now exposes `validate_input` to override in child classes for input validation. This validation strategy will replace JSONSchema based validation for performance reasons in the future.
- Adds a `TestClientResource` that allows creating `Resources` that connect to Django views which return test data. This is especially useful when testing Datagrowth components that take `HttpResources` as arguments.
- Importing `DataStorage` from `datagrowth.datatypes.documents.db.base` has to be replaced with importing from `datagrowth.datatypes.storage`.
- The `DataStorages` dataclass has been added to manage typing for dynamically loaded `DataStorage` models.
- The `DatasetVersion.task_definitions` field holds dictionaries per `DataStorage` model that specify which tasks should run for which model.
- The `DatasetVersion.errors` field has a `seeding` and a `tasks` field where some basic error information is kept for debugging purposes.
- A `DatasetVersion` will influence its `Collections` and `Documents`. `Collections` may set `DatasetVersion` for `Documents` and facilitate `DatasetVersion` influence for them.
- Task definitions given to `DatasetVersion` propagate to `Collection` and `Document` through the influence method.
- The `Dataset.create_dataset_version` method will create a non-pending `DatasetVersion` with the default `GROWTH_STRATEGY` and `DatasetVersion.tasks` set. It also creates a default non-pending `Collection` with `Collection.tasks` set. Customize defaults by setting `DOCUMENT_TASKS`, `COLLECTION_TASKS`, `DATASET_VERSION_TASKS`, `COLLECTION_IDENTIFIER`, `COLLECTION_REFEREE` and `DATASET_VERSION_MODEL`. Or override `Dataset.get_collection_factories`, `Dataset.get_seeding_factories` and/or `Dataset.get_task_definitions` for more control.
- `Document.invalidate_task` will now always set the `pending_at` and `finished_at` attributes, regardless of whether tasks have run before.
- The `content` of a `Document` now contains output from `derivatives` through `Document.get_derivatives_content`.
- Calling `validate_pending_data_storages` may now update `DatasetVersion.is_current` and `DatasetVersion.errors`.
- Commands inheriting from `DatasetCommand` that expect `Community` compliant objects should set `cast_as_community` to `True` on the command class and rename `handle_dataset` to `handle_community`.
- Unlike the legacy `Community` model, a `Dataset` has a unique signature. If the signature of a `Dataset` matches an existing `Dataset`, the `growth` method will create a new `DatasetVersion` instead of a different `Dataset`.
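The `derivatives` bookkeeping described in the notes above can be sketched with plain dictionary logic. This is an illustrative sketch only, not datagrowth's actual implementation; the function name `write_contribution` is hypothetical.

```python
def write_contribution(document, growth_phase, contribution, to_property=None):
    # A key equal to the growth_phase is automatically added to the
    # derivatives dictionary, starting out as an empty dictionary.
    phase_data = document.setdefault("derivatives", {}).setdefault(growth_phase, {})
    if to_property:
        # A to_property configuration writes under that property.
        phase_data[to_property] = contribution
    elif not isinstance(contribution, dict):
        # Simple values get added under a single "value" key.
        phase_data["value"] = contribution
    else:
        # Otherwise contributions get merged into the phase dictionary.
        phase_data.update(contribution)
    return document


doc = {"properties": {"title": "example"}}  # properties stay untouched
write_contribution(doc, "tika", {"text": "Hello"})
write_contribution(doc, "tika", 200, to_property="status")
# doc["derivatives"] is now {"tika": {"text": "Hello", "status": 200}}
```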
Workshop version (prerelease)
A first version that enables a playground for ideas originating from an AI workshop.
Resource iterators
This version allows for the use of `Resource` iterators, which enables applications to retrieve and process `Resources` using generators instead of loading everything in memory. To make optimal use of this feature `Collection` also exposes an iterative interface to add and update `Documents`.
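The iterator based approach described above follows a general generator pipeline pattern, sketched here with plain Python stand-ins. The names `fetch_resources` and `iterate_content` are hypothetical and only illustrate the idea behind datagrowth's send and content iterators.

```python
def fetch_resources(urls):
    # Stand-in for a send iterator: yields one resource at a time
    # instead of collecting every response in memory first.
    for url in urls:
        yield {"url": url, "content": url.upper()}  # pretend network fetch


def iterate_content(resources):
    # Stand-in for a content iterator: lazily extracts content
    # from each resource as it arrives.
    for resource in resources:
        yield resource["content"]


# Nothing is fetched until the pipeline gets consumed, one item at a time.
pipeline = iterate_content(fetch_resources(["a", "b", "c"]))
```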
- Adds support for Python 3.12.
- Doesn't specify a specific parser for BeautifulSoup when loading XML content. BeautifulSoup warns against using Datagrowth's previous default parser (lxml) for XML parsing as it is less reliable.
- Allows `ExtractProcessor` to extract data using a generator function for the "@" objective. This can be useful to extract from nested data structures.
- Provides a `send_iterator` generator that initiates and sends an `HttpResource` as well as any subsequent `HttpResources`. This generator allows you to do something with in-between results when fetching the data.
- Provides a `send_serie_iterator` generator which acts like the `send_iterator` except it can perform multiple send calls.
- Provides a `content_iterator` generator that, given a `send_iterator` or `send_serie_iterator`, will extract the content from generated `HttpResources` using a given objective. This generator will also yield in-between results as extracted content.
- Adds `Collection.add_batches` and `Collection.update_batches`, which are variants on `Collection.add` and `Collection.update` that will return generators instead of adding/updating everything in-memory.
- `Collection.update`, `Collection.add`, `Collection.update_batches` and `Collection.add_batches` will check for equality between `Documents` before adding or updating. This makes it possible to skip inserts/updates in particular cases by overriding `Document.__eq__`. `Collection.add` and `Collection.add_batches` require input as a list for this to work, to prevent unexpected excessive memory usage.
- When using `Collection.add_batches` or `Collection.update_batches`, a `NO_MODIFICATION` object can be passed as the `modified_at` parameter to prevent updating `Collection.modified_at` with these (repeating) calls.
- Uses `Collection.document_update_fields` to determine which fields to update in `bulk_update` calls by `Collection`.
- Adds `Document.build` to support creating a `Document` from raw data.
- `Document.update` will now use properties as update data instead of content when given another `Document` as the data argument.
- Deprecates `Collection.init_document` in favour of `Collection.build_document` for consistency in naming.
- `Document.output_from_content` will now return lists instead of mapping generators when given multiple arguments. The convenience of lists is more important here than the memory footprint, which will be minimal anyway.
- Makes `Document.output_from_content` pass along content if values are not a JSON path.
- Allows `Document.output_from_content` to use different starting characters for replacement JSON paths.
- `ConfigurationField.contribute_to_class` will first call `TextField.contribute_to_class` before setting `ConfigurationProperty` upon the class.
- Removes the validate parameter from `Collection.add`, `Collection.update` and `Document.update`.
- Moved the `load_session` decorator into `datagrowth.resources.http`.
- Moved the `get_resource_link` function into `datagrowth.resources.http`.
- Sets the default batch size to a smaller 100 elements per batch and `Collection.update` now respects this default.
- Removes implicit Indico and Wizenoze API key loading.
- Corrects log names to "datagrowth" instead of "datascope".
- Adds a `copy_dataset` command that will copy a dataset by signature.
- The `async` configuration has been removed from the settings file.
- A `resource_exception_log_level` setting now controls at what level `DGResourceExceptions` will get logged.
- Additionally `resource_exception_reraise` now controls whether `DGResourceExceptions` get reraised.
- The fallback for `JSONField` imports from `django.contrib.postgres.fields` has been removed.
- Adds the `global_allow_redirects` configuration which controls how the requests library will handle redirects. Defaults to `True` even for "head" requests.
- Exposes `ProcessorFactory` and `DataStorageFactory` to easily build processors and datatypes in the future.
- Adds the `Collection.reload_document_ids` method to be able to load `Document.id` after `bulk_create`.
- For consistent `Resource` serialization adds `serialize_resources` and `update_serialized_resources`.
- Experimental support for `ResourceFixturesMixin` that can be used to load resource content through fixture files.
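The equality check that the add/update methods perform before writing can be illustrated with a minimal sketch. These `Document` and `add_documents` definitions are simplified stand-ins, not datagrowth's actual Django models.

```python
class Document:
    # Minimal stand-in: real datagrowth Documents are Django models.
    def __init__(self, properties):
        self.properties = properties

    def __eq__(self, other):
        # Overriding __eq__ decides whether an insert/update gets skipped.
        return self.properties == other.properties


def add_documents(collection, candidates):
    # Mimics the described behaviour: candidates equal to an existing
    # Document are skipped; input is a list, not a generator.
    added = []
    for candidate in candidates:
        if candidate not in collection:  # uses Document.__eq__
            collection.append(candidate)
            added.append(candidate)
    return added
```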
Python 3.11 and Django 4.2
Updates the package to support Python 3.11 and Django 4.2.
Python 3.10
- Adds support for Python 3.10 and drops support for Python 3.6.
- Uses the html.parser instead of html5lib parser when parsing HTML pages.
- Fetches the last `Resource` when retrieving from cache to prevent `MultipleObjectsReturned` exceptions in async environments.
- Allows PUT as an `HttpResource` send method.
Django 3.2
Updates the package to support Django 3.2 features. It further supports Document and Collection models, which are now unit tested.
These are the breaking changes this release:
- It's recommended to update to Django 3.2 before using Datagrowth 0.17.
- Note that a Django migration is required to make Datagrowth 0.17 work.
- Drops support for Django 1.11.
- MySQL backends are no longer supported with Django versions below 3.2.
- Schemas on `Document` and `Collection` are removed as their usage is not recommended. Consider working schemaless when using these `DataStorage` derivative classes.
- As schemas are no longer available for `DataStorage` derivative classes, all write functionality from the default `DataStorage` API views is removed.
- `DataStorage` API URL patterns now require app labels as namespaces to prevent ambiguity.
- The API version can be specified using the `DATAGROWTH_API_VERSION` setting.
- `DataStorage.update` is reintroduced because of potential performance benefits.
- `Document.update` no longer takes first values from iterators given to it.
- `Collection.update` no longer accepts a single dict or `Document` for updating. It also works using lookups from `JSONField` instead of the inferior `reference` mechanic.
- `DataStorage.url` now provides a generic way to build URLs for `Collection` and `Document`. These URLs will expect URL patterns to exist with names in the format: `v<api-version>:<app-name>:<model-name>-content`. This replaces the old formats which were less flexible: `v1:<app-name>:collection-content` and `v1:<app-name>:document-content`.
- `HttpResource` will use `django.contrib.postgres.fields.JSONField` or `django.db.models.JSONField` for the `request` and `head` fields.
- `ShellResource` will use `django.contrib.postgres.fields.JSONField` or `django.db.models.JSONField` for the `command` field.
- The resources and datatypes modules now each have an admin module to import `AdminModels` easily.
- `ConfigurationProperty` now uses a simpler constructor and allows defaults for all arguments.
- Removes the unused `global_token` default configuration.
- Removes the unused `http_resource_batch_size` default configuration.
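The URL pattern name format described above can be shown with a small sketch; `content_url_name` is a hypothetical helper for illustration, not part of Datagrowth's API.

```python
def content_url_name(api_version, app_name, model_name):
    # Builds a pattern name in the v<api-version>:<app-name>:<model-name>-content format.
    return f"v{api_version}:{app_name}:{model_name}-content"


content_url_name(1, "sources", "collection")  # "v1:sources:collection-content"
```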
Python 3.8
A minor update that drops support for Python 3.5 and adds support for Python 3.8.
It also prepares some updates that are coming up in the near future.
Datagrowth (package)
The first release of the Datagrowth package to be installed in projects.
After copy pasting this code a few times across projects it was time for a package to make maintenance a lot easier.
This version contains fully functioning, tested and documented Resources & Configuration classes,
as well as some more experimental code that is to be released in full at a later date.
Below are the breaking changes that occur with this release:
- Renamed exceptions that are prefixed with DS to names prefixed with DG. This migrates Datascope exceptions to Datagrowth exceptions. Affected exceptions: `DSNoContent`, `DSHttpError403LimitExceeded`, `DSHttpError400NoToken`, `DSHttpWarning300` and `DSInvalidResource`.
- `batchize` used to be a function that returned batches and possibly a leftover batch. Now `ibatch` creates batches internally.
- `reach` no longer accepts paths not starting with `$`.
- Collection serializers do not include their content by default any more. Add it yourself by appending to `default_fields` or use the collection-content endpoint.
- A `google_cx` config value is no longer provided by default. It should come from the `GOOGLE_CX` setting in your settings file.
- The `register_config_defaults` alias is no longer available. Use `register_defaults` directly.
- The `MOCK_CONFIGURATION` alias is no longer available. Omit the configuration altogether and use `register_defaults`.
- Passing a default configuration to `load_config` is deprecated. Use `register_defaults` instead.
- `ExtractProcessor` now raises `DGNoContent`.
- `fetch_only` renamed to `cache_only`.
- Non-existing resources will now raise a `DGResourceDoesNotExist` if `cache_only` is `True`.
- The `meta` property is removed from `Resource`. Use the `variables` method instead.
- All data hashes will be invalidated, because the hasher now sorts keys.
- `schema` is allowed to be empty on `DataStorage`, which means there will be no validation by default. This is recommended, but requires migrations for some projects.
- `_handle_errors` has been renamed to `handle_errors` and is an explicit candidate for overriding.
- `_update_from_response` has been renamed to `_update_from_results` for a more consistent `Resource` API.
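The shift from `batchize` to `ibatch` boils down to creating batches inside a generator, so the leftover items become a final smaller batch instead of a separate return value. A minimal sketch of that idea (illustrative only, not the actual datagrowth implementation):

```python
from itertools import islice

def ibatch(iterable, batch_size):
    # Yields lists of at most batch_size items; leftover items form
    # a final, smaller batch instead of being returned separately.
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```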
Datagrowth (prerelease)
The first release of the Datagrowth package to be installed in projects.
After copy pasting this code a few times across projects it was time for a package to make maintenance a lot easier.
This version contains fully functioning, tested and documented Resources & Configuration classes,
as well as some more experimental code that is to be released in full at a later date.