We try to be backwards compatible as much as we can, but this file keeps track of breaking changes in the Datagrowth package.
Under each version number you'll find a section that describes the breakages you may expect when upgrading from lower versions.
## v0.20

This update is the first Datagrowth version that includes the `DatasetVersion` model.
Adopting that model can be a steep change compared to a current implementation.
However it's not required to implement Datagrowth's `DatasetVersion` to update to v0.20.
Instead you can run your own `DatasetVersion`, which should implement `influence`,
or set the `dataset_version` attribute to `None` for `Collection` and `Document`
if you don't want to use any `DatasetVersion`.
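If you want to opt out of `DatasetVersion` entirely, the paragraph above suggests nulling the `dataset_version` attribute. A minimal sketch of what that could look like, assuming your concrete models inherit from Datagrowth's abstract base classes (the `CollectionBase` and `DocumentBase` names and the import path are assumptions, check the version you have installed):

```python
from datagrowth.datatypes import CollectionBase, DocumentBase


class Collection(CollectionBase):
    # Opt out of DatasetVersion influence for this model.
    dataset_version = None


class Document(DocumentBase):
    dataset_version = None
```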
- Minimal version for Celery is now 5.x.
- Minimal version for jsonschema is now 4.20.0, but jsonschema draft version remains 4.
- `global_pipeline_app_label` and `global_pipeline_models` configurations have been renamed to `global_datatypes_app_label` and `global_datatype_models`.
- The `extractor`, `depends_on`, `to_property` and `apply_to_resource` configurations are now part of the `growth_processor` namespace.
- The `batch_size` setting is now part of the default global configuration namespace.
- The `async` configuration will no longer get patched to `asynchronous` to be compatible with Python >= 3.7. Instead supply `asynchronous` directly and replace all `async` occurrences.
- The `load_config` decorator no longer accepts default values. Use `register_defaults` instead.
- When using `ConfigurationType.supplement` default values are now ignored when determining if values exist.
- The `pipeline` attribute gets replaced by the `task_results` attribute for `Document`, `Collection` and `DatasetVersion`.
- When writing contributions to `Documents` the default field is now `derivatives`. Furthermore a key equal to the `growth_phase` is automatically added to the `derivatives` dictionary. The value for this key starts as an empty dictionary. Any `to_property` configuration will write to this dictionary. Otherwise contributions get merged into the dictionary. It's still possible to write to `properties` without adding special `growth_phase` keys for backward compatibility.
- Contributions to `Documents` gathered through `ExtractProcessor.pass_resource_through` may consist of simple values. If `to_property` is set these values will be available under that property. Otherwise the simple values get added to a dictionary with a single "value" key and this dictionary gets merged like normal.
- If a `ResourceGrowthProcessor` encounters multiple `Resources` per `Document`, or if a single `Resource` yields multiple results, then the `reduce_contributions` method will be called to determine how contribution data from `Resources` should complement `Document` data. The default is to only use the first result that comes from the `Resources`, in order to be backward compatible.
- The `Resource` class now exposes `validate_input` to override in child classes for input validation. This validation strategy will replace JSONSchema based validation for performance reasons in the future.
- Adds a `TestClientResource` that allows creating `Resources` that connect to Django views which return test data. Especially useful when testing Datagrowth components that take `HttpResources` as arguments.
- Importing `DataStorage` from `datagrowth.datatypes.documents.db.base` has to be replaced with importing from `datagrowth.datatypes.storage`.
- The `DataStorages` dataclass has been added to manage typing for dynamically loaded `DataStorage` models.
- The `DatasetVersion.task_definitions` field holds dictionaries per `DataStorage` model that specify which tasks should run for which model.
- The `DatasetVersion.errors` field has a `seeding` and a `tasks` field where some basic error information is kept for debugging purposes.
- A `DatasetVersion` will influence its `Collections` and `Documents`. `Collections` may set the `DatasetVersion` for `Documents` and facilitate `DatasetVersion` influence for them.
- Task definitions given to a `DatasetVersion` propagate to `Collection` and `Document` through the influence method.
- The `Dataset.create_dataset_version` method will create a non-pending `DatasetVersion` with the default `GROWTH_STRATEGY` and `DatasetVersion.tasks` set. It also creates a default non-pending `Collection` with `Collection.tasks` set. Customize defaults by setting `DOCUMENT_TASKS`, `COLLECTION_TASKS`, `DATASET_VERSION_TASKS`, `COLLECTION_IDENTIFIER`, `COLLECTION_REFEREE` and `DATASET_VERSION_MODEL`. Or override `Dataset.get_collection_factories`, `Dataset.get_seeding_factories` and/or `Dataset.get_task_definitions` for more control.
- `Document.invalidate_task` will now always set the `pending_at` and `finished_at` attributes, regardless of whether tasks have run before.
- The `content` of a `Document` now contains output from `derivatives` through `Document.get_derivatives_content`.
- Calling `validate_pending_data_storages` may now update `DatasetVersion.is_current` and `DatasetVersion.errors`.
- Commands inheriting from `DatasetCommand` that expect `Community` compliant objects should set `cast_as_community` to `True` on the command class and rename `handle_dataset` to `handle_community`.
- Unlike the legacy `Community` model a `Dataset` has a unique signature. If the signature of a `Dataset` matches an existing `Dataset`, the `growth` method will create a new `DatasetVersion` instead of a different `Dataset`.
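The `derivatives` write rules described above can be summarised with a short sketch. This is illustrative code only, not Datagrowth's implementation; the `merge_contribution` helper and its signature are made up for this example:

```python
# Illustrative sketch of the derivatives write rules; merge_contribution
# is a hypothetical helper, not part of Datagrowth itself.
def merge_contribution(derivatives, growth_phase, contribution, to_property=None):
    # A key equal to the growth_phase is automatically added to derivatives,
    # starting out as an empty dictionary.
    phase = derivatives.setdefault(growth_phase, {})
    if to_property:
        # With to_property set, the contribution lands under that property.
        phase[to_property] = contribution
    elif isinstance(contribution, dict):
        # Dictionary contributions get merged into the phase dictionary.
        phase.update(contribution)
    else:
        # Simple values get wrapped in a dictionary with a single "value" key.
        phase.update({"value": contribution})
    return derivatives


derivatives = {}
merge_contribution(derivatives, "extract", {"title": "Hello"})
merge_contribution(derivatives, "score", 0.75, to_property="relevance")
# derivatives == {"extract": {"title": "Hello"}, "score": {"relevance": 0.75}}
```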

## v0.19

- Adds support for Python 3.11, Python 3.12 and Django 4.2.
- Doesn't specify a parser for BeautifulSoup when loading XML content. BeautifulSoup warns against using Datagrowth's previous default parser (lxml) for XML parsing as it is less reliable.
- Allows `ExtractProcessor` to extract data using a generator function for the "@" objective. This can be useful to extract from nested data structures.
- Provides a `send_iterator` generator that initiates and sends a `HttpResource` as well as any subsequent `HttpResources`. This generator allows you to do something with in-between results when fetching the data.
- Provides a `send_serie_iterator` generator which acts like the `send_iterator` except it can perform multiple send calls.
- Provides a `content_iterator` generator that, given a `send_iterator` or `send_serie_iterator`, will extract the content from generated `HttpResources` using a given objective. This generator will also yield in-between results as extracted content.
- Adds `Collection.add_batches` and `Collection.update_batches`, which are variants of `Collection.add` and `Collection.update` that return generators instead of adding/updating everything in-memory.
- `Collection.update`, `Collection.add`, `Collection.update_batches` and `Collection.add_batches` will check for equality between `Documents` before adding or updating. This makes it possible to skip inserts/updates in particular cases by overriding `Document.__eq__`. `Collection.add` and `Collection.add_batches` require input as a list for this to work, to prevent unexpected excessive memory usage.
- When using `Collection.add_batches` or `Collection.update_batches` a `NO_MODIFICATION` object can be passed as the `modified_at` parameter to prevent updating `Collection.modified_at` with these (repeating) calls.
- `Collection.add_batches` will copy the `task_results` and `derivatives` fields from input `Documents` if they exist.
- Uses `Collection.document_update_fields` to determine which fields to update in `bulk_update` calls by `Collection`.
- Adds `Document.build` to support creating a `Document` from raw data.
- `Document.update` will now use properties as update data instead of content when given another `Document` as the data argument.
- Deprecates `Collection.init_document` in favour of `Collection.build_document` for consistency in naming.
- `Document.output_from_content` will now return lists instead of mapping generators when given multiple arguments. The convenience of lists is more important here than the memory footprint, which will be minimal anyway.
- Makes `Document.output_from_content` pass along content if values are not a JSON path.
- Allows `Document.output_from_content` to use different starting characters for replacement JSON paths.
- `ConfigurationField.contribute_to_class` will first call `TextField.contribute_to_class` before setting `ConfigurationProperty` upon the class.
- Removes the validate parameter from `Collection.add`, `Collection.update` and `Document.update`.
- Moved the `load_session` decorator into `datagrowth.resources.http`.
- Moved the `get_resource_link` function into `datagrowth.resources.http`.
- Sets the default batch size to a smaller 100 elements per batch; `Collection.update` now respects this default.
- Removes implicit Indico and Wizenoze API key loading.
- Corrects log names to "datagrowth" instead of "datascope".
- Adds a `copy_dataset` command that will copy a dataset by signature.
- The `async` configuration has been removed from the settings file.
- A `resource_exception_log_level` setting now controls at what level `DGResourceExceptions` will get logged.
- Additionally `resource_exception_reraise` now controls whether `DGResourceExceptions` get reraised.
- The fallback for `JSONField` imports from `django.contrib.postgres.fields` has been removed.
- Adds the `global_allow_redirects` configuration, which controls how the requests library will handle redirects. Defaults to `True`, even for "head" requests.
- Exposes `ProcessorFactory` and `DataStorageFactory` to easily build processors and datatypes in the future.
- Adds the `Collection.reload_document_ids` method to be able to load `Document.id` after `bulk_create`.
- For consistent `Resource` serialization adds `serialize_resources` and `update_serialized_resources`.
- Experimental support for `ResourceFixturesMixin` that can be used to load resource content through fixture files.
- Cancelling a `HttpFileResource` will result in an empty body instead of a body of `None`.
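The batched variants above can be pictured with a plain generator. This is not Datagrowth's code; the `add_batches` function below is a hypothetical stand-in that only illustrates the two behaviours named in the list: yielding batches lazily instead of building everything in-memory, and skipping items that compare equal to existing ones via `__eq__`:

```python
# Hypothetical sketch of generator-based batching with equality skipping.
def add_batches(existing, candidates, batch_size=100):
    """Yield lists of at most batch_size candidates not already in existing."""
    batch = []
    for document in candidates:
        if document in existing:  # membership test relies on __eq__ semantics
            continue
        batch.append(document)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # leftover partial batch


existing = [{"id": 1}, {"id": 2}]
candidates = [{"id": 1}, {"id": 3}, {"id": 4}, {"id": 5}]
batches = list(add_batches(existing, candidates, batch_size=2))
# batches == [[{"id": 3}, {"id": 4}], [{"id": 5}]]
```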

## v0.18

- Adds support for Python 3.10 and drops support for Python 3.6.
- Uses the html.parser instead of the html5lib parser when parsing HTML pages.
- Fetches the last `Resource` when retrieving from cache to prevent `MultipleObjectsReturned` exceptions in async environments.
- Allows PUT as a `HttpResource` send method.

## v0.17

- It's recommended to update to Django 3.2 before using Datagrowth 0.17.
- Note that a Django migration is required to make Datagrowth 0.17 work.
- Drops support for Django 1.11.
- MySQL backends are no longer supported with Django versions below 3.2.
- Schemas on `Document` and `Collection` are removed as their usage is not recommended. Consider working schemaless when using these `DataStorage` derivative classes.
- As schemas are no longer available for `DataStorage` derivative classes, all write functionality from the default `DataStorage` API views is removed.
- `DataStorage` API URL patterns now require app labels as namespaces to prevent ambiguity.
- The API version can be specified using the `DATAGROWTH_API_VERSION` setting.
- `DataStorage.update` is reintroduced because of potential performance benefits.
- `Document.update` no longer takes first values from iterators given to it.
- `Collection.update` no longer accepts a single dict or `Document` for updating. It also works using lookups from `JSONField` instead of the inferior `reference` mechanic.
- `Collection.add` applies stricter type checking: `dict` and `Document` are no longer allowed.
- `DataStorage.url` now provides a generic way to build URLs for `Collection` and `Document`. These URLs expect URL patterns to exist with names in the format `v<api_version>:<app_label>:<model>-content`. This replaces the old formats, which were less flexible: v1::collection-content and v1::document-content.
- Usage of `DocumentPostgres` and `MysqlDocument` is deprecated. Remove these as base classes.
- `HttpResource` will use `django.contrib.postgres.fields.JSONField` or `django.db.models.JSONField` for the `request` and `head` fields.
- `ShellResource` will use `django.contrib.postgres.fields.JSONField` or `django.db.models.JSONField` for the `command` field.
- The resources and datatypes modules now each have an admin module to import `AdminModels` easily.
- `ConfigurationProperty` now uses a simpler constructor and allows defaults for all arguments.
- Removes the unused `global_token` default configuration.
- Removes the unused `http_resource_batch_size` default configuration.
- HTTP errors 420, 429, 502, 503 and 504 will now trigger a backoff delay. When this happens the `HttpResource` will sleep for the amount of seconds specified in the `global_backoff_delays` setting. Set `global_backoff_delays` to an empty list to disable this behaviour.
- Allows override of `HttpResource.uri_from_url` and `HttpResource.hash_from_data`.
- To extract from object values you now need to set `extract_processor_extract_from_object_values` to `True`. The default is `False` and will result in extraction from the object directly.
- `ShellResource` now implements `interval_duration` to allow the system to pause between runs. Useful when the command has some sort of rate limit.
- `ExtractProcessor` now supports the application/xml content type.
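The backoff behaviour can be sketched as a simple retry loop. This is an illustration, not `HttpResource`'s implementation: the `send_with_backoff` function and the fake request are invented for the example, and the delay values would in practice come from the `global_backoff_delays` setting:

```python
import time

# Status codes the changelog names as triggering a backoff delay.
BACKOFF_STATUSES = {420, 429, 502, 503, 504}


# Hypothetical sketch: retry once per configured delay, sleeping in between.
# An empty delays list means no retries, i.e. backing off is disabled.
def send_with_backoff(request, backoff_delays, sleep=time.sleep):
    status = request()
    for delay in backoff_delays:
        if status not in BACKOFF_STATUSES:
            break
        sleep(delay)  # wait before retrying
        status = request()
    return status


# Fake request that fails twice with 503 before succeeding; delays are
# recorded instead of slept so the sketch runs instantly.
responses = iter([503, 503, 200])
slept = []
status = send_with_backoff(lambda: next(responses), [4, 8, 16], sleep=slept.append)
# status == 200 and slept == [4, 8]
```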

## v0.16

- Adds support for Python 3.8 and removes support for Python 3.5.
- Updates `psycopg2-binary` to `2.8.4`.
- HTTP tasks no longer use `core` as a prefix, but `http_resource` instead.
- Shell tasks no longer use `core` as a prefix, but `shell_resource` instead.
- HTTP task and shell task configurations require an app label prefix for any `Resource`.
- The `load_session` decorator now accepts `None` as a session and will create a `requests.Session` when it does.
- The `update` method has been removed from the `DataStorage` base class.
- The `data_hash` field may now be empty in the admin on any `Resource` (requires a minor migration).
- The sleep dictated by `interval_duration` is executed by `HttpResource`, not by the HTTP tasks.
- `ConfigurationType` still works with the "async" property, but migrates it internally to "asynchronous".
- Modern mime types like application/vnd.api+json get processed as application/json.
- You can now specify to what `datetime` the `Resource.purge_after` should get set when a `Resource` gets saved. The `dict` specified in the `purge_after` configuration contains the kwargs for a `timedelta` init. This `timedelta` gets added to `datetime.now`. This means that using `{"days": 30}` as `purge_after` will set `Resource.purge_after` to 30 days into the future upon creation. The `global_purge_after` default configuration should be an empty `dict`.
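The `purge_after` mechanic above amounts to simple `timedelta` arithmetic. The `compute_purge_after` helper below is made up for illustration; only the `{"days": 30}` configuration shape comes from the changelog itself:

```python
from datetime import datetime, timedelta


# Hypothetical helper illustrating the purge_after configuration: the
# configured dict is passed as kwargs to timedelta and added to "now".
def compute_purge_after(purge_after_config, now=None):
    if not purge_after_config:  # empty dict, the global_purge_after default
        return None
    now = now or datetime.now()
    return now + timedelta(**purge_after_config)


saved_at = datetime(2020, 1, 1)
purge_at = compute_purge_after({"days": 30}, now=saved_at)
# purge_at == datetime(2020, 1, 31): 30 days into the future
```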

## v0.15

- Renamed exceptions that are prefixed with DS to names prefixed with DG. This migrates Datascope exceptions to Datagrowth exceptions. Affected exceptions: `DSNoContent`, `DSHttpError403LimitExceeded`, `DSHttpError400NoToken`, `DSHttpWarning300` and `DSInvalidResource`.
- `batchize` used to be a function that returned batches and possibly a leftover batch. Now `ibatch` creates batches internally.
- `reach` no longer accepts paths not starting with `$`.
- Collection serializers do not include their content by default any more. Add it yourself by appending to `default_fields` or use the collection-content endpoint.
- A `google_cx` config value is no longer provided by default. It should come from the `GOOGLE_CX` setting in your settings file.
- The `register_config_defaults` alias is no longer available. Use `register_defaults` directly.
- The `MOCK_CONFIGURATION` alias is no longer available. Omit the configuration altogether and use `register_defaults`.
- Passing a default configuration to `load_config` is deprecated. Use `register_defaults` instead.
- `ExtractProcessor` now raises `DGNoContent`.
- `fetch_only` is renamed to `cache_only`.
- Non-existing resources will now raise a `DGResourceDoesNotExist` if `cache_only` is `True`.
- The `meta` property is removed from `Resource`; use the `variables` method instead.
- All data hashes will be invalidated, because the hasher now sorts keys.
- `schema` is allowed to be empty on `DataStorage`, which means there will be no validation by default. This is recommended, but requires migrations for some projects.
- `_handle_errors` has been renamed to `handle_errors` and is an explicit candidate for overriding.
- `_update_from_response` has been renamed to `_update_from_results` for a more consistent `Resource` API.
- Dumps KaldiNL results into an output folder instead of the KaldiNL root.
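The `batchize`-to-`ibatch` change mentioned above swaps a list-building function for a generator that handles the leftover batch itself. `ibatch` is a real Datagrowth utility, but the implementation below is a simplified sketch of the idea rather than the library's code:

```python
from itertools import islice


# Simplified sketch of generator-based batching in the style of ibatch.
def ibatch(iterable, batch_size):
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # exhausted: no separate leftover handling needed
        yield batch


batches = list(ibatch(range(7), batch_size=3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```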