-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shopify: Incremental and other improvements #222
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steinitzu thanks! looks good. besides a few suggestions to simplify code
@adrianbr if you could look at the comment about created_at
behavior
|
||
Returns: | ||
Iterable[TDataItem]: A generator of products. | ||
""" | ||
page = shopify.Product.find( | ||
updated_at_min=updated_at.last_value, created_at_min=start_date | ||
params = dict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah interesting question if we should set created_at_min=start_date
as we did in the previous version.
- if we set it items that were created before start_date will never be extracted, even if they were updated after that date
- if we do not set it, any updated item may be extracted
@adrianbr any opinions? if not, I'd set it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better if we use a separate min_created_at
argument in the source I think.
It's a little problematic with end value loading to use the same start date, because an item's created and updated can each be in range of different chunks so they'd never get loaded even if it should be in the total range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you are right. we need to handle backloading differently.
- when end date is specified we use only
created_at_min
andcreated_at_max
to select range - when we have incremental load we specify
created_at_min
(set to start_date - which is also initial_value) andupdated_at_min
set to last_value
there's a grey are here: should we also track updates of records from backloaded range? in case of the above it is not the case. we'd need another parameter to set created_at and updated_at separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah I'd say it makes sense to load updates from backloaded range too.
What is the idea behind the created_at cutoff originally?
I'm struggling with seeing the usefulness in it beyond just filtering by updated_at
. I'd think old orders you want to ignore aren't updated anyway usually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are always getting the most recent state of the entity and update/create just filter which entities you want to get. so there's no way IMO to get previous state of it. anyway. I kind of agree with what you say.
- let's have our start/end date settings working only with
updated_at_min
andupdated_at_max
(this one you need to add) - I'd add another argument
created_at_min
to cutoff the entities created before that date. the argument is optional and defaults toFIRST_DAY_OF_MILLENNIUM
the typical usage would be like in the incremental_load_with_backloading
example: backloading and then incremental load. all based on updated_at.
On top of that we give possibility to filter out all records that were created before initial value of current_start_date
in the demo. so the updated_at_min
argument is always set to it, during backloading and on incremental load.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes a lot of sense, and I think it's more clear with the separate argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added created_at_min
arg now and test for it.
updated_at_max
is used now too.
sources/shopify_dlt/date_helper.py
Outdated
elif isinstance(value, date): | ||
return pendulum.datetime(value.year, value.month, value.day) | ||
elif isinstance(value, str): | ||
return pendulum.parse(value) # type: ignore[return-value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pendulum parse cannot be trusted, my take is to accept a few ISO like dates, we have it in dlt.common.time
module: parse_iso_like_datetime
. I'm also not sure that pendulum.parse
always returns datetime with timezone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I'm using parse_iso_like_datetime
now. It does currently create date objects from iso dates so I'm checking for that https://github.com/dlt-hub/verified-sources/pull/222/files#diff-3c1b45073fee3f98a1147e876dab634e0676b984b509690c76a4957e62d9efa0R32-R33
|
||
Returns: | ||
Iterable[TDataItem]: A generator of products. | ||
""" | ||
page = shopify.Product.find( | ||
updated_at_min=updated_at.last_value, created_at_min=start_date | ||
params = dict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you are right. we need to handle backloading differently.
- when end date is specified we use only
created_at_min
andcreated_at_max
to select range - when we have incremental load we specify
created_at_min
(set to start_date - which is also initial_value) andupdated_at_min
set to last_value
there's a grey are here: should we also track updates of records from backloaded range? in case of the above it is not the case. we'd need another parameter to set created_at and updated_at separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! this is really good now!
Tell us what you do here
verified source
)Relevant issue
closes #217
More PR info