
feat: add data for archive 2024_07_01 #228

Merged 1 commit into patrickhulce:master on Sep 13, 2024

Conversation

@Nigui (Contributor) commented Aug 26, 2024

This PR is here to expose and suggest a fix for inaccurately computed data when a third party runs a script from a dynamic subdomain.

What's wrong here is that some third parties use a dynamic subdomain to serve their main script on websites (e.g. .domain.com). Some of these subdomain scripts are saved in the observed-domains JSON file as results of the sql/all-observed-domains-query.sql query, but when analyzing the HTTP Archive database we found that many are ignored because of their number of occurrences (fewer than 50, cf. the SQL query's HAVING clause).

In this PR we've rewritten all-observed-domains-query.sql to fix this issue.
To sum up the change: we keep observed domains with an occurrence count below 50 only if their mapped entity (based on entity.js) has a total occurrence count (across all of its declared domains) greater than 50.

The rest of the flow remains the same. It may be optimized in the future.

We don't rewrite existing data; instead we compute fresh data from the July 2024 HTTP Archive with the new query.
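
As an illustration only, here is a minimal BigQuery-style sketch of the rule described above, under assumed names: observed_requests, domain, entity, and page are hypothetical stand-ins rather than the actual schema of sql/all-observed-domains-query.sql, and the entity mapping from entity.js is assumed to already be available as a plain column.

```sql
-- Hedged sketch (hypothetical table/column names): keep a domain if it appears
-- on at least 50 pages itself, or if the entity it maps to totals more than 50
-- pages across all of its declared domains. The exact comparisons in the real
-- query may differ.
WITH per_domain AS (
  SELECT
    domain,
    entity,                           -- mapping assumed pre-resolved from entity.js
    COUNT(DISTINCT page) AS page_count
  FROM observed_requests              -- hypothetical source table
  GROUP BY domain, entity
),
per_entity AS (
  SELECT entity, SUM(page_count) AS entity_page_count
  FROM per_domain
  WHERE entity IS NOT NULL
  GROUP BY entity
)
SELECT d.domain, d.page_count
FROM per_domain AS d
LEFT JOIN per_entity AS e USING (entity)
WHERE d.page_count >= 50
   OR COALESCE(e.entity_page_count, 0) > 50
```

The real query's thresholds and join strategy may differ; the point is only that the per-domain cutoff is relaxed when the owning entity clears the threshold in aggregate.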

vercel bot commented Aug 26, 2024

third-party-web: ✅ Ready (preview updated Sep 13, 2024, 6:47am UTC)

@patrickhulce (Owner) left a comment

Thanks so much for the hard work here!

I like the high-level approach of including all observed domains from entities whose domains total more than 50 occurrences, but I have two primary concerns with the specific implementation here.

  1. It creates a circular dependency problem for observed domains relying on entity mappings.
  2. The data is now too big to continue with our "checked into the repo as JSON" approach.

I have a few ideas on these but would like to hear what you think.

What if we preserve the existing observed-domains step with no dependency on entities (which helps the scripts in the repo that aid in identifying new domains that require classification), and add a new step for your inclusion of the missed lower-volume domains, which lives in a BigQuery table instead of being checked in?

@Nigui (Contributor, Author) commented Aug 27, 2024

Thank you @patrickhulce.

OK, I hadn't noticed that use case of classifying newly observed domains.

About file size, yes, that's a real issue I faced (blocked by GitHub's file size limit, I had to transform the flow to a stream for scalability in #224). So we can't remove the occurrence limit on the query that generates an uploaded file (observed domains).

I like what I understand of your suggestion, but could you please confirm I'm understanding it correctly?

You suggest keeping the initial query that gets observed domains (without any entity mapping, but with a minimum of 50 occurrences) and storing the result in a file uploaded to the repo (basically this stream), right? So, essentially getting back the initial behavior. The problems are that the generated YYYY-MM-DD-observed-domain.json file won't contain all the data that feeds the performance analysis, and the domain-map.csv file (built on top of YYYY-MM-DD-observed-domain.json) won't contain domains with fewer than 50 occurrences. It could confuse reviewers and raise questions 🤔.

Then we add an extra step that runs the new query (with entity mapping and the limit applied at the entity level) and saves the mapped observed domains into a dedicated table (it could be the existing third_party_web/YYYY_MM_DD table). That's where the real data will be stored, but depending on the data updaters it may not be public (as it requires write permissions on the lighthouse-infrastructure/third_party_web dataset).

Finally, the last step would not change and would keep using the third_party_web/YYYY_MM_DD table to run the performance computation query.
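
To make the proposed split concrete, here is a hedged sketch of the three steps under assumed object names: the dated table name follows the lighthouse-infrastructure.third_party_web / YYYY_MM_DD convention mentioned above, and entity_mapped_observed_domains is a hypothetical placeholder for the result of the entity-level query sketched earlier, not an actual table in the dataset.

```sql
-- Step 1 (unchanged): observed domains with a minimum of 50 occurrences, no
-- entity mapping; this result keeps being exported to YYYY-MM-DD-observed-domain.json.
-- (Query omitted here.)

-- Step 2 (new): entity-mapped domains, with the threshold applied at the entity
-- level, materialized into the dedicated dated table (assumed name below).
CREATE OR REPLACE TABLE `lighthouse-infrastructure.third_party_web.2024_07_01` AS
SELECT domain, entity, page_count
FROM entity_mapped_observed_domains;  -- hypothetical placeholder, see lead-in

-- Step 3 (unchanged): the performance computation query keeps reading from
-- `lighthouse-infrastructure.third_party_web.2024_07_01`.
```

The CREATE OR REPLACE form is just one option; reusing the existing dated table, as suggested above, would look the same from step 3's point of view.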

@Nigui (Contributor, Author) commented Sep 2, 2024

I've updated the script according to the previous discussion. Let me know if it's what you expected, @patrickhulce.
By the way, before merging this PR, I'll have to (re)generate the data with fix #229.

vercel bot commented Sep 11, 2024

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

@Nigui (Contributor, Author) commented Sep 11, 2024

Hello @patrickhulce
I've merged master into this branch, then used the fixed and automated script to compute the data for July.
Do you agree with this new data computation method? If so, would it be OK to merge?
Thank you 🙏

@patrickhulce (Owner) commented

Thanks @Nigui! Will need a manual rebase I think, though.

Commits:

- …50 different pages and generate 2024_07 data
- feat: keep saving in file all observed domains with minimum observations
- fix: tpw table exists
- feat: automated script splitting into 3 steps, add logs and clean table if needed
- feat: compute data for 2024_07_01
@Nigui (Contributor, Author) commented Sep 13, 2024

> Thanks @Nigui! Will need a manual rebase I think, though.

Sorry, I merged master instead of rebasing. But it's OK now :)

@patrickhulce merged commit eb07e11 into patrickhulce:master on Sep 13, 2024 (7 checks passed)
semantic-release bot commented

🎉 This PR is included in version 0.26.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@Kporal deleted the feat/2024-07-01 branch on September 16, 2024, 07:09