
feat: add data for archive 2024_07_01 #228

Merged 1 commit into patrickhulce:master on Sep 13, 2024

Conversation

@Nigui (Contributor) commented Aug 26, 2024

This PR is here to expose and suggest a fix for inaccurately computed data when a third party runs a script from a dynamic subdomain.

What's wrong here is that some third parties use a dynamic subdomain to serve their main script on websites (e.g. .domain.com). Some of these subdomain scripts are saved in the observed-domains JSON file as results of the sql/all-observed-domains-query.sql query, but when analyzing the HTTP Archive database we found that many are ignored because of their number of occurrences (fewer than 50, cf. the SQL query's HAVING clause).

In this PR we've rewritten all-observed-domains-query.sql to fix this issue.
To sum up the change: we keep observed domains with an occurrence count below 50 only if their mapped entity (based on entity.js) has a total occurrence count (across all of its declared domains) greater than 50.

The rest of the flow remains the same. It may be optimized in the future.

We don't rewrite existing data; instead we compute fresh data from the July 2024 HTTP Archive with the new query.
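
As an illustration only, here is a minimal BigQuery-style sketch of the rule described above, under assumed names: observed_requests, domain, entity, and page are hypothetical stand-ins rather than the actual schema of sql/all-observed-domains-query.sql, and the entity mapping from entity.js is assumed to already be available as a plain column.

```sql
-- Hedged sketch (hypothetical table/column names): keep a domain if it appears
-- on at least 50 pages itself, or if the entity it maps to totals more than 50
-- pages across all of its declared domains. The exact comparisons in the real
-- query may differ.
WITH per_domain AS (
  SELECT
    domain,
    entity,                           -- mapping assumed pre-resolved from entity.js
    COUNT(DISTINCT page) AS page_count
  FROM observed_requests              -- hypothetical source table
  GROUP BY domain, entity
),
per_entity AS (
  SELECT entity, SUM(page_count) AS entity_page_count
  FROM per_domain
  WHERE entity IS NOT NULL
  GROUP BY entity
)
SELECT d.domain, d.page_count
FROM per_domain AS d
LEFT JOIN per_entity AS e USING (entity)
WHERE d.page_count >= 50
   OR COALESCE(e.entity_page_count, 0) > 50
```

The real query's thresholds and join strategy may differ; the point is only that the per-domain cutoff is relaxed when the owning entity clears the threshold in aggregate.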

vercel bot commented Aug 26, 2024

third-party-web: ✅ Ready (preview updated Sep 13, 2024, 6:47am UTC)

@patrickhulce (Owner) left a comment

Thanks so much for the hard work here!

I like the high-level approach of including all observed domains from entities whose domains total more than 50 occurrences, but I have two primary concerns with the specific implementation here.

  1. It creates a circular dependency problem for observed domains relying on entity mappings.
  2. The data is now too big to continue with our "checked into the repo as JSON" approach.

I have a few ideas on these but would like to hear what you think.

What if we preserve the existing observed-domains step with no dependency on entities (which helps the scripts in the repo that aid in identifying new domains that require classification), and add a new step for your inclusion of the missed lower-volume domains, which lives in a BigQuery table instead of being checked in?

@Nigui (Contributor, Author) commented Aug 27, 2024

Thank you @patrickhulce.

OK, I hadn't noticed that use case of classifying newly observed domains.

About file size, yes, that's a real issue I faced (blocked by GitHub's file size limit, I had to transform the flow to a stream for scalability in #224). So we can't remove the occurrence limit on the query that generates an uploaded file (observed domains).

I like what I understand of your suggestion, but could you please confirm I'm understanding it correctly?

You suggest keeping the initial query that gets observed domains (without any entity mapping, but with a minimum of 50 occurrences) and storing the result in a file uploaded to the repo (basically this stream), right? So, essentially getting back the initial behavior. The problems are that the generated YYYY-MM-DD-observed-domain.json file won't contain all the data that feeds the performance analysis, and the domain-map.csv file (built on top of YYYY-MM-DD-observed-domain.json) won't contain domains with fewer than 50 occurrences. It could confuse reviewers and raise questions 🤔.

Then we add an extra step that runs the new query (with entity mapping and the limit applied at the entity level) and saves the mapped observed domains into a dedicated table (it could be the existing third_party_web/YYYY_MM_DD table). That's where the real data will be stored, but depending on the data updaters it may not be public (as it requires write permissions on the lighthouse-infrastructure/third_party_web dataset).

Finally, the last step would not change and would keep using the third_party_web/YYYY_MM_DD table to run the performance computation query.
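
To make the proposed split concrete, here is a hedged sketch of the three steps under assumed object names: the dated table name follows the lighthouse-infrastructure.third_party_web / YYYY_MM_DD convention mentioned above, and entity_mapped_observed_domains is a hypothetical placeholder for the result of the entity-level query sketched earlier, not an actual table in the dataset.

```sql
-- Step 1 (unchanged): observed domains with a minimum of 50 occurrences, no
-- entity mapping; this result keeps being exported to YYYY-MM-DD-observed-domain.json.
-- (Query omitted here.)

-- Step 2 (new): entity-mapped domains, with the threshold applied at the entity
-- level, materialized into the dedicated dated table (assumed name below).
CREATE OR REPLACE TABLE `lighthouse-infrastructure.third_party_web.2024_07_01` AS
SELECT domain, entity, page_count
FROM entity_mapped_observed_domains;  -- hypothetical placeholder, see lead-in

-- Step 3 (unchanged): the performance computation query keeps reading from
-- `lighthouse-infrastructure.third_party_web.2024_07_01`.
```

The CREATE OR REPLACE form is just one option; reusing the existing dated table, as suggested above, would look the same from step 3's point of view.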

@Nigui (Contributor, Author) commented Sep 2, 2024

I've updated the script according to the previous discussion. Let me know if it's what you expected, @patrickhulce.
By the way, before merging this PR, I'll have to (re)generate the data with fix #229.

vercel bot commented Sep 11, 2024

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

@Nigui (Contributor, Author) commented Sep 11, 2024

Hello @patrickhulce
I've merged master into this branch, then used the fixed and automated script to compute the data for July.
Do you agree with this new data computation method? If so, would it be OK to merge?
Thank you 🙏

@patrickhulce (Owner) commented

Thanks @Nigui! Will need a manual rebase I think, though.

Commits:

- …50 different pages and generate 2024_07 data
- feat: keep saving in file all observed domains with minimum observations
- fix: tpw table exists
- feat: automated script splitting into 3 steps, add logs and clean table if needed
- feat: compute data for 2024_07_01
@Nigui (Contributor, Author) commented Sep 13, 2024

> Thanks @Nigui! Will need a manual rebase I think, though.

Sorry, I merged master instead of rebasing. But it's OK now :)

@patrickhulce merged commit eb07e11 into patrickhulce:master on Sep 13, 2024 (7 checks passed)
semantic-release bot commented

🎉 This PR is included in version 0.26.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@Kporal deleted the feat/2024-07-01 branch on September 16, 2024, 07:09