Merge develop #4

Open · wants to merge 6 commits into base: prod
14 changes: 6 additions & 8 deletions .github/workflows/build_images.yml
@@ -1,26 +1,24 @@
name: "Build Docker Images"

on:
workflow_run:
workflows: ["Run Tests"]
types:
- completed
workflow_call:
secrets:
GOOGLE_APPLICATION_CREDENTIALS:
required: true


jobs:
build:
name: "Build Docker Images"
runs-on: ubuntu-latest
environment: ${GITHUB_REF##*/}
environment: ${{ github.ref }}

defaults:
run:
shell: bash
working-directory: ./

steps:
- name: "Terminate if tests failed"
if: ${{ github.event.workflow_run.conclusion != 'success' }}
run: exit 1

- name: Checkout
uses: actions/checkout@v3
10 changes: 10 additions & 0 deletions .github/workflows/on_pull_request.yml
@@ -0,0 +1,10 @@

name: "On Pull Request"

on:
pull_request:
branches: [ "prod", "dev" ]

jobs:
call-run-tests:
uses: ./.github/workflows/test.yml
23 changes: 23 additions & 0 deletions .github/workflows/on_push.yml
@@ -0,0 +1,23 @@

name: "On Push"

on:
push:
branches: [ "prod", "dev" ]

jobs:
call-run-tests:
uses: ./.github/workflows/test.yml
call-build-images:
uses: ./.github/workflows/build_images.yml
secrets:
GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
needs: call-run-tests
call-terraform:
uses: ./.github/workflows/terraform.yml
secrets:
GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
REDDIT_CLIENT_ID: ${{ secrets.REDDIT_CLIENT_ID }}
REDDIT_CLIENT_SECRET: ${{ secrets.REDDIT_CLIENT_SECRET }}
HUGGINGFACE_TOKEN: ${{ secrets.HUGGINGFACE_TOKEN }}
needs: call-build-images
19 changes: 11 additions & 8 deletions .github/workflows/terraform.yml
@@ -1,26 +1,29 @@
name: 'Terraform Apply'

on:
workflow_run:
workflows: ["Build Docker Images"]
types:
- completed
workflow_call:
secrets:
GOOGLE_APPLICATION_CREDENTIALS:
required: true
REDDIT_CLIENT_ID:
required: true
REDDIT_CLIENT_SECRET:
required: true
HUGGINGFACE_TOKEN:
required: true

jobs:
terraform:
name: 'Terraform'
runs-on: ubuntu-latest
environment: ${GITHUB_REF##*/}
environment: ${{ github.ref }}

defaults:
run:
shell: bash
working-directory: ./terraform

steps:
- name: "Terminate if building images failed"
if: ${{ github.event.workflow_run.conclusion != 'success' }}
run: exit 1

- name: Checkout
uses: actions/checkout@v3
5 changes: 1 addition & 4 deletions .github/workflows/test.yml
@@ -4,10 +4,7 @@
name: 'Run Tests'

on:
push:
branches: [ "prod", "dev" ]
pull_request:
branches: [ "prod", "dev" ]
workflow_call

permissions:
contents: read
2 changes: 1 addition & 1 deletion Makefile
@@ -3,7 +3,7 @@ pre-commit:
pre-commit run --all-files

test: pre-commit
cd tests && export REDDIT_CLIENT_ID="" REDDIT_CLIENT_SECRET="" SUBREDDITS="" HUGGINGFACE_TOKEN="" GCS_RAW_BUCKET_NAME="" GCS_TRANSFORMED_BUCKET_NAME="" && python3 -m pytest -v
cd tests && export REDDIT_CLIENT_ID="" REDDIT_CLIENT_SECRET="" SUBREDDITS="" HUGGINGFACE_TOKEN="" GCS_RAW_BUCKET_NAME="" GCS_TRANSFORMED_BUCKET_NAME="" BIGQUERY_DATASET_ID="" BIGQUERY_TABLE_ID="" && python3 -m pytest -v

first-time-setup:
gcloud artifacts repositories create etl-images --location=asia-southeast1 --repository-format=docker
2 changes: 1 addition & 1 deletion README.md
@@ -12,7 +12,7 @@ Interestingly, this project also yields a world ranking of universities based on

!["Architecture"](images/architecture.drawio.png)

All infrastructure is hosted on Google Cloud Platform and managed via Terraform.
All infrastructure is hosted on Google Cloud Platform and managed via Terraform. The `dev` and `prod` branches each have their own separate sets of infrastructure, which are deployed automatically upon passing CI/CD.

- Idempotent ETL scripts are Python 3.11 docker containers running on Cloud Run. Backfilling can be triggered by sending requests to the extract container.

4 changes: 2 additions & 2 deletions src/common/config.py
@@ -12,8 +12,8 @@
GCS_TRANSFORMED_BUCKET_NAME = os.environ["GCS_TRANSFORMED_BUCKET_NAME"]

BIGQUERY_PROJECT_ID = "university-subreddits"
BIGQUERY_DATASET_ID = "subreddit_metrics"
BIGQUERY_TABLE_ID = "subreddit_metrics"
BIGQUERY_DATASET_ID = os.environ["BIGQUERY_DATASET_ID"]
BIGQUERY_TABLE_ID = os.environ["BIGQUERY_TABLE_ID"]

HUGGINGFACE_TOKEN = os.environ["HUGGINGFACE_TOKEN"]
HUGGINGFACE_MODEL = "finiteautomata/bertweet-base-sentiment-analysis"
38 changes: 7 additions & 31 deletions src/common/reddit_client.py
@@ -38,37 +38,12 @@ def _remove_submissions_not_on_date(
submissions: list[dict],
date: Date,
) -> list[dict]:
"""Trims the post list from the front and back.
Assumes that `submissions` is sorted in descending order of `created_utc`
"""
if len(submissions) == 0:
return []

latest_submission_date = datetime.utcfromtimestamp(submissions[0]["created_utc"]).date()
oldest_submission_date = datetime.utcfromtimestamp(submissions[-1]["created_utc"]).date()
no_submissions_made_on_date = (date > latest_submission_date) or (oldest_submission_date > date)
if no_submissions_made_on_date:
return []

for start in range(len(submissions)):
created_datetime = datetime.utcfromtimestamp(
submissions[start]["created_utc"],
)
should_remove_post = created_datetime.date() > date
if not should_remove_post:
submissions = submissions[start:]
break

for end in range(len(submissions) - 1, -1, -1):
created_datetime = datetime.utcfromtimestamp(
submissions[end]["created_utc"],
)
should_remove_post = created_datetime.date() < date
if not should_remove_post:
submissions = submissions[: end + 1]
break

return submissions
"""Removes submissions not made on date"""
return [
submission
for submission in submissions
if datetime.utcfromtimestamp(submission["created_utc"]).date() == date
]

def fetch_submissions_made_on_date(self, subreddit: str, date: Date) -> list[dict]:
posts_made_on_date = []
@@ -118,4 +93,5 @@ def fetch_submissions_made_on_date(self, subreddit: str, date: Date) -> list[dic
posts_made_on_date,
date,
)
print(len(posts_made_on_date))
return posts_made_on_date
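The rewritten helper above replaces the two trimming loops with a single comprehension. A minimal standalone sketch of that filter (the sample timestamps below are hypothetical, not taken from the PR):

```python
from datetime import date, datetime


def remove_submissions_not_on_date(submissions: list[dict], target: date) -> list[dict]:
    """Keep only submissions whose created_utc falls on the target UTC date."""
    return [
        submission
        for submission in submissions
        if datetime.utcfromtimestamp(submission["created_utc"]).date() == target
    ]


# Hypothetical sample: one post on 2023-06-01 UTC, one on 2023-06-02 UTC.
posts = [
    {"created_utc": 1685577600},  # 2023-06-01 00:00:00 UTC
    {"created_utc": 1685664000},  # 2023-06-02 00:00:00 UTC
]
print(len(remove_submissions_not_on_date(posts, date(2023, 6, 1))))  # → 1
```

A side effect of the simplification: unlike the old two-pointer version, the comprehension no longer requires `submissions` to be sorted by `created_utc`.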
6 changes: 3 additions & 3 deletions src/extract.py
@@ -102,10 +102,10 @@ def parse_and_check_date(input_date: str) -> Date:
date = datetime.strptime(input_date, "%d/%m/%Y").date()

today = datetime.now().date()
more_than_a_month_ago = (today - date).days > 30
if more_than_a_month_ago:
more_than_ten_days_ago = (today - date).days > 10
if more_than_ten_days_ago:
raise ValueError(
"Use extract_backfill.py to extract posts made more than a month ago.",
"Cannot extract for dates made more than 10 days ago. Data loss may occur.",
)

return date
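The tightened 10-day guard can be exercised in isolation; this sketch reproduces only the changed function from the diff (the rest of `extract.py` is omitted, and the input format is assumed from the `strptime` call):

```python
from datetime import datetime, timedelta
from datetime import date as Date


def parse_and_check_date(input_date: str) -> Date:
    """Parse a dd/mm/yyyy string, rejecting dates more than 10 days old."""
    date = datetime.strptime(input_date, "%d/%m/%Y").date()

    today = datetime.now().date()
    more_than_ten_days_ago = (today - date).days > 10
    if more_than_ten_days_ago:
        raise ValueError(
            "Cannot extract for dates made more than 10 days ago. Data loss may occur.",
        )

    return date


yesterday = (datetime.now() - timedelta(days=1)).strftime("%d/%m/%Y")
too_old = (datetime.now() - timedelta(days=11)).strftime("%d/%m/%Y")
print(parse_and_check_date(yesterday))  # accepted; too_old would raise ValueError
```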
24 changes: 24 additions & 0 deletions terraform/cloud-run.tf
@@ -40,6 +40,14 @@ resource "google_cloud_run_v2_service" "extract" {
name = "GCS_TRANSFORMED_BUCKET_NAME"
value = google_storage_bucket.transformed_data_bucket.name
}
env {
name = "BIGQUERY_DATASET_ID"
value = google_bigquery_dataset.subreddit_metrics.dataset_id
}
env {
name = "BIGQUERY_TABLE_ID"
value = google_bigquery_table.subreddit_metrics.table_id
}
}
service_account = google_service_account.etl.email
}
@@ -95,6 +103,14 @@ resource "google_cloud_run_v2_service" "transform" {
name = "GCS_TRANSFORMED_BUCKET_NAME"
value = google_storage_bucket.transformed_data_bucket.name
}
env {
name = "BIGQUERY_DATASET_ID"
value = google_bigquery_dataset.subreddit_metrics.dataset_id
}
env {
name = "BIGQUERY_TABLE_ID"
value = google_bigquery_table.subreddit_metrics.table_id
}
}
service_account = google_service_account.etl.email
}
@@ -150,6 +166,14 @@ resource "google_cloud_run_v2_service" "load" {
name = "GCS_TRANSFORMED_BUCKET_NAME"
value = google_storage_bucket.transformed_data_bucket.name
}
env {
name = "BIGQUERY_DATASET_ID"
value = google_bigquery_dataset.subreddit_metrics.dataset_id
}
env {
name = "BIGQUERY_TABLE_ID"
value = google_bigquery_table.subreddit_metrics.table_id
}
}
service_account = google_service_account.etl.email
}