
Scraper cleanup updates: added type annotation, logging, doc strings, and error handling #141

Merged
12 commits merged into main, Oct 6, 2024

Conversation

nicolassaw
Contributor

Making the scraper clearer and easier to maintain with documentation, logging, error handling, and type annotation

It makes the code a lot longer though.

end_date: Optional[str] = None,
court_calendar_link_text: Optional[str] = None,
case_number: Optional[str] = None
) -> Tuple[int, str, str, str, Optional[str]]:
Collaborator

So this tuple type totally works and is very usable, but you might consider implementing a config class to hold all of these in a named way. That might look something like this:

from dataclasses import dataclass

@dataclass
class ScraperConfig:
    ms_wait: Optional[int] = None
    start_date: str | None = None  # this is an alternate way to write the type btw, some people prefer it. Optional[T] == T | None
    end_date: str | None = None
    ...

Then you can use it like my_config = ScraperConfig(ms_wait=100, end_date='2024-01-01', ...) and access the fields like my_config.ms_wait

Contributor Author

Sick. Okay. I'll figure this out and implement it in a separate PR for the scraper module.

Contributor Author

Okay I tried implementing this on the set_defaults piece and got lost in the sauce. I'll have to ask you about this next time we chat.

    TypeError: If the provided county name is not a string.
    """
    if not isinstance(county, str):
        raise TypeError("The county name must be a string.")
Collaborator

So you can definitely do this kind of type guarding, but it's not commonly used unless you expect people to be using the code from outside the module, and you have a strong reason you need a specific type. If you run mypy on the code during build (I can set up a build pipeline btw) it will catch any silly mistakes like passing an int in here, with no runtime check needed. That said, feel free to keep them if you like them
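
For contrast, the annotation-only approach the comment describes might look like this (function name hypothetical): mypy would flag a call like build_logger_name(42) during static analysis, with no runtime isinstance guard needed.

```python
def build_logger_name(county: str) -> str:
    # No isinstance check: the `county: str` annotation lets mypy
    # reject build_logger_name(42) at check time rather than runtime.
    return f"scraper.{county.lower()}"

print(build_logger_name("Hays"))  # scraper.hays
```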

Contributor Author

Sounds good! I'll remove it to keep things cleaner.

Creates and configures a requests session for interacting with web pages.

This method sets up a `requests.Session` with SSL verification disabled and suppresses
related warnings.
Collaborator

Let's make the ssl verify an optional parameter, and only disable warnings if the user requests no verification. I believe we can turn verification on, and it's a good practice to do so, but we want to keep the ability to turn it off again if needed.
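
A sketch of that parameter, assuming the session is built in a helper (helper name hypothetical; requests and urllib3 are the libraries the scraper already uses):

```python
import requests
import urllib3

def create_session(verify_ssl: bool = True) -> requests.Session:
    """Create a requests session; SSL verification stays on by default."""
    session = requests.Session()
    session.verify = verify_ssl
    if not verify_ssl:
        # Only silence the warning when the caller explicitly opts out.
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return session
```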

Contributor Author
@nicolassaw Sep 21, 2024

I'll turn it back on and test and see what happens. Plus I'll make it an optional parameter

Raises:
OSError: If there is an error creating the directories.
"""
case_html_path = os.path.join(os.path.dirname(__file__), "..", "..", "data", county, "case_html")
Collaborator

In general, using __file__ to find other files is not recommended. The reason is that if someone wants to import this code and use it, they have to place the input files in some weird directory where your code is, rather than somewhere convenient for them. I would recommend making the path to the data folder an argument to this module, so that people can just supply a path that works for them. You can default to something like ./data, which will be relative to where you run python from, rather than where this source file is.

Contributor Author

Sounds good! I'll make this case_html_path a parameter you can pass it and will default to this directory.


Raises:
Exception: If the county is not found in the CSV file or if required data is missing, an exception is raised
and logged.
Collaborator

There's a standard format for this kind of docstring, the one I'm familiar with would look like this. The advantage of using this format is that we can run https://www.sphinx-doc.org/en/master/ to make a docs site for our code automatically, and it will do lots of nice things for you if you follow a format it knows.

"""
One line summary

Longer description body, lots of words words
words, multiline potentially.

:param county: the county to parse. (note: no type needed, that's in the type annotation already so the tool will grab it)
:param logger: the logger to use
:returns: a tuple of ... (same thing, no type needed unless it makes the language smoother)
:raises Exception: on these conditions (prefer a subtype of Exception, define your own if you need to, so people have something specific to catch)
"""

Contributor Author

Thanks! I'll update all of the doc strings to match this new format.

if not search_page_id:
write_debug_and_quit(
verification_text="Court Calendar link",
page_text=main_page_html,
logger=logger,
)
search_url = base_url + "Search.aspx?ID=" + search_page_id
raise ValueError("Court Calendar link not found on the main page.")
Collaborator

Just highlighting that this is not actually using a default value if one isn't found, which isn't consistent with the docs

Contributor Author

Oof. I'll make sure to fix this. Good catch!

Contributor Author

Actually wait. Do you mean not using a default search_url or a default base_url?

date_string = datetime.strftime(date, "%m/%d/%Y")
# loop through each judicial officer

for date in (start_date + timedelta(n) for n in range((end_date - start_date).days + 1)):
Collaborator

So this kind of nested iterator is confusing to me as a reader, I had to think for a minute to grok this. I think this might be more readable:

for n in range((end_date - start_date).days + 1):
    date = start_date + timedelta(n)
    ...
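
A fully runnable version of that refactor (dates chosen arbitrarily for illustration):

```python
from datetime import date, timedelta

start_date = date(2024, 1, 1)
end_date = date(2024, 1, 3)

# One date per loop iteration, inclusive of end_date.
for n in range((end_date - start_date).days + 1):
    current = start_date + timedelta(n)
    date_string = current.strftime("%m/%d/%Y")
    print(date_string)  # 01/01/2024, 01/02/2024, 01/03/2024
```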

)

scraper_instance, scraper_function = self.get_class_and_method(county, logger)
if scraper_instance and scraper_function:
Collaborator

Since the get_class_and_method can't return None anymore, you can just assume the results are there and remove this if and the error log. The exception will bubble up if it happens

Contributor Author

great! I'll remove it.

@nicolassaw
Contributor Author

@Matt343 Hi there Matt! Any more changes or thoughts before approving this?

@nicolassaw nicolassaw merged commit 44f5598 into main Oct 6, 2024
2 checks passed
@nicolassaw nicolassaw deleted the scraper-cleanup-updates branch October 6, 2024 18:01