[subscribestar] Refactoring extractor and handling audio content #5580
base: master
WyohKnott
commented
May 11, 2024
- New support for embedded audio
- New support for external links compatible with yt-dlp
- Add a content_type field at the post level for directory creation
- Major rework of the logic
- Added a check_if_supported_by_ytdlp helper function in util.py for yt-dlp external links handling
"content" : (extr(
'<div class="post-content', '<div class="post-uploads')
.partition(">")[2]),
}
We remove this, as it is the same as in the base class.
break
else:
content_type = "image"
The goal of this part is to check what type of content the post contains and add a content_type to the post data, so we can use it in the directory name. This is not perfect, as a post could probably contain multiple content types, but I do not have enough samples to test.
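As a rough illustration of this detection step, here is a minimal sketch built on Python's for/else. The marker strings and the name detect_content_type are assumptions made for the example, since the actual checks are not visible in this excerpt. Only the fallback to "image" comes from the snippet above.

```python
# Hypothetical (content type, marker) pairs, checked in order.
# The real detection strings used by the PR are not shown here.
CONTENT_MARKERS = (
    ("audio", 'type="audio/'),
    ("link", 'data-href="'),
)


def detect_content_type(html, markers=CONTENT_MARKERS):
    """Return the first content type whose marker occurs in html."""
    for content_type, marker in markers:
        if marker in html:
            # First match wins; later markers are not examined.
            break
    else:
        # The loop finished without a break: nothing matched.
        content_type = "image"
    return content_type
```

Because the first match wins, a post mixing several kinds of content is labelled with only one of them, which is the limitation mentioned above.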
"link": ('data-href="', '"', self._process_media_item), | ||
"audio": ('<source src="', '" type="audio/', | ||
self._process_media_item), | ||
} |
Here we define four content types:
- gallery: already handled before
- attachments: already handled before
- link: a new type to extract links from post bodies
- audio: for posts that have embedded audio
For each type, we have:
- the start marker of the detection rule
- the end marker of the detection rule
- the function that will process the extracted content
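To make the (start marker, end marker, processor) layout concrete, here is a self-contained sketch of how such a table could drive extraction. The helpers extract_all, passthrough and extract_media are hypothetical names written for this example and are not gallery-dl functions. The link and audio markers are the ones from the diff; the gallery and attachments entries are left out because their markers are not shown here.

```python
def extract_all(html, begin, end):
    """Yield every substring of html that sits between begin and end."""
    pos = 0
    while True:
        start = html.find(begin, pos)
        if start < 0:
            return
        start += len(begin)
        stop = html.find(end, start)
        if stop < 0:
            return
        yield html[start:stop]
        pos = stop + len(end)


def passthrough(value, media_type):
    """Hypothetical processor: wrap the extracted text as-is."""
    return {"url": value, "type": media_type}


# media type -> (start marker, end marker, processor)
RULES = {
    "link": ('data-href="', '"', passthrough),
    "audio": ('<source src="', '" type="audio/', passthrough),
}


def extract_media(html, rules=RULES):
    """Apply every rule to html and collect the processed results."""
    media = []
    for media_type, (begin, end, processor) in rules.items():
        for value in extract_all(html, begin, end):
            item = processor(value, media_type)
            if item:  # processors may return None to skip an item
                media.append(item)
    return media
```

In this sketch, an end marker such as '" type="audio/' is searched for from the first '<source src="' onwards, so a non-audio source tag placed before an audio one would be swallowed into the match. Whether that can happen on real SubscribeStar pages is something I cannot tell from this excerpt.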
if segment[key]:
content = processor(segment, key)
if content:
media.append(content)
We send the extracted text to the processor and, if the return value is not None, we append it to media.
for media in gallery_list:
if "/previews" in media["url"]:
self._warn_preview()
return {"url": media["url"], "type": media_type}
Gallery processing, not much changed here.
"name": text.unescape(text.extr(item, 'doc_preview-title">', "<")),
"url": text.unescape(text.extr(item, 'href="', '"')),
"type": media_type,
}
Attachment processing, not much changed here. I haven't been able to test it, as I have never seen this type of content.
item[media_type]):
return {"url": "ytdl:" + item[media_type], "type": media_type}
elif media_type == "audio":
return {"url": item[media_type], "type": media_type}
Here we process the newly handled types:
- if it is a link and yt-dlp can download it, we append it
- if it is audio, we append it
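The two branches above can be sketched as a standalone function. This is only an approximation of the snippet: the is_supported parameter is a hypothetical stand-in for the check_if_supported_by_ytdlp helper, whose signature is not shown in this excerpt, and the default used here simply rejects every link.

```python
def process_media_item(item, media_type, is_supported=lambda url: False):
    """Turn an extracted link or audio entry into a media dict, or None.

    is_supported is a hypothetical callable standing in for the yt-dlp
    support check; it takes a URL and returns True or False.
    """
    url = item[media_type]
    if media_type == "link":
        if is_supported(url):
            # The "ytdl:" prefix hands the URL over to yt-dlp.
            return {"url": "ytdl:" + url, "type": media_type}
        return None  # unsupported links are dropped
    if media_type == "audio":
        # Embedded audio is a direct file URL, so it is kept as-is.
        return {"url": url, "type": media_type}
    return None
```

Returning None for unsupported links is what lets the caller skip them with a plain truthiness check before appending to media.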
"link": True, | ||
"audio": True, | ||
} | ||
media = self._extract_media(html, media_types) |
We have rewritten the function, splitting it into multiple chunks that each handle one media type, instead of having one big function.
So one VA artist I sub to was pwned from Patreon and moved to SubscribeStar. Having no experience with SubscribeStar, I searched for a scraper, starting with my faithful gallery-dl. However, it seems that the SubscribeStar backend didn't handle audio content embedded in posts, which is quite problematic when scraping a VA artist. Moreover, this VA artist also embedded Google Drive links to some audio files, and I am too lazy to download them manually. I'd rather spend an evening rewriting the backend instead. It works for the one artist I'm subbed to, but I have no other reference point to check against. I don't know, for example, whether gallery-type content can contain both videos and pictures. Anyway, it works well so far.
259842f
to
e5e752d
Compare