Scraper Task Definition #33

Open
dennyabrain opened this issue Oct 27, 2020 · 1 comment

Comments

@dennyabrain

dennyabrain commented Oct 27, 2020

User Story:

Different Tattle contributors periodically upload their chat backups to a designated folder on a Google Drive owned by Tattle. Tattle admins should then be able to run a script that downloads the contents of this drive and transforms them into the desired structure (described under Objective).

Background:

A WhatsApp group chat’s content can be backed up to Google Drive. Each backup is stored in a folder that has the same name as the WhatsApp group (a Tattle team member enforces this). The folder contains:

  1. a .txt file containing a timestamped stream of WhatsApp messages, AND/OR
  2. the image and video files that were part of the group chat

Objective:

Obtain the data for every WhatsApp group in a structured form (JSON preferred) so that it can be stored in MongoDB. Each record in this structured file should contain:

  1. the timestamp of the message,
  2. the content of the message:
    1. if the message is a text message, a string containing that text
    2. if the message is an image or video, the path to the file on your local machine
  3. an anonymized sender id (to obfuscate the sender’s phone number)
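To make the target structure concrete, here is a minimal sketch of turning one line of the exported .txt file into such a record. It rests on several assumptions that are not stated in this issue: that a line looks like "27/10/20, 10:15 pm - Sender: text" with a day-first date, and that media messages end with " (file attached)". Real exports vary by platform and locale, so the pattern will likely need adjusting. The names parse_line, LINE_RE and ATTACHMENT_SUFFIX, the field names, and the extra "type" field are all hypothetical.

```python
import json
import os
import re
from datetime import datetime

# ASSUMPTION: one export line looks like
#   "27/10/20, 10:15 pm - Sender Name: message text"
# Real exports differ by platform and locale; adjust the pattern as needed.
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{2}), (?P<time>\d{1,2}:\d{2}) (?P<ampm>[ap]m) - "
    r"(?P<sender>[^:]+): (?P<body>.*)$",
    re.IGNORECASE,
)

# ASSUMPTION: media messages are written as "<filename> (file attached)".
ATTACHMENT_SUFFIX = " (file attached)"


def parse_line(line, media_dir, anonymize):
    """Return a record dict for one message line, or None if it does not match.

    Lines that do not match (system notices, continuation lines of a
    multi-line message) are skipped by this sketch.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None

    stamp = datetime.strptime(
        f"{match['date']} {match['time']} {match['ampm'].upper()}",
        "%d/%m/%y %I:%M %p",  # day-first date is an assumption
    )
    body = match["body"]
    if body.endswith(ATTACHMENT_SUFFIX):
        filename = body[: -len(ATTACHMENT_SUFFIX)]
        kind, content = "media", os.path.join(media_dir, filename)
    else:
        kind, content = "text", body

    return {
        "timestamp": stamp.isoformat(),
        "sender_id": anonymize(match["sender"]),
        "type": kind,  # hypothetical helper field, not part of the issue
        "content": content,
    }


if __name__ == "__main__":
    # Placeholder anonymizer for the demo only; it is not real obfuscation.
    record = parse_line(
        "27/10/20, 10:15 pm - +91 98765 43210: Hello there",
        "media",
        lambda sender: "user-1",
    )
    print(json.dumps(record, indent=2))
```

The anonymize argument is passed in as a callable so that the parsing step does not depend on how sender ids are obfuscated.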

Current Progress:

I encourage you to read about the various authentication methods that Google offers to programmatically access their services (Drive in our case). In my research, I tried out a few and moved ahead with something that they call Service Accounts.

Check out the functions getFilesInThisFolder(), getFoldersInThisFolder() and getFolderFromDriveByName() here.
They contain some examples of how to GET directory and file information from Google Drive. Hopefully the parameters sent to the drive.files.list() function in my code will serve as documentation of the Google Drive API and save you some time.
You will also find authentication-related code in that file that might be helpful. In my understanding, the challenge with Google Drive has been figuring out the right authentication mechanism for your task. Once that's done, the process of actually fetching data from Google Drive is always the same.
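As a rough illustration of what such a files.list request carries, the sketch below only builds the query parameters and makes no network call. The parameter names q, fields, pageSize and pageToken belong to the Drive v3 files.list endpoint; the helper names list_params and _quote are hypothetical and are not the functions from the linked file. With the Python client, a dict like this would typically be passed as service.files().list(**params).execute(), assuming an already authenticated service object.

```python
# MIME type Google Drive uses to mark a folder.
FOLDER_MIME = "application/vnd.google-apps.folder"


def _quote(value):
    """Quote a string for a Drive search query, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def list_params(parent_id, folders_only=False, name=None, page_token=None):
    """Build the keyword arguments for one files.list call (hypothetical helper).

    parent_id    -- id of the Drive folder whose children are wanted
    folders_only -- restrict results to sub-folders
    name         -- optional exact name to match (e.g. a WhatsApp group name)
    page_token   -- nextPageToken from the previous response, if any
    """
    clauses = [f"{_quote(parent_id)} in parents", "trashed = false"]
    if folders_only:
        clauses.append(f"mimeType = '{FOLDER_MIME}'")
    if name is not None:
        clauses.append(f"name = {_quote(name)}")

    params = {
        "q": " and ".join(clauses),
        # Ask only for the fields the scraper needs.
        "fields": "nextPageToken, files(id, name, mimeType)",
        "pageSize": 100,
    }
    if page_token:
        params["pageToken"] = page_token
    return params


if __name__ == "__main__":
    # "abc123" is a made-up folder id used only for illustration.
    print(list_params("abc123", folders_only=True, name="Tattle's Group"))
```

Results are paginated, so a caller would repeat the request with the returned nextPageToken until the response no longer contains one.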

You'll also see a reference to a file named '/whatsapp-scraper-668a815fc26f.json'. This was generated for the service account under Tattle's Gmail account. We can send it to you in case you just want to try it out.

The code for obfuscating phone numbers is here

@tarunima

The detailed schema:

  • the timestamp of the message,
  • the content of the message:
    1. if the message is a text message, a string containing that text
    2. if the message is an image or video, the path to the file on your local machine
  • an anonymized sender id (to obfuscate the sender’s phone number). This id is unique within a file and does not persist across files.
  • the group name, based on the file name
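One possible way to get sender ids that are stable within a file but not across files is to key a hash with a random secret created per file. The sketch below is an assumption for illustration and is not the obfuscation code linked earlier in this issue; the class name FileScopedAnonymizer and the 16-character id length are hypothetical choices.

```python
import hashlib
import hmac
import secrets


class FileScopedAnonymizer:
    """Maps senders to opaque ids that are only meaningful within one file.

    A fresh random key is drawn per instance, so creating one instance per
    chat export gives ids that are consistent inside that file and
    unrelated across files.
    """

    def __init__(self):
        # The key is never stored, so ids cannot be recomputed later.
        self._key = secrets.token_bytes(32)

    def sender_id(self, sender):
        # Normalize phone numbers so "+91 98765 43210" and "+919876543210"
        # map to the same id; fall back to the raw name if there are no digits.
        digits = "".join(ch for ch in sender if ch.isdigit())
        normalized = digits or sender.strip()
        digest = hmac.new(self._key, normalized.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]


if __name__ == "__main__":
    file_a = FileScopedAnonymizer()
    file_b = FileScopedAnonymizer()
    # Same sender, same file: identical ids.
    print(file_a.sender_id("+91 98765 43210") == file_a.sender_id("+919876543210"))
    # Same sender, different file: ids differ because the keys differ.
    print(file_a.sender_id("+91 98765 43210") == file_b.sender_id("+91 98765 43210"))
```

Because the key is discarded after processing, this design cannot link a sender across files on its own; any such linking would have to happen elsewhere, using the original phone numbers.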

The primary database doesn't have any automated linkages between messages, but in a second production database we can connect messages based on timestamp and phone number.
