Planet blog feed analysis #4

Open
shakthimaan opened this issue Jan 29, 2018 · 3 comments

Comments

@shakthimaan

A report is required for dgplug students and planet.dgplug.org to answer:

  1. The number of posts per month
  2. The interval between posts per user
  3. The users who have not posted for over a month
  4. The number of words per blog post per user

If the information can be fed into a database periodically using an application container, a Grafana dashboard can be built on top of it.
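
As a rough illustration of the four questions above, here is a minimal sketch in plain Python. The record layout (author, publication date, word count), the sample data, and all function names are assumptions made for this example; they are not taken from the planet setup, and "over a month" is approximated as more than 30 days.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical records: (author, published datetime, word count).
POSTS = [
    ("alice", datetime(2018, 1, 3), 420),
    ("alice", datetime(2018, 1, 17), 380),
    ("bob", datetime(2017, 11, 20), 900),
    ("bob", datetime(2018, 1, 25), 650),
]


def posts_per_month(posts):
    """Count posts per calendar month, keyed as 'YYYY-MM'."""
    return Counter(published.strftime("%Y-%m") for _, published, _ in posts)


def intervals_per_user(posts):
    """Days between consecutive posts, per author."""
    by_author = defaultdict(list)
    for author, published, _ in posts:
        by_author[author].append(published)
    result = {}
    for author, dates in by_author.items():
        dates.sort()
        result[author] = [(b - a).days for a, b in zip(dates, dates[1:])]
    return result


def inactive_users(posts, now, limit=timedelta(days=30)):
    """Authors whose most recent post is older than `limit` at time `now`."""
    latest = {}
    for author, published, _ in posts:
        if author not in latest or published > latest[author]:
            latest[author] = published
    return sorted(a for a, last in latest.items() if now - last > limit)


def mean_words_per_user(posts):
    """Average word count per post, per author."""
    words = defaultdict(list)
    for author, _, count in posts:
        words[author].append(count)
    return {author: mean(counts) for author, counts in words.items()}
```

With the records stored in a database instead, each of these would likely become a grouped query that a Grafana panel could run directly.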

@farhaanbukhsh
Member

This seems really interesting. I have a little experience with Grafana, so let me do a setup and let's see how we can better visualize it.

@farhaanbukhsh
Member

farhaanbukhsh commented Feb 2, 2018

So I tried setting up Grafana and was able to do this with the Docker image that Grafana provides. I am thinking of using feedparser and giving it the GitHub raw URLs of the planet pages [1] and [2]. For now, I am thinking we could run this script as a cron job and generate the data. I have not explored the data source part, but I feel a simple MySQL or Postgres can do it. What I really loved and would like to use here, though, is InfluxDB [3].

Once the data is captured, performing queries over it should not be very difficult. My only concern is finding a neat way to get the data for each blog and populate it in InfluxDB. This should be done incrementally: for example, if a new blog post appears, I don't want all the information again, just the new post.

I am thinking about writing a service which can listen for such events. Frankly, with Grafana I feel the visualization is taken care of; the data collection part is the challenge here.
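
One possible way to pick up only new posts on each cron run, without a listening service, is to remember the newest timestamp already processed for each blog. The sketch below assumes entries have been flattened into mappings with a hypothetical `blog` key and an `updated_parsed` value (a UTC `time.struct_time`, the form feedparser provides for parsed dates). The state file name and the function names are made up for illustration.

```python
import calendar
import json
import time
from pathlib import Path

STATE_FILE = Path("planet_state.json")  # hypothetical location for the marks


def load_marks(path=STATE_FILE):
    """Load {blog: newest epoch seconds seen}; empty on the first run."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_marks(marks, path=STATE_FILE):
    """Persist the marks so the next cron run can continue from them."""
    path.write_text(json.dumps(marks))


def pick_new(entries, marks):
    """Return (entries newer than their blog's mark, updated marks).

    Each entry is assumed to be a mapping with 'blog' and 'updated_parsed'.
    """
    new_marks = dict(marks)
    fresh = []
    for entry in entries:
        stamp = calendar.timegm(entry["updated_parsed"])  # UTC struct_time -> epoch
        if stamp > marks.get(entry["blog"], 0):
            fresh.append(entry)
            new_marks[entry["blog"]] = max(new_marks.get(entry["blog"], 0), stamp)
    return fresh, new_marks
```

A timestamp mark alone can miss two posts that share the same second, so it is probably best combined with a check on the unique post id at insert time.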

@Schubisu
Member

Schubisu commented Feb 9, 2018

@farhaanbukhsh I'm not sure I understand that correctly.
When using feedparser, to answer the questions from @shakthimaan above, imho you would need to save the following fields:

  • ['source']['id'] -> the blog identifier (since the author field may be empty)
  • ['id'] -> unique blog post identifier
  • a word count
  • ['updated'] -> creation or last update date

If you check your db for the unique post id before inserting data, you're not going to have duplicates. We could also discuss linking multiple blogs of a single author, as this special case might occur more often. This would, however, require some manual editing of the db.
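
To make the four fields and the duplicate check concrete, here is a minimal sketch using SQLite from the Python standard library. It works on feedparser-style entries (dict-like objects) but does not call feedparser itself. Instead of a separate lookup before each insert, it lets a primary key on the post id plus `INSERT OR IGNORE` reject duplicates, which is a variation on the idea above. The table layout, the function names, and the use of the `summary` field for the word count are assumptions.

```python
import re
import sqlite3


def word_count(html):
    """Rough count: drop anything that looks like a tag, then split on whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return len(text.split())


def to_row(entry):
    """Map a feedparser-style entry to (post id, blog id, updated, words)."""
    return (
        entry["id"],
        entry["source"]["id"],
        entry["updated"],
        word_count(entry.get("summary", "")),  # assumption: body text is in 'summary'
    )


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) posts table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " post_id TEXT PRIMARY KEY,"
        " blog_id TEXT NOT NULL,"
        " updated TEXT NOT NULL,"
        " words INTEGER NOT NULL)"
    )
    return conn


def store(conn, entries):
    """Insert entries, skipping known post ids; return how many rows were added."""
    before = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?)",
            [to_row(e) for e in entries],
        )
    after = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return after - before
```

Note that `INSERT OR IGNORE` keeps the first version of a post; if edited posts should refresh the word count, an update on conflict would be needed instead.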

3 participants