
Extract information about shared links (and media) #9

Open
tmaiaroto opened this issue Jul 12, 2014 · 4 comments

tmaiaroto (Member) commented Jul 12, 2014

Virality Score has a whole Node.js crawler to do this, and it even extracts entities. Social Harvest needs a tool like this too, but in Go of course.

I stumbled upon this: https://github.com/advancedlogic/GoOse

It appears to do some of what I had done in Node.js, though in a more basic way. It sounds like a good starting point.

The information extracted from shared links can tell us a lot about what exactly is being shared. It contains meatier information to run through various filters to get a sense of topics, etc. Links that do end up being crawled for additional information need to exist only once in the data set: while a link may have been shared a thousand times, we only need the summary/extracted/semantic/meta data from it stored once.
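As a sketch of the "store once" idea, here is a minimal Go illustration. The `LinkSummary` fields and the normalization rules are assumptions for illustration, not Social Harvest's actual schema:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// LinkSummary holds the metadata we'd extract once per unique shared link.
// Field names are illustrative, not Social Harvest's actual schema.
type LinkSummary struct {
	URL         string
	Title       string
	Description string
}

// normalizeURL collapses trivially different forms of the same link
// (host case, fragment, trailing slash) so a link shared a thousand
// times maps to a single key in the data set.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = "" // fragments don't change the fetched document
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

func main() {
	store := map[string]LinkSummary{} // stand-in for whatever database ends up being used

	for _, raw := range []string{
		"HTTP://Example.com/article/",
		"http://example.com/article#comments",
	} {
		key, err := normalizeURL(raw)
		if err != nil {
			continue
		}
		if _, done := store[key]; done {
			continue // already summarized; skip the crawl entirely
		}
		store[key] = LinkSummary{URL: key, Title: "(extracted title)"}
	}
	fmt.Println(len(store)) // both raw forms collapse to one entry
}
```

The normalization step matters because the same article tends to arrive in slightly different URL forms across shares.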

Think about how to reduce duplicate lookups, though. Once data is extracted, we don't want to make another 1,000 HTTP requests for data that's already in the database. The challenge here is that we don't know which database will be used. Perhaps an OK solution for now is to create a log file listing what was already discovered by the particular harvester. This would mean a different harvester could still request a duplicate URL, but at least that's better than making thousands of needless requests.

This is quite easy to get and store, but being efficient about it is going to require some thinking.

@tmaiaroto tmaiaroto added the task label Jul 12, 2014

harikt commented Jul 12, 2014

There is also Python goose: https://github.com/grangier/python-goose/

tmaiaroto (Member, Author) commented:

Cool, good to know. I don't have any of my Go code checked in yet (it's quite messy), but I will soon. I am currently grabbing and storing data from Facebook with great success, so I have a bunch of shared links coming through that I'll want to get a little more insight on. That said, given Social Harvest is being ported to Go (from PHP and Node.js), I'm looking to keep tools like this in Go. That is not to say someone else couldn't use Python to filter the data that ultimately ends up being stored.

One of the goals of Social Harvest is to allow users to use whatever programming language they want. Any database for that matter too.


harikt commented Jul 12, 2014

@tmaiaroto interesting. I would suggest making things available as a service.

I am not sure whether that is the right approach or not, but I feel SOA is nice:

Application -> Queues -> Service (this way the service can be written in whatever language fits best).

We can get the output in JSON or some other format. Looking forward to seeing your code when it's pushed. Watching closely.

tmaiaroto (Member, Author) commented:

For the most part I do plan to work like that, though I was thinking about streams and piping data through a set of filters. Basically, with fluentd I'm tailing a bunch of log files for all the different types of data. This data, before it hits those log files, should be able to be streamed and filtered. I will likely have to think about a queue, though.

Any filter applied to the data outside of the core server app will indeed get the data in JSON, line by line.
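A minimal sketch of that line-by-line filtering, assuming one JSON record per line and a hypothetical `network` field:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// filterLines reads newline-delimited JSON records and keeps only those
// whose "network" field matches. An external filter in any other language
// would do the same thing, since the contract is just one JSON object per line.
func filterLines(input string, network string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		var rec map[string]interface{}
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			continue // skip malformed lines rather than aborting the stream
		}
		if rec["network"] == network {
			out = append(out, sc.Text())
		}
	}
	return out
}

func main() {
	stream := `{"network":"facebook","message":"shared a link"}
{"network":"twitter","message":"tweeted"}
{"network":"facebook","message":"another share"}`
	for _, line := range filterLines(stream, "facebook") {
		fmt.Println(line)
	}
}
```

In practice the input would come from a pipe or tailed log file rather than a string, but the per-line JSON handling is identical.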

I need to pay extra careful attention to this process, so I expect it won't be completed as fast as some other things. But one thing that can be done immediately (and always) is filtering via Fluentd.

See: http://docs.fluentd.org/articles/filter-modify-apache
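For reference, a Fluentd filter in current syntax looks roughly like this; this is a hedged sketch using the stock record_transformer plugin, with a hypothetical tag and field (the 2014 article linked above may rely on older plugins):

```
<filter harvest.shared_links>
  @type record_transformer
  <record>
    # add a field to every record before it reaches its log file / output
    harvester "social-harvest-1"
  </record>
</filter>
```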
