Project: GreySec Data Scraper Program
#1
I had the idea to make a data scraper specifically for GreySec. I'm writing it in Python, so if anyone knows Python, DM me or reply to this thread. I could use the help (within the first 15 minutes I found out that parsing the user profile pages is a pain in the butt). For those of you who don't know, a scraper pulls information from a data source. In this case, the scraper will collect information from the GreySec website: user profile statistics and information, thread statistics, etc. Even if you can't code in Python, there may be a way you can help. Just let me know.
#2
(04-26-2020, 04:54 AM)Dismal_0x8 Wrote: I had the idea to make a data scraper specifically for GreySec. I'm writing it in Python, so if anyone knows Python, DM me or reply to this thread. I could use the help (within the first 15 minutes I found out that parsing the user profile pages is a pain in the butt). For those of you who don't know, a scraper pulls information from a data source. In this case, the scraper will collect information from the GreySec website: user profile statistics and information, thread statistics, etc. Even if you can't code in Python, there may be a way you can help. Just let me know.

This was an old project that was more or less abandoned when it reached the machine learning stage.
At that time, the spam filter was not good enough, and the scraper was used to collect a lot of data to feed into the model later.
The scraper works.
https://github.com/memoriasIT/MyBB-ML-Sp...inktext.py

This one was another project based on a scraper. It can get data from PMs and recent threads, and the spam filter project was planned to be integrated here too.
I don't remember if SMTP notifications were ever set up, but that was the goal: have it running in the background and receive notifications on your phone/mail.
https://github.com/memoriasIT/MyBB-Automation

It would be cool to scrape user profile statistics too and create graphs with the data :)
#3
enmafia2 Wrote: This was an old project that was more or less abandoned when it reached the machine learning stage.
At that time, the spam filter was not good enough, and the scraper was used to collect a lot of data to feed into the model later.
The scraper works.

Cool. I have something to go off of now. Thanks for letting me know.
#4
Cool idea! Just let me know which account the bot will run on before you put it up :) I only allow approved bots.

I really need to go through the site. We have a lot of bots. I added a robots.txt to disallow some aggressive Chinese bots, and daily idle users went down from 70 to 40. I also just banned two bots that had been lurking here for a year or so. It's nice to see things like this, statistics and stuff, but I have mixed feelings about threat intelligence firms archiving our whole site (which I know some firms have done).
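
For anyone writing a scraper against the site, here is a minimal sketch of checking robots.txt rules before fetching a page. It uses Python's standard `urllib.robotparser`. The rules, the `BadBot` and `GreySecScraper` user agents, and the `example.com` URLs are all made up for illustration; the site's real robots.txt will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the site's actual robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the parsed rules."""
    return parser.can_fetch(user_agent, url)
```

A scraper could call `is_allowed(...)` before every request and skip any URL that comes back `False`. Against a live site, `RobotFileParser.set_url()` followed by `read()` would download the file instead of parsing a string.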
#5
Yeah, of course I'd tell you first. I was actually thinking of putting some form of rate limiting in the bot so you wouldn't have 50 people making 10 requests a second. I wasn't planning on making a separate account for the scraper, though. I was thinking users would just use their own accounts.
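
A minimal sketch of what that rate limiting could look like, assuming a simple "at most one request every N seconds" policy per user. The `RateLimiter` class and its parameters are hypothetical names for illustration, not part of any existing scraper code.

```python
import time


class RateLimiter:
    """Allow at most one call per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first

    def wait(self):
        """Block until `min_interval` seconds have passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

The scraper would create one limiter, for example `RateLimiter(min_interval=2.0)`, and call `limiter.wait()` right before each HTTP request. This only throttles a single running instance; it does nothing about many users running the tool at the same time.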
#6
As for actually doing this, you could check out Python libraries like Selenium. Very useful for web automation.
#7
I've heard of Selenium but never used it. At the moment I'm using requests and BeautifulSoup. I've been working with APIs a lot lately, so having to pick what I want out of HTML is more challenging lol
#8
(04-28-2020, 02:29 PM)Dismal_0x8 Wrote: I've heard of Selenium but never used it. At the moment I'm using requests and BeautifulSoup. I've been working with APIs a lot lately, so having to pick what I want out of HTML is more challenging lol

Selenium is pretty good, especially when you are parsing data from websites that require human interaction. I usually go the route you are taking (requests and BeautifulSoup), as it is easier and faster in my opinion (and doesn't require keeping a browser driver open).

When you want to work with HTML, using the browser's inspect element tool to find the element and then selecting it with a CSS selector usually works really well.
I might try to write something about this, as I devote a lot of my time to writing stupid scrapers and could share a couple of tips :)
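
A rough sketch of the CSS selector approach with BeautifulSoup. The markup, the class names (`profile-stats`, `label`, `value`), and the `extract_stats` helper are made up for illustration; the real profile page has its own structure, so the selectors would have to be copied from inspect element.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real profile page is structured differently.
SAMPLE_HTML = """
<table class="profile-stats">
  <tr><td class="label">Total Posts:</td><td class="value">128</td></tr>
  <tr><td class="label">Total Threads:</td><td class="value">12</td></tr>
</table>
"""


def extract_stats(html):
    """Map each label cell to its value cell using CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    stats = {}
    for row in soup.select("table.profile-stats tr"):
        label = row.select_one("td.label")
        value = row.select_one("td.value")
        # Skip rows that do not have both cells.
        if label is not None and value is not None:
            key = label.get_text(strip=True).rstrip(":")
            stats[key] = value.get_text(strip=True)
    return stats
```

In a real run, the HTML would come from something like `requests.get(profile_url).text` instead of a hard-coded string, where `profile_url` is whatever profile page you are allowed to fetch.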
#9
Personally I enjoy using Selenium. The downside is that Selenium is somewhat slower, since we are obliged to pass instructions through Geckodriver or Chromedriver (depending on your preference). The upside is that you can do everything a browser does and more, plus have it automated of course. It comes down to a matter of preference.

May I ask what sort of data you are looking to scrape specifically? And out of interest, once you have the data you are looking for, will you be performing some kind of analysis on it? Linguistic, statistical or otherwise?

Depending on the nature of the project, we could collaborate within the purview of GS Devs, with a dedicated thread and a GS Dev repo over at the official GitHub page.

Just some thoughts, let me know what you think.
#10
(04-30-2020, 12:48 PM)Vector Wrote: Personally I enjoy using Selenium. The downside is that Selenium is somewhat slower, since we are obliged to pass instructions through Geckodriver or Chromedriver (depending on your preference). The upside is that you can do everything a browser does and more, plus have it automated of course. It comes down to a matter of preference.

May I ask what sort of data you are looking to scrape specifically? And out of interest, once you have the data you are looking for, will you be performing some kind of analysis on it? Linguistic, statistical or otherwise?

Depending on the nature of the project, we could collaborate within the purview of GS Devs, with a dedicated thread and a GS Dev repo over at the official GitHub page.

Just some thoughts, let me know what you think.

I'm looking to scrape the information you can get from visiting a user profile page to start out. From there it can branch out to threads and posts from users. I'll mostly be doing statistical analysis. No linguistic analysis is planned at the moment. This project is somewhat on hold right now, since I'm working on something else for the thread contest.