A Python-based web scraper that collects GitHub developer information, their followers, and repository details using Selenium and stores the data in a MySQL database.
- Scrapes trending developers across multiple programming languages
- Collects follower information (up to 1000 per developer)
- Gathers repository details including name, URL, description, language, stars, and forks
- Supports authentication via cookies or username/password
- Stores data in a MySQL database with automatic schema creation
- Includes error handling and logging
- Follows clean architecture principles
github-toolkit/
├── config/
│ └── settings.py # Configuration and environment variables
├── core/
│ ├── entities.py # Domain entities
│ └── exceptions.py # Custom exceptions
├── infrastructure/
│ ├── database/ # Database-related code
│ │ ├── connection.py
│ │ └── models.py
│ └── auth/ # Authentication service
│ └── auth_service.py
├── services/
│ └── scraping/ # Scraping services
│ ├── github_developer_scraper.py
│ └── github_repo_scraper.py
├── utils/
│ └── helpers.py # Utility functions
├── controllers/
│ └── github_scraper_controller.py # Main controller
├── main.py # Entry point
└── README.md
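To illustrate the layout above, the domain entities in `core/entities.py` could be simple dataclasses. This is a sketch only; the actual field names are an assumption based on the database columns listed later in this README.

```python
# core/entities.py -- illustrative sketch; the real fields may differ
from dataclasses import dataclass
from typing import Optional


@dataclass
class Developer:
    """A GitHub developer or follower discovered while scraping."""
    username: str
    profile_url: str


@dataclass
class Repository:
    """A repository scraped from a developer's profile."""
    username: str
    repo_name: str
    repo_url: str
    repo_intro: Optional[str] = None
    repo_lang: Optional[str] = None
    repo_stars: int = 0
    repo_forks: int = 0
```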
- Python 3.8+
- MySQL database
- Chrome browser
- Chrome WebDriver
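If you use Selenium 4.6 or newer, Selenium Manager can resolve the Chrome WebDriver automatically; otherwise the driver binary must be on your PATH. The standalone snippet below (not part of this project's code) is a quick way to confirm the browser stack works:

```python
# sanity check for the Selenium + Chrome setup (standalone snippet)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run without opening a visible window

driver = webdriver.Chrome(options=options)  # Selenium Manager fetches the driver if needed
try:
    driver.get("https://github.com")
    print(driver.title)
finally:
    driver.quit()
```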
- Clone the repository:
git clone https://github.com/yourusername/github-scraper.git
cd github-scraper
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a `.env` file in the root directory with the following variables (a loading sketch follows the list):
GITHUB_USERNAME=your_username
GITHUB_PASSWORD=your_password
DB_USERNAME=your_db_username
DB_PASSWORD=your_db_password
DB_HOST=your_db_host
DB_NAME=your_db_name
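For reference, `config/settings.py` presumably reads these variables with `python-dotenv`. The sketch below is an assumption about how that could look, not a copy of the actual file:

```python
# config/settings.py -- hypothetical sketch of loading the .env values
import os
from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file in the project root

GITHUB_USERNAME = os.getenv("GITHUB_USERNAME")
GITHUB_PASSWORD = os.getenv("GITHUB_PASSWORD")

DB_USERNAME = os.getenv("DB_USERNAME")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")

# SQLAlchemy connection string for the MySQL database
# (assumes a MySQL driver such as PyMySQL is installed; it is not listed in requirements.txt)
DATABASE_URL = f"mysql+pymysql://{DB_USERNAME}:{DB_PASSWORD}@{DB_HOST}/{DB_NAME}"
```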
- Create a `config` directory:
mkdir config
- Create a `requirements.txt` file with:
selenium
sqlalchemy
python-dotenv
Run the scraper:
python main.py
The scraper will:
- Authenticate with GitHub
- Scrape trending developers for specified languages
- Collect their followers (up to 1000 per developer)
- Scrape their repositories
- Store all data in the MySQL database
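The outline below shows how `main.py` might drive that flow. The controller class and method names are assumptions inferred from the file layout, not the project's actual API:

```python
# main.py -- hypothetical outline of the entry point
import logging

from controllers.github_scraper_controller import GithubScraperController  # assumed class name
from config import settings

logging.basicConfig(level=logging.INFO)


def main() -> None:
    controller = GithubScraperController()
    try:
        # authenticate, scrape trending developers and their followers,
        # scrape their repositories, and persist everything to MySQL
        controller.run(languages=settings.LANGUAGES)  # hypothetical method
    finally:
        controller.shutdown()  # close the browser instance gracefully


if __name__ == "__main__":
    main()
```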
- Modify `config/settings.py` to change (an example sketch follows this list):
  - `LANGUAGES`: list of programming languages to scrape
  - `USE_COOKIE`: toggle between cookie-based and credential-based authentication
- Adjust sleep times in the scraping services if needed for rate limiting
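The option names `LANGUAGES` and `USE_COOKIE` come from the list above; `SLEEP_SECONDS` and the example values are hypothetical:

```python
# config/settings.py (continued) -- example configuration flags
LANGUAGES = ["python", "javascript", "go", "rust"]  # trending pages to scrape
USE_COOKIE = True   # True: reuse saved cookies; False: log in with username/password
SLEEP_SECONDS = 2   # hypothetical pause between page loads for rate limiting
```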
Developers table:
- id (PK)
- username (unique)
- profile_url
- created_at
- updated_at
- published_at
Repositories table:
- id (PK)
- username
- repo_name
- repo_intro
- repo_url (unique)
- repo_lang
- repo_stars
- repo_forks
- created_at
- updated_at
- published_at
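A sketch of how `infrastructure/database/models.py` could declare these tables with SQLAlchemy; the class and table names are assumptions, while the columns mirror the lists above:

```python
# infrastructure/database/models.py -- illustrative SQLAlchemy models
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Developer(Base):
    __tablename__ = "developers"  # assumed table name

    id = Column(Integer, primary_key=True)
    username = Column(String(255), unique=True, nullable=False)
    profile_url = Column(String(512))
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime, nullable=True)


class Repository(Base):
    __tablename__ = "repositories"  # assumed table name

    id = Column(Integer, primary_key=True)
    username = Column(String(255), index=True)
    repo_name = Column(String(255))
    repo_intro = Column(String(1024))
    repo_url = Column(String(512), unique=True)
    repo_lang = Column(String(64))
    repo_stars = Column(Integer, default=0)
    repo_forks = Column(Integer, default=0)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    published_at = Column(DateTime, nullable=True)
```

Calling `Base.metadata.create_all(engine)` on models like these would match the automatic schema creation mentioned in the feature list.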
- Custom exceptions for authentication, scraping, and database operations
- Logging configured at INFO level
- Graceful shutdown of browser instance
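The exception names below illustrate the kind of hierarchy `core/exceptions.py` might define; the actual classes may be named differently:

```python
# core/exceptions.py -- hypothetical custom exception hierarchy
class GithubScraperError(Exception):
    """Base class for all scraper errors."""


class AuthenticationError(GithubScraperError):
    """Raised when logging in with cookies or credentials fails."""


class ScrapingError(GithubScraperError):
    """Raised when a page cannot be parsed or an expected element is missing."""


class DatabaseError(GithubScraperError):
    """Raised when reading from or writing to MySQL fails."""
```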
- Fork the repository.
- Create a feature branch (`git checkout -b feature/your-feature`).
- Commit your changes (`git commit -m "Add your feature"`).
- Push to the branch (`git push origin feature/your-feature`).
- Open a pull request.
This project is licensed under the MIT License - see the LICENSE file for details (create one if needed).
- Built with Selenium, SQLAlchemy, and Python.
- Inspired by the need to automate GitHub data collection.