This repository collects and analyzes short-form videos about parenting timeout strategies and menopause-related supplements using automated Google Search scraping and AI-powered video analysis.
The project consists of two main components:
- Google Search Scraper (run_googlesearch.py) - Automatically scrapes Google search results for short videos
- Video Analysis (batch_LLM.py) - Processes videos using a large language model to extract structured information
uv pip install -r requirements.txt
pip install -r requirements-googlesearch.txt
Required Python packages:
- pandas
- tqdm
- selenium
- undetected-chromedriver
System packages (for Tor support):
- tor
- torsocks
- chromium/chrome browser
This script scrapes Google search results for short videos related to:
- Timeout dataset: #parenting #timeout and #gentleparenting #timeout
- Supplements dataset: #menopause #supplements and #menopause #vitamins
# Normal execution (scrapes both datasets)
python3 run_googlesearch.py
# With Tor (if IP blocked)
torsocks python3 run_googlesearch.py
# Or use the --use-tor flag
python3 run_googlesearch.py --use-tor
- Searches Google for the specified hashtag combinations
- Clicks on "Short videos" filter
- Scrolls to load more results and clicks "More results" button
- Scrapes video metadata: link, duration, title, source, author
- Combines results with previous data
- Filters to include only Instagram, TikTok, YouTube, and Facebook videos
- Saves results to CSV and text files
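The combine-and-filter steps above can be sketched in a few lines. This is a minimal illustration, not the code in run_googlesearch.py: the function names, the domain list (including youtu.be), and the choice to deduplicate by link are assumptions, and the real script may use pandas instead of plain dictionaries.

```python
from urllib.parse import urlparse

# Assumed domain list; the real script may match platforms differently.
ALLOWED_DOMAINS = ("instagram.com", "tiktok.com", "youtube.com", "youtu.be", "facebook.com")


def is_allowed(link: str) -> bool:
    """Return True if the link's host is an allowed platform or one of its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def merge_results(previous: list[dict], new: list[dict]) -> list[dict]:
    """Combine old and new rows, keep allowed platforms only, deduplicate by link."""
    merged: dict[str, dict] = {}
    for row in previous + new:
        link = row.get("link", "")
        if link and is_allowed(link):
            merged[link] = row  # a later row replaces an earlier one with the same link
    return list(merged.values())
```

Matching on the parsed hostname rather than a substring avoids accepting look-alike domains such as notfacebook.com.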
Timeout dataset:
- timeout.csv - Full results with columns: link, duration, title, source, author
- timeout_links.txt - Just the links, one per line
Supplements dataset:
- supplements.csv - Full results with columns: link, duration, title, source, author
- supplements_links.txt - Just the links, one per line
The scraper includes automatic protection against IP blocking:
- First attempts to run without Tor
- If that fails (likely due to IP blocking), automatically retries with Tor/torsocks
- Tor provides anonymity and helps avoid rate limiting
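The fallback order described above could look roughly like the following. This is a sketch under assumptions: the README does not say whether the retry lives in the script or in the workflow, so the function names and the use of the exit code as the failure signal are hypothetical. It also uses the current interpreter (sys.executable) rather than a hard-coded python3.

```python
import subprocess
import sys
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def scrape_with_fallback(script: str = "run_googlesearch.py", runner: Runner = run_command) -> str:
    """Try a direct run first; if it exits non-zero, retry through torsocks."""
    base = [sys.executable, script]
    if runner(base) == 0:
        return "direct"
    # Assumption: a non-zero exit code means the direct attempt was blocked.
    if runner(["torsocks", *base]) == 0:
        return "tor"
    raise RuntimeError("scraper failed both directly and through Tor")
```

Passing the runner as a parameter keeps the retry logic testable without launching a browser or a Tor daemon.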
The repository includes a GitHub Actions workflow (.github/workflows/googlesearch.yml) that:
- Runs daily at 2 AM UTC (configurable via cron schedule)
- Can also be triggered manually via GitHub Actions UI
- Automatically commits updated CSV files back to the repository
To trigger manually:
- Go to the "Actions" tab in GitHub
- Select "Google Search Scraper" workflow
- Click "Run workflow"
To change the schedule, edit the cron expression in .github/workflows/googlesearch.yml:
schedule:
  - cron: '0 2 * * *' # Daily at 2 AM UTC
Common cron schedules:
- 0 */6 * * * - Every 6 hours
- 0 0 * * 0 - Weekly on Sunday at midnight
- 0 0 1 * * - Monthly on the 1st at midnight
This script processes downloaded videos using the Qwen3-Omni-30B-A3B-Instruct multimodal model to extract structured information.
- Downloaded videos (see "Downloading Videos" section below)
- GPU with sufficient VRAM (approximately 78GB required)
- transformers library and Qwen dependencies
# Process timeout dataset
python3 batch_LLM.py --dataset timeout
# Process supplements dataset
python3 batch_LLM.py --dataset supplements
- Scans the {dataset}_videos/ folder for video JSON metadata files
- Skips videos that have already been processed (result file exists)
- For each video:
- Loads video file and metadata
- Sends video and context to the LLM with a dataset-specific prompt
- Extracts structured information based on the dataset
- Saves results to {dataset}_results/ folder as JSON files
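The scan-and-skip behaviour can be illustrated with a short sketch. It is not taken from batch_LLM.py: the function name is hypothetical, and it assumes that metadata files end in .info.json (as written by yt-dlp) and that each result is saved under the same base name with a .json extension.

```python
from pathlib import Path


def pending_videos(dataset: str, root: Path = Path(".")) -> list[Path]:
    """List metadata files in {dataset}_videos/ that have no result file yet."""
    videos_dir = root / f"{dataset}_videos"
    results_dir = root / f"{dataset}_results"
    pending = []
    for meta in sorted(videos_dir.glob("*.info.json")):
        video_id = meta.name[: -len(".info.json")]
        # Assumed naming: one result per video, saved as <video_id>.json
        if not (results_dir / f"{video_id}.json").exists():
            pending.append(meta)
    return pending
```

Because the check only looks for an existing result file, an interrupted run can be restarted and will continue with the videos that are still missing.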
For timeout videos:
- Video description and transcript
- Tone and language
- Whether video discusses timeout as a parenting strategy
- Parenting approach shown
- Target child age range
- Speaker's profession
- Sentiment (positive/neutral/negative toward timeout)
- Criticisms of timeout
- Alternative strategies mentioned
- Relevance to ASD, ADHD, anxiety
- Usefulness, misleading content, and quality ratings
- Personal experiences shared
For supplements videos:
- Video description and transcript
- Tone and language
- Supplements, vitamins, or medications mentioned
- Active ingredients
- Symptoms or conditions addressed
- Whether targeted at menopause
- Speaker's profession
- Sentiment (positive/neutral/negative toward supplements)
- Criticisms of supplements
- Alternative strategies mentioned
- Usefulness, misleading content, and quality ratings
- Personal experiences shared
Results are saved as JSON files in:
- timeout_results/ - Analysis results for timeout videos
- supplements_results/ - Analysis results for supplements videos
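For downstream analysis, the per-video JSON files can be gathered into a single list of records. The sketch below assumes each result file holds one JSON object; the function name and the result_file key are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_results(dataset: str, root: Path = Path(".")) -> list[dict]:
    """Read every JSON file in {dataset}_results/ into a list of dicts."""
    rows = []
    for path in sorted((root / f"{dataset}_results").glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # skip files that are not valid JSON
        if isinstance(record, dict):
            record["result_file"] = path.name  # hypothetical key, kept for traceability
            rows.append(record)
    return rows
```

The returned list can be passed directly to pandas.DataFrame if a tabular view is wanted.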
To download videos from the collected links, use yt-dlp:
# Download timeout videos
yt-dlp --write-info-json --batch-file timeout_links.txt --paths timeout_videos
# Download supplements videos
yt-dlp --write-info-json --batch-file supplements_links.txt --paths supplements_videos
This downloads:
- Video files to {dataset}_videos/
- Metadata JSON files (.info.json) with video information
- run_googlesearch.py - Python script for Google search scraping (headless mode)
- batch_LLM.py - Video analysis script using Qwen3-Omni model
- googlesearch.ipynb - Original Jupyter notebook with scraping logic
- test_LLM.ipynb - Jupyter notebook for testing LLM analysis
- join_results.ipynb - Jupyter notebook for combining and analyzing results
- .github/workflows/googlesearch.yml - GitHub Actions workflow for automated scraping
- requirements.txt - Python dependencies for video analysis
- requirements-googlesearch.txt - Python dependencies for scraping
- timeout.csv / timeout_links.txt - Collected timeout video links and metadata
- supplements.csv / supplements_links.txt - Collected supplements video links and metadata
- timeout_LLM_results.xlsx - Analyzed results for timeout videos
- supplements_LLM_results.xlsx - Analyzed results for supplements videos
See LICENSE file for details.