🌍 Public Sentiment Collection Agent

AI-powered geographic sentiment analysis with credibility tracking and source diversity assessment

Python 3.9+ · Gemini 2.0 Flash · Tavily Search API · Pandas & NumPy · Multi-Agent System

📋 Project Overview & Problem Statement

Challenge: Traditional sentiment analysis tools lump all geographic regions together, producing misleading global averages that mask critical cultural and regional differences. For example, analyzing "public opinion on alcohol consumption" globally would show mixed sentiment, completely missing that Saudi Arabia has 95% negative sentiment (religious/cultural context) while Germany has 70% positive sentiment (beer culture).

Solution: Public Sentiment Collection Agent uses geographic segmentation, source diversity tracking, and credibility scoring to provide accurate, context-aware sentiment analysis. The system automatically detects data quality issues like echo chamber bias and single-domain concentration.

🚨 Why Geographic Filtering Matters

Example: "Public opinion on alcohol consumption"

WITHOUT geographic filtering: 60% negative, 30% neutral, 10% positive (MISLEADING - lumps all regions together)

WITH geographic filtering:

  • Saudi Arabia: 95% negative (religious/cultural context)
  • Germany: 70% positive (beer culture)
  • USA: 50/50 split (health concerns vs. social acceptance)

Key Benefits

🤖 AI Capabilities & 5-Agent Architecture

🌍 Geographic Social Listening Agent

Collects sentiment data with location-specific filtering using Tavily Search API. Tracks source diversity and issues quality warnings.

🧠 Comparative Sentiment Analysis Agent

Processes data separately for each location using Gemini AI. Calculates credibility scores: (Diversity × 0.6) + (Sample Size × 0.4)

📊 Comparative Visualization Designer Agent

Creates 4 professional charts: regional sentiment comparison, credibility dashboard, source diversity, and theme frequency.

💾 Data Export Agent

Exports 5 CSV files: sentiment distribution, emotion frequency, theme comparison, source attribution, and credibility metrics.
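As a sketch of what one of these exports could look like (not the project's actual code; the `export_sentiment_distribution` helper and its column names are assumptions), the sentiment-distribution CSV might be produced with pandas like so:

```python
import pandas as pd

def export_sentiment_distribution(results: dict, path: str) -> pd.DataFrame:
    """Flatten per-location sentiment percentages into one long-format CSV.

    `results` is assumed to map location -> {sentiment_label: percent}.
    """
    rows = [
        {"location": loc, "sentiment": sentiment, "percent": pct}
        for loc, dist in results.items()
        for sentiment, pct in dist.items()
    ]
    df = pd.DataFrame(rows)
    df.to_csv(path, index=False)
    return df

# Example with the regional numbers from the overview above:
df = export_sentiment_distribution(
    {"Saudi Arabia": {"negative": 95.0, "positive": 5.0},
     "Germany": {"positive": 70.0, "negative": 30.0}},
    "sentiment_distribution.csv",
)
```

Long format (one row per location/sentiment pair) keeps the CSV easy to pivot in Excel or Google Sheets, which matches the stated export target.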

📝 Enhanced Packaging Agent

Generates executive-ready markdown reports with embedded visualizations, data tables, and methodological limitations.

AI Processing Pipeline

🔍 Source Diversity & Credibility Features

Source Type Classification

The system automatically classifies each source into a type category (for example, social media vs. news or other domains) to feed the diversity metrics below.
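As an illustration only (the project's actual taxonomy and helper names are not shown here), a domain-based classifier might look like this; the category sets and the `classify_source` function are assumptions:

```python
from urllib.parse import urlparse

# Hypothetical category maps; the project's real taxonomy may differ.
SOCIAL_DOMAINS = {"reddit.com", "twitter.com", "x.com", "facebook.com"}
NEWS_DOMAINS = {"bbc.com", "reuters.com", "nytimes.com"}

def classify_source(url: str) -> str:
    """Return a coarse source-type label for a search-result URL."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in SOCIAL_DOMAINS:
        return "social_media"
    if host in NEWS_DOMAINS:
        return "news"
    return "other"

classify_source("https://www.reddit.com/r/germany/comments/abc")  # -> "social_media"
```

`str.removeprefix` requires Python 3.9+, which matches the project's stated minimum.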

Automatic Bias Warnings

The system issues warnings when data quality is compromised:

  ⚠️ Over 70% of sources are social media (potential echo chamber bias)
  ⚠️ 60% of sources from a single domain: reddit.com
  ⚠️ Only 4 unique sources (low diversity)
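The checks behind warnings like these can be sketched as follows. This is a minimal illustration, not the project's code: the exact thresholds (70% social media, majority single-domain, fewer than 5 unique sources) and the `bias_warnings` helper are assumptions inferred from the example messages above.

```python
from collections import Counter
from urllib.parse import urlparse

def bias_warnings(urls, source_types, min_unique=5):
    """Emit data-quality warnings mirroring the examples above."""
    warnings = []
    n = len(urls)
    # Echo-chamber check: share of social-media sources.
    social = sum(t == "social_media" for t in source_types)
    if n and social / n > 0.70:
        warnings.append(f"⚠️ Over 70% of sources are social media "
                        f"({social}/{n}; potential echo chamber bias)")
    # Concentration check: one domain dominating the sample.
    domains = Counter(urlparse(u).netloc.removeprefix("www.") for u in urls)
    if domains:
        top, count = domains.most_common(1)[0]
        if count / n >= 0.50:
            warnings.append(f"⚠️ {count / n:.0%} of sources from single domain: {top}")
    # Diversity check: too few unique domains overall.
    if len(domains) < min_unique:
        warnings.append(f"⚠️ Only {len(domains)} unique sources (low diversity)")
    return warnings
```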

Credibility Score Calculation

Credibility Score (0-100) = (Source Diversity Score × 0.6) + (Sample Size Score × 0.4)

  🟢 70-100: High confidence (diverse sources, adequate sample)
  🟡 50-69: Medium confidence (some limitations present)
  🔴 0-49: Low confidence (significant data quality concerns)
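The weighting and tiers above translate directly into code. A minimal sketch, assuming both sub-scores already arrive on a 0-100 scale (how each sub-score is computed is not specified here):

```python
def credibility_score(diversity_score: float, sample_size_score: float) -> tuple[float, str]:
    """Combine the two 0-100 sub-scores using the 0.6/0.4 weighting above."""
    score = diversity_score * 0.6 + sample_size_score * 0.4
    if score >= 70:
        tier = "🟢 High confidence"
    elif score >= 50:
        tier = "🟡 Medium confidence"
    else:
        tier = "🔴 Low confidence"
    return round(score, 1), tier

credibility_score(80, 60)  # -> (72.0, "🟢 High confidence")
```

Note how the 0.6 weight makes source diversity the dominant factor: a large but homogeneous sample still scores low.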

📊 Output Package (10 Files Per Analysis)

1 Markdown Report

4 Visualization Charts (PNG)

5 CSV Data Exports

🛠️ Technical Architecture & Implementation

AI & Analytics Stack

Google Gemini 2.0 Flash · Tavily Search API · Python 3.9+ · Pandas 2.0+ · Matplotlib · Seaborn

Multi-Agent Framework

5 Specialized Agents · Web Search Integration · NLP Sentiment Analysis · Data Quality Scoring · Auto Visualization

Deployment Options

Google Colab · Jupyter Notebook · Local Python · Streamlit (Optional)

System Architecture

Pipeline Flow:

  1. Geographic Listening → Web search with location filters
  2. Source Analysis → Diversity tracking & bias detection
  3. Sentiment Analysis → Gemini AI with cultural context
  4. Credibility Scoring → Quality assessment (0-100)
  5. Visualization → 4 professional charts
  6. Data Export → 5 CSV files for Excel/Sheets
  7. Report Packaging → Executive markdown report

📖 Development Setup & Usage Guide

Quick Start with Google Colab (Recommended)

  1. Open Colab Notebook: Click the "Launch in Google Colab" button above
  2. Add API Keys: Add GOOGLE_API_KEY and TAVILY_API_KEY to Colab Secrets (🔑 icon)
  3. Run Setup Cells: Install dependencies and configure APIs
  4. Run Analysis: Execute run_enhanced_sentiment_pipeline() with your topic and locations
  5. Download Results: Get markdown report, 4 charts, and 5 CSV files

Example Usage

```python
# Example: Analyze firecracker ban opinions across cultures
results = run_enhanced_sentiment_pipeline(
    issue_keyword="Should firecrackers and fireworks be banned?",
    locations=["Malaysia", "Germany", "USA", "India"],
    num_sources_per_location=15,
    output_dir=".",
)

# Output:
# - comparative_report_20251024_123456.md
# - regional_comparison_20251024_123456.png
# - credibility_dashboard_20251024_123456.png
# - source_diversity_20251024_123456.png
# - theme_frequency_20251024_123456.png
# - sentiment_distribution_20251024_123456.csv
# - emotion_frequency_20251024_123456.csv
# - theme_comparison_20251024_123456.csv
# - source_attribution_20251024_123456.csv
# - credibility_metrics_20251024_123456.csv
```

Required API Keys

  • GOOGLE_API_KEY (Gemini; add to Colab Secrets)
  • TAVILY_API_KEY (Tavily Search; add to Colab Secrets)

📊 Performance Metrics & Business Impact

  • Full Analysis Time: 10-15 min
  • Files Generated: 10
  • Specialized Agents: 5
  • Credibility Score: 0-100

Business Value Demonstration

Use Cases

⚠️ Limitations & Disclaimers

Data Collection Limitations

Geographic Filtering Challenges

Recommended Use

Good for: Directional insights, trend detection, hypothesis generation

⚠️ Caution for: Policy decisions, legal proceedings, precise measurement

Not for: Statistical inference about entire populations