📄 AI PDF Summarizer

Core Source Code & Document Processing Implementation

Python 3.9 · Streamlit · Gemini AI · PDF Processing

🔍 About This Code Showcase

This curated code snippet demonstrates how the AI PDF Summarizer extracts, processes, and intelligently summarizes complex PDF documents using advanced NLP techniques.

Full deployment scripts, API integrations, and proprietary details are omitted for clarity and security. This showcase highlights the core document processing and AI summarization algorithms.

📖 Core Algorithm: Document Intelligence Engine

The foundation of the AI PDF Summarizer is its ability to extract meaningful content from PDFs, understand document structure, and generate intelligent summaries tailored to user needs:

📄 document_processor.py
```python
import re
from typing import Dict, List, Optional

import google.generativeai as genai
import PyPDF2


class DocumentProcessor:
    """
    Advanced PDF processing engine that extracts, analyzes, and summarizes
    complex documents while preserving important context and structure.
    """

    def __init__(self, api_key: str):
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-pro')

        # Pre-configured summarization strategies for different document types
        self.summarization_strategies = {
            'research_paper': self._research_paper_strategy,
            'business_report': self._business_report_strategy,
            'technical_manual': self._technical_manual_strategy,
            'legal_document': self._legal_document_strategy,
            'general': self._general_strategy,
        }

    async def process_pdf_intelligent(self, pdf_file, summary_type: str = 'general') -> Dict:
        """
        Process a PDF with intelligent content analysis and context-aware
        summarization.

        Args:
            pdf_file: Uploaded PDF file object
            summary_type: Type of document for specialized processing

        Returns:
            Dictionary containing extracted text, document metadata, and AI summary
        """
        # Step 1: Extract raw content with structure preservation
        extracted_content = self._extract_structured_content(pdf_file)

        # Step 2: Analyze document characteristics and type
        document_analysis = self._analyze_document_structure(extracted_content)

        # Step 3: Apply intelligent chunking for large documents
        content_chunks = self._intelligent_chunking(extracted_content, document_analysis)

        # Step 4: Generate a context-aware summary using the appropriate strategy
        strategy = self.summarization_strategies.get(summary_type, self._general_strategy)
        summary = await strategy(content_chunks, document_analysis)

        return {
            'original_text': extracted_content['full_text'],
            'document_metadata': document_analysis,
            'summary': summary,
            'key_insights': summary['insights'],
            'processing_stats': self._generate_processing_stats(extracted_content, summary),
        }

    def _extract_structured_content(self, pdf_file) -> Dict:
        """
        Extract content while preserving document structure, headers, and
        formatting. This enables better understanding of document hierarchy
        and importance.
        """
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        content_structure = {
            'full_text': '',
            'pages': [],
            'headers': [],
            'sections': [],
            'metadata': pdf_reader.metadata,
        }

        for page_num, page in enumerate(pdf_reader.pages):
            page_text = page.extract_text()

            # Detect headers and section breaks using typography patterns
            headers = self._detect_headers(page_text)
            sections = self._identify_sections(page_text, headers)

            content_structure['pages'].append({
                'page_number': page_num + 1,
                'text': page_text,
                'headers': headers,
                'sections': sections,
            })
            content_structure['full_text'] += page_text + '\n\n'
            content_structure['headers'].extend(headers)
            content_structure['sections'].extend(sections)

        return content_structure

    def _intelligent_chunking(self, content: Dict, analysis: Dict) -> List[Dict]:
        """
        Smart content chunking that respects document structure and context.
        Prevents breaking related concepts across chunks for better AI processing.
        """
        chunks = []
        current_chunk = ''
        max_chunk_size = 4000  # Optimal size for AI processing

        for section in content['sections']:
            section_text = section['content']

            # If adding this section would exceed the limit, finalize the current chunk
            if len(current_chunk) + len(section_text) > max_chunk_size:
                if current_chunk:  # Don't add empty chunks
                    chunks.append({
                        'content': current_chunk,
                        'section_headers': self._extract_chunk_headers(current_chunk),
                        'importance_score': self._calculate_importance(current_chunk, analysis),
                    })
                current_chunk = section_text
            else:
                current_chunk = f'{current_chunk}\n\n{section_text}' if current_chunk else section_text

        # Add the final chunk
        if current_chunk:
            chunks.append({
                'content': current_chunk,
                'section_headers': self._extract_chunk_headers(current_chunk),
                'importance_score': self._calculate_importance(current_chunk, analysis),
            })

        # Sort chunks by importance for prioritized processing
        return sorted(chunks, key=lambda x: x['importance_score'], reverse=True)
```
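To see the chunking behavior in isolation, here is a minimal, dependency-free sketch of the greedy packing step. The 4000-character limit and the section dictionaries mirror the engine above; the importance score is a stand-in (chunk length), since `_calculate_importance` is not part of this showcase.

```python
from typing import Dict, List


def chunk_sections(sections: List[Dict], max_chunk_size: int = 4000) -> List[Dict]:
    """Greedily pack consecutive sections into chunks under max_chunk_size."""
    chunks: List[Dict] = []
    current = ''
    for section in sections:
        text = section['content']
        # Finalize the current chunk rather than split a section across chunks
        if len(current) + len(text) > max_chunk_size:
            if current:
                chunks.append({'content': current, 'importance_score': len(current)})
            current = text
        else:
            current = f'{current}\n\n{text}' if current else text
    if current:
        chunks.append({'content': current, 'importance_score': len(current)})
    # Highest-priority chunks first (here: longest), as in the engine
    return sorted(chunks, key=lambda c: c['importance_score'], reverse=True)


sections = [
    {'content': 'A' * 1500},
    {'content': 'B' * 1500},
    {'content': 'C' * 2000},  # would push the first chunk past 4000 chars
]
chunks = chunk_sections(sections)
print(len(chunks))  # → 2
```

The first two sections merge into one chunk (1500 + 2 + 1500 = 3002 characters); the third would exceed the limit, so it becomes its own chunk. Because related sections are never split mid-thought, each chunk stays coherent for the AI pass.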

🧠 Advanced Summarization Engine

The summarization engine uses specialized strategies for different document types, ensuring optimal results for research papers, business reports, and technical manuals:

📄 summarization_engine.py
```python
from typing import Dict, List


class SummarizationEngine:
    async def _research_paper_strategy(self, chunks: List[Dict], analysis: Dict) -> Dict:
        """
        Specialized summarization for academic research papers.
        Focuses on methodology, findings, and implications.
        """
        # Identify key academic sections
        abstract_chunk = self._find_section(chunks, ['abstract', 'summary'])
        methodology_chunk = self._find_section(chunks, ['methodology', 'methods', 'approach'])
        results_chunk = self._find_section(chunks, ['results', 'findings', 'outcomes'])
        conclusion_chunk = self._find_section(chunks, ['conclusion', 'discussion', 'implications'])

        # Generate a comprehensive research summary
        summary_prompt = f"""
        Analyze this research paper and provide a comprehensive academic summary:

        Document Analysis: {analysis}

        Focus on:
        1. Research question and objectives
        2. Methodology and experimental design
        3. Key findings and statistical significance
        4. Limitations and future research directions
        5. Practical implications and applications

        Provide citations to specific sections where possible.
        """

        research_summary = await self._generate_ai_summary(
            self._combine_priority_chunks(
                [abstract_chunk, methodology_chunk, results_chunk, conclusion_chunk]
            ),
            summary_prompt,
        )

        return {
            'summary_type': 'research_paper',
            'main_summary': research_summary,
            'insights': await self._extract_research_insights(chunks),
            'key_sections': {
                'methodology': methodology_chunk['content'][:500] if methodology_chunk else None,
                'findings': results_chunk['content'][:500] if results_chunk else None,
            },
        }

    async def _business_report_strategy(self, chunks: List[Dict], analysis: Dict) -> Dict:
        """
        Business-focused summarization emphasizing key metrics and actionable
        insights. Optimized for executive summaries and strategic decision-making.
        """
        # Identify business-critical sections
        executive_summary = self._find_section(chunks, ['executive summary', 'overview'])
        financial_data = self._find_section(chunks, ['financial', 'revenue', 'performance'])
        recommendations = self._find_section(chunks, ['recommendations', 'action items', 'next steps'])

        business_prompt = f"""
        Summarize this business document for executive review:

        Document Type: {analysis.get('document_type', 'Business Report')}

        Extract and highlight:
        1. Key performance indicators and metrics
        2. Strategic recommendations and action items
        3. Risk factors and opportunities
        4. Financial implications and ROI
        5. Timeline and implementation priorities

        Format for executive consumption with clear bullet points.
        """

        business_summary = await self._generate_ai_summary(
            self._combine_priority_chunks([executive_summary, financial_data, recommendations]),
            business_prompt,
        )

        return {
            'summary_type': 'business_report',
            'main_summary': business_summary,
            'insights': await self._extract_business_insights(chunks),
            'action_items': await self._extract_action_items(chunks),
            'metrics': await self._extract_key_metrics(chunks),
        }

    def _analyze_document_structure(self, content: Dict) -> Dict:
        """
        Analyze document characteristics to determine the optimal processing
        strategy. Uses pattern recognition to classify document type and complexity.
        """
        full_text = content['full_text'].lower()
        headers = [h.lower() for h in content['headers']]

        # Pattern matching for document type classification
        doc_type = 'general'
        if any(keyword in full_text for keyword in ['abstract', 'methodology', 'references', 'citation']):
            doc_type = 'research_paper'
        elif any(keyword in full_text for keyword in ['revenue', 'quarterly', 'executive summary', 'roi']):
            doc_type = 'business_report'
        elif any(keyword in full_text for keyword in ['installation', 'configuration', 'user manual', 'api']):
            doc_type = 'technical_manual'

        return {
            'document_type': doc_type,
            'page_count': len(content['pages']),
            'word_count': len(content['full_text'].split()),
            'complexity_score': self._calculate_complexity(content),
            'section_count': len(content['sections']),
            'has_structured_content': len(headers) > 3,
        }
```

⚙️ Technical Implementation Notes

Key Algorithms & Innovations

Why This Approach Works