
Topic Modelling Project

An end-to-end NLP pipeline designed to automate the categorization of large-scale, unstructured project reports. By implementing a Latent Dirichlet Allocation (LDA) model on World Bank Project Documents, this project replaces labor-intensive manual reviews with a scalable, data-driven thematic discovery system.

[Cover image: Topic_Modelling_Cover.png]

The Brief

Target Audience
Data Architects, Policy Analysts, and Research Leads.

My Role

  • Extracting and filtering English-language project documents from the World Bank dataset to ensure linguistic and semantic consistency.

  • Developing a rigorous 10-step text engineering pipeline using NLTK and SpaCy, including Unicode noise reduction and lemmatization.

  • Implementing the LDA model using Gensim to discover latent thematic structures across thousands of unstructured reports.

  • Executing a data-driven model selection approach by calculating and plotting Coherence ($C_v$) and Perplexity metrics to determine the optimal number of topics.

  • Architecting an interactive visualization dashboard using PyLDAvis to interpret topic prevalence and term saliency for executive reporting.
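The text-engineering steps above can be sketched as a single cleaning function. This is a simplified, dependency-free stand-in: the stop-word set and the example sentence are illustrative, and the real pipeline uses NLTK's stop-word corpus and SpaCy lemmatization, which are omitted here.

```python
import re
import unicodedata

# Small illustrative stop-word set; the real pipeline loads
# nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}

def preprocess(text: str) -> list[str]:
    """Simplified version of the cleaning pipeline (lemmatization omitted)."""
    # Unicode noise reduction: normalize and drop non-ASCII code points.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Strip HTML remnants, then non-alphabetic characters.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Lowercase, tokenize on whitespace, drop stop words and short tokens.
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(preprocess("<p>Th\u00e9 Bank's 2021 report on urban development.</p>"))
```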

The Problem

Organizations often struggle with massive volumes of unstructured text reports that are too large for manual categorization. Without a structured way to identify themes, critical insights regarding urban development, financial infrastructure, or environmental policy remain hidden, leading to inefficient resource allocation and missed research trends.

The Solution

I engineered an automated discovery pipeline that transforms raw text into machine-readable clusters. By replacing guesswork with an elbow-style analysis of Coherence and Perplexity curves, I identified the number of topics that maximized human interpretability. The final solution lets users interactively explore the intertopic distance map and assign meaningful labels to thousands of documents at once.
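The model-selection logic reduces to picking the topic count at the peak of the coherence curve. A minimal sketch follows; the metric values are hypothetical sample numbers, where in the real pipeline each pair comes from fitting a Gensim LdaModel and scoring it with CoherenceModel (C_v) and log perplexity.

```python
# Hypothetical scores from a sweep over candidate topic counts.
topic_counts = [4, 6, 8, 10, 12, 14]
coherence_cv = [0.41, 0.47, 0.53, 0.52, 0.48, 0.45]  # higher = more interpretable

def pick_num_topics(ks, coherence):
    """Choose k at the coherence peak (the 'elbow' of the C_v curve)."""
    best = max(range(len(ks)), key=lambda i: coherence[i])
    return ks[best]

print(pick_num_topics(topic_counts, coherence_cv))  # -> 8 for these sample scores
```

Perplexity is tracked alongside as a sanity check on mathematical fit, but coherence drives the final choice because it correlates better with human-readable topics.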

Process

  • Data Engineering: Stripped HTML, non-alphabetic characters, and stop words to refine the "signal" within the text.

  • Thematic Modelling: Leveraged Gensim’s LDA implementation, tuning Alpha and Eta hyperparameters to control topic-document density and word distributions.

  • Quantitative Validation: Plotted Coherence Scores to identify the peak "semantic similarity" point, ensuring that the words within each topic were logically related.

  • Interactive Interpretation: Used PyLDAvis to visualize the prevalence and distinctness of each topic, ensuring clear thematic boundaries (e.g., separating Financial Infrastructure from Environmental Policy).

Results

  • Scalable Categorization: Successfully converted 100% of the unstructured corpus into clear, labeled categories, reducing manual audit time by an estimated 90%.

  • Mathematical Convergence: Proved model validity through a dual-metric approach, balancing low Perplexity (mathematical fit) with high Coherence (human meaning).

  • High Intertopic Distance: Achieved distinct, non-overlapping clusters in the PyLDAvis map, confirming the model effectively captured unique themes without redundancy.

  • Actionable Discovery: Identified top salient terms for key sectors, providing a structured foundation for further trend analysis and executive decision-making.
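The categorization step above amounts to assigning each document the label of its dominant topic in the LDA distribution. A minimal sketch, where the topic labels and probability vectors are hypothetical examples of what the fitted model and PyLDAvis inspection produce:

```python
# Hypothetical human-assigned labels, chosen after inspecting PyLDAvis.
TOPIC_LABELS = {0: "Financial Infrastructure", 1: "Urban Development", 2: "Environmental Policy"}

# Hypothetical per-document topic-probability distributions from the LDA model.
doc_topic_dist = [
    [0.72, 0.18, 0.10],  # doc 0
    [0.05, 0.15, 0.80],  # doc 1
    [0.20, 0.65, 0.15],  # doc 2
]

def label_documents(distributions, labels):
    """Map each document to the name of its highest-probability topic."""
    out = []
    for dist in distributions:
        dominant = max(range(len(dist)), key=lambda i: dist[i])
        out.append(labels[dominant])
    return out

print(label_documents(doc_topic_dist, TOPIC_LABELS))
```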

Tech Stack

  • Language: Python

  • NLP & Modeling: Gensim, NLTK, SpaCy

  • Data Science: Pandas, NumPy, Regex

  • Visualization: PyLDAvis, Matplotlib, Seaborn

Project Gallery

View the Source Code
