Text and Social Analytics Project
An end-to-end NLP pipeline that transforms thousands of unstructured e-commerce reviews into strategic business insights. By automating the extraction and analysis of customer feedback for leading water bottle brands (Owala, Stanley, Yeti) across multiple e-commerce platforms (Target, Amazon, Walmart), the system surfaces brand loyalty drivers and product pain points at a scale manual review could never reach.
The Brief
Target Audience
Brand Managers, Product Quality Teams, and E-commerce Strategy Executives.
The Problem
Modern e-commerce platforms use dynamic JavaScript and anti-scraping measures that often return 403 Forbidden errors to traditional scrapers. Furthermore, the massive volume of unstructured reviews for these "status brands" makes manual auditing impractical, so companies often miss critical functional defects (such as leakage or poor insulation) until customer dissatisfaction is already widespread.
Process
1. Automated Data Extraction:
- Designed and executed Python scripts using SeleniumBase to simulate authentic user behavior, effectively bypassing 403 Forbidden errors on dynamic platforms like Target.com.
- Automated page scrolling and interaction with JavaScript "Show More" elements to extract comprehensive review data, including titles, text, and star ratings.
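The scraping itself requires a live browser session, but the parsing step can be sketched offline with BeautifulSoup. The HTML fragment and class names below are hypothetical stand-ins for the real markup, which varies by site:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML captured after the browser has scrolled the page and
# clicked every "Show More" button; the class names are illustrative only.
html = """
<div class="review">
  <h3 class="review-title">Keeps drinks cold all day</h3>
  <p class="review-text">Best bottle I have owned.</p>
  <span class="review-rating" data-rating="5"></span>
</div>
<div class="review">
  <h3 class="review-title">Leaks everywhere</h3>
  <p class="review-text">The lid seal failed after a week.</p>
  <span class="review-rating" data-rating="1"></span>
</div>
"""

def parse_reviews(page_html):
    """Extract title, text, and star rating from each review card."""
    soup = BeautifulSoup(page_html, "html.parser")
    reviews = []
    for card in soup.select("div.review"):
        reviews.append({
            "title": card.select_one(".review-title").get_text(strip=True),
            "text": card.select_one(".review-text").get_text(strip=True),
            "rating": int(card.select_one(".review-rating")["data-rating"]),
        })
    return reviews

reviews = parse_reviews(html)
```

Separating parsing from browser automation this way also makes the extraction logic unit-testable against saved page snapshots.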
2. Dataset Orchestration & Standardization:
- Orchestrated the consolidation of individual datasets for brands like Owala, Stanley, Yeti, and Hydro Flask into a single master CSV.
- Standardized the data schema (Product Name, Review Text, Rating, Brand) to ensure consistency across different e-commerce sources.
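The consolidation step can be sketched with pandas. The per-brand source column names below are illustrative assumptions; only the target schema comes from the project:

```python
import pandas as pd

# Illustrative per-brand frames; real column names varied by source site,
# so each is renamed into the shared schema before concatenation.
owala = pd.DataFrame({
    "name": ["Owala FreeSip 24oz"],
    "body": ["Love the flip lid"],
    "stars": [5],
})
stanley = pd.DataFrame({
    "product": ["Stanley Quencher 40oz"],
    "review": ["Does not fit my cup holder"],
    "rating": [2],
})

SCHEMA = ["Product Name", "Review Text", "Rating", "Brand"]

def standardize(df, column_map, brand):
    """Rename source-specific columns and tag each row with its brand."""
    out = df.rename(columns=column_map)
    out["Brand"] = brand
    return out[SCHEMA]

master = pd.concat(
    [
        standardize(owala, {"name": "Product Name", "body": "Review Text", "stars": "Rating"}, "Owala"),
        standardize(stanley, {"product": "Product Name", "review": "Review Text", "rating": "Rating"}, "Stanley"),
    ],
    ignore_index=True,
)
# master.to_csv("master_reviews.csv", index=False)
```

Selecting `out[SCHEMA]` at the end guarantees every contributor's frame arrives with identical columns in identical order before the concat.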
3. Advanced Text Engineering:
- Implemented a rigorous 10-step preprocessing pipeline using NLTK and Regex, featuring contraction expansion, noise removal (emojis/URLs), and lemmatization.
- Engineered Bigrams within the TF-IDF vectorization process to capture context-sensitive phrases like "not cold" or "leaks everywhere".
4. Model Selection & Evaluation:
- Executed an evaluation strategy using a 95/5 split, training on Dataset A and performing final assessments on a strictly unseen Dataset B.
- Tuned a Support Vector Machine (SVM) model, prioritizing Negative Recall (0.60) and a Negative F1 of 0.6667 to ensure the most critical customer complaints were not missed.
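The evaluation idea can be sketched on toy data. The corpus, labels, and the `class_weight` setting below are illustrative assumptions, not the project's actual configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny synthetic stand-in for the review corpus (1 = positive, 0 = negative);
# the real project trained on Dataset A and held out Dataset B entirely.
train_texts = [
    "keeps drinks cold all day", "love this bottle", "great quality lid",
    "perfect size and color", "leaks everywhere in my bag",
    "not cold after one hour", "lid broke and leaks",
]
train_labels = [1, 1, 1, 1, 0, 0, 0]
test_texts = ["love this bottle keeps drinks cold", "it leaks everywhere"]
test_labels = [1, 0]

# class_weight='balanced' counteracts the heavy positive skew, trading some
# positive precision for higher recall on the rare negative class.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(class_weight="balanced", random_state=0),
)
model.fit(train_texts, train_labels)
preds = model.predict(test_texts)

# Recall on the negative class: the share of true complaints actually caught.
neg_recall = recall_score(test_labels, preds, pos_label=0)
```

Scoring with `pos_label=0` is the key choice: a model tuned on plain accuracy would look excellent on a 90%-positive corpus while silently missing the complaints the business cares about.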
Tech Stack
- Language: Python
- Web Scraping: BeautifulSoup / SeleniumBase (for handling dynamic content)
- Data Science: Pandas, NumPy, Scikit-Learn
- NLP Frameworks: NLTK (tokenization, lemmatization, stop-word removal)
- Visualization: Matplotlib, Seaborn (for metric comparison and confusion matrices)
My Role
- Extracting Hydro Flask reviews from the dynamic Target.com e-commerce environment using Python web scraping.
- Integrating multi-brand datasets (Owala, Stanley, Yeti, and Hydro Flask) from team members into a unified master corpus.
- Developing the NLP preprocessing pipeline to clean and standardize the consolidated dataset for modeling.
- Implementing the sentiment classification model in Scikit-Learn, with a focus on a tuned SVM and bigram inclusion.
- Assisting with model evaluation by comparing performance metrics such as Macro-F1 and Negative Recall to identify the best business-aligned model.
The Solution
I engineered a resilient, automated system to extract high-quality review data from dynamic web environments. After orchestrating the integration of team-wide data, I developed a 10-step NLP preprocessing engine and a Bigram-based SVM model. While Bigrams increased complexity, the team ultimately prioritized a model that maximized Negative Recall, ensuring that the most critical "complaint" reviews were flagged automatically for business intervention.
Results
- Targeted Detection: Achieved 98% recall for positive reviews, reliably identifying brand advocates.
- Actionable Minority Insights: Despite a heavily imbalanced dataset (roughly 90% positive), the model flagged 60% of negative reviews, providing a foundation for automated product quality triggers.
- Operational Recommendations: Proposed a "Real-Time Negative Sentiment Trigger" to notify Product Quality teams immediately when functional failures (such as insulation or leakage issues) are detected.