
Link to Project

Introduction: Why Sentiment Analysis?

As part of my roadmap to becoming a data scientist, I knew I needed to tackle challenging projects that would help me build a strong foundation in machine learning. I wanted something that would not only solidify my understanding but also impress others when showcasing my skills.

I came across a video by Infinite Codes titled 22 Machine Learning Projects That Will Make You A God At Data Science, which laid out a plan for foundational projects. One project that caught my attention was building a Sentiment Analysis System. Through it, I would gain experience using the Natural Language Toolkit (NLTK), understand concepts like stop words, tokenization, and lemmatization, and dive into the fundamentals of Natural Language Processing (NLP).

Choosing the Dataset & Preprocessing the Data

For this project, I selected the Sentiment140 dataset, a well-documented, pre-labeled dataset that is beginner-friendly. The dataset contains tweets labeled as positive or negative, making it ideal for training a binary sentiment classifier.

With the dataset loaded, I applied the following preprocessing steps to get the tweets clean and ready for modeling:

  • Convert text to lowercase to maintain consistency.
  • Remove noise such as punctuation, numbers, mentions (@’s), URLs, and special symbols.
  • Expand contractions (e.g., “can’t” → “cannot”) to standardize the text.
  • Remove stopwords (common words like “the” and “is” that add little meaning).
  • Tokenize the text (split it into individual words).
  • Lemmatize or stem the tokens (reduce words to their base form, e.g., “running” → “run”).

Luckily, thanks to the power of NLTK, these preprocessing steps were straightforward, and I didn’t run into any major issues.

Model Selection & Training

I wanted to test four models:

  • Naïve Bayes (NB)
  • Logistic Regression (LR)
  • Support Vector Classifier (SVC)
  • Random Forest Classifier (RFC)

However, I quickly ran into trouble with SVC and RFC: both took far too long to fit. Sentiment140 contains 1.6 million tweets, and neither model scales well to a sparse text corpus of that size, so I had to abandon them as options.

This left me with Naïve Bayes and Logistic Regression for comparison. Here’s how they performed:

  • Naïve Bayes Accuracy: 75.85%
  • Logistic Regression Accuracy: 77.34%
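The head-to-head comparison can be reproduced in miniature with scikit-learn. The corpus below is a tiny hypothetical stand-in rather than Sentiment140 (so the accuracies won’t match mine), but the structure mirrors the experiment: TF-IDF features feeding each classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the real project used the preprocessed Sentiment140 tweets.
texts = ["love this", "great day", "so happy", "awesome product",
         "hate this", "terrible day", "so sad", "awful product"] * 10
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 10  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

results = {}
for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    model = make_pipeline(TfidfVectorizer(), clf)  # TF-IDF features -> classifier
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {results[name]:.2%}")
```

Wrapping the vectorizer and classifier in one pipeline also avoids a subtle leak: the TF-IDF vocabulary is fit only on the training split, never on the test tweets.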

An improvement of ~1.5 percentage points may not seem huge, but it was enough to convince me that Logistic Regression was the better choice. Its learned coefficients also map directly to per-word sentiment weights, which makes the model easy to interpret and explain.

To further optimize my model, I used GridSearchCV to tune the hyperparameter C in Logistic Regression (the inverse of regularization strength: larger C means less regularization). The best value turned out to be C=10, which nudged accuracy up to 77.49%, a minor but notable enhancement.
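The search itself can be sketched like this. Again the corpus is a toy stand-in and the parameter grid is illustrative, not necessarily the one I ran; the point is the shape of a GridSearchCV over a text pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus, as before; the real search ran on the Sentiment140 features.
texts = ["love this", "great day", "so happy", "awesome product",
         "hate this", "terrible day", "so sad", "awful product"] * 10
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 10

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its class, hence the "logisticregression__" prefix.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(texts, labels)
print("best C:", grid.best_params_["logisticregression__C"])
print(f"cv accuracy: {grid.best_score_:.2%}")
```

Because the vectorizer lives inside the pipeline, each cross-validation fold refits TF-IDF on its own training portion, so the reported score isn’t inflated by vocabulary leakage.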

Results & Key Takeaways

Going into this project, I had unrealistic expectations—I assumed I could achieve 90%+ accuracy with ease. However, I quickly realized that real-world NLP is much trickier than I anticipated.

Upon reviewing the dataset, I noticed instances where tweets were mislabeled (e.g., positive tweets marked as negative and vice versa). This suggests that data quality played a role in limiting accuracy.

Despite this, I’m proud of my first NLP project. It introduced me to fundamental concepts, gave me hands-on experience, and laid the groundwork for future improvements.

Future Improvements

If I revisit this project, I’d like to:

  • Experiment with deep learning (e.g., LSTMs or Transformer models like BERT).
  • Improve data preprocessing, possibly implementing sentiment-aware text cleaning.
  • Address class imbalance (if present) to improve model generalization.
  • Try more advanced feature extraction techniques beyond simple bag-of-words or TF-IDF.

Final Thoughts

Diving into NLP through this project has been a game-changer for my learning journey. If you’re an aspiring data scientist, I highly recommend trying sentiment analysis. It’s a fantastic way to understand text preprocessing, feature engineering, and model evaluation while working on a real-world application.

This is just the beginning of my NLP journey, and I’m excited to explore more advanced techniques in the future!