Link to Project

Introduction: Why Predict Rainfall?

As I continued to develop my skills as a data scientist, I wanted a project that would allow me to explore both classification and feature engineering. Predicting weather patterns, particularly rainfall, seemed like an excellent challenge. Not only did this project promise to provide useful insights, but it also gave me a chance to apply different machine learning models and evaluate their performance.

This project was part of a Kaggle competition, which provided a real-world context for the problem and a chance to benchmark my results against others in the field. Weather prediction is a complex and highly impactful problem, and though it’s common in data science, it still offers ample opportunity for innovation, especially when it comes to improving accuracy and handling different types of data.

Choosing the Dataset & Preprocessing the Data

For this project, I used the Weather Prediction Dataset, which contains various weather-related variables, including temperature, humidity, wind speed, and cloud cover. The target variable is rainfall, which is binary—either it will rain (1) or it won’t (0).

Once I loaded the data, the preprocessing steps included the following (a code sketch follows the list):

  • Handling missing values by imputing them with the mean or median.
  • Feature scaling to standardize the continuous variables for better model performance.
  • Encoding categorical variables (like wind direction) into numerical representations.
  • Splitting the data into training and testing sets (80/20 split).
  • Feature engineering to create new variables from the existing ones, like creating a “feels like” temperature using temperature and humidity.
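
Here's a minimal sketch of those steps. The column names (temperature, humidity, wind_direction, rainfall, etc.) and the "feels like" formula are simplified assumptions for illustration, not the competition's exact schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("weather.csv")  # hypothetical file name

# Impute missing numeric values (median here; mean works similarly).
num_cols = ["temperature", "humidity", "wind_speed", "cloud_cover"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Encode a categorical variable such as wind direction.
df = pd.get_dummies(df, columns=["wind_direction"], drop_first=True)

# Engineered feature: a rough "feels like" proxy from temperature and humidity.
# (Illustrative only; a real heat-index formula is more involved.)
df["feels_like"] = df["temperature"] + 0.05 * df["humidity"]

X = df.drop(columns=["rainfall"])
y = df["rainfall"]

# 80/20 split, then scale continuous features using training-set statistics only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```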

I experimented with adding more features, but ran into complications: the feature-engineering code accumulated technical debt and became harder to maintain. I ultimately decided to focus on the models first and revisit feature engineering once I had refactored to a cleaner codebase.

Model Selection & Training

I tested four different models to predict rainfall (a training sketch follows the list):

  • Logistic Regression
  • Random Forest
  • XGBoost
  • Support Vector Classifier (SVC)
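
Here's a sketch of how the comparison was set up, reusing X_train and y_train from the preprocessing sketch above; the hyperparameters shown are placeholder defaults, not the exact settings I used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "SVC": SVC(probability=True, random_state=42),
}

# 5-fold cross-validated ROC-AUC for each candidate model.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC-AUC = {scores.mean():.4f}")
```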

While I initially considered SVC and Random Forest as top contenders, the Logistic Regression model outperformed the others in both training speed and ROC-AUC. Here's a quick look at the results:

  • Logistic Regression: 89.68% ROC-AUC
  • XGBoost: 88.32% ROC-AUC
  • Random Forest: 86.15% ROC-AUC

I also wanted to fine-tune the models using hyperparameter optimization, so I employed RandomizedSearchCV and GridSearchCV to find the best parameters. This was an essential step, as fine-tuning can significantly improve model performance. Here’s how the two methods compared:

  • GridSearchCV was helpful in searching exhaustively through a specified parameter grid for the best results, but it took longer to run.
  • RandomizedSearchCV was faster and more efficient in narrowing down the best parameters by sampling a random subset of the parameter space.

Both methods improved model accuracy by a small margin, but GridSearchCV with the Logistic Regression model provided the most noticeable improvement.
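
For reference, here's a minimal GridSearchCV sketch for the logistic regression model; the grid values are illustrative assumptions, not the exact search space from the project:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the real search space may have differed.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["lbfgs", "liblinear"],
}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print(f"Best CV ROC-AUC: {search.best_score_:.4f}")
```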

Results & Key Takeaways

One of the biggest lessons I learned was that a simple model can often outperform more complex ones when the data is well prepared. Although XGBoost showed promising results, Logistic Regression delivered nearly identical performance in this case at a fraction of the computational cost.

The project also taught me a lot about feature engineering. The more I worked with the data, the more I realized how much additional context variables like “feels like” temperature or dew point could improve the model’s predictions. However, the complications from feature engineering led me to delay additional changes until the codebase was in better shape.
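
As an example of the kind of derived feature I have in mind, here's a hedged sketch of a dew-point column computed with the Magnus approximation (the constants are the commonly published ones; treat it as illustrative rather than something that shipped in this project):

```python
import numpy as np

def dew_point_c(temp_c, rel_humidity_pct):
    """Approximate dew point in degrees C via the Magnus formula."""
    a, b = 17.625, 243.04  # widely used Magnus constants
    gamma = np.log(rel_humidity_pct / 100.0) + (a * temp_c) / (b + temp_c)
    return (b * gamma) / (a - gamma)

# Assumes the hypothetical temperature/humidity columns from the earlier sketches.
df["dew_point"] = dew_point_c(df["temperature"], df["humidity"])
```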

Future Improvements

If I were to revisit this project, here are some areas I’d focus on:

  • Experiment with deep learning models, such as neural networks, to capture more complex patterns in the data.
  • Include more external data, such as regional weather patterns or satellite data, to improve model accuracy.
  • Explore different feature selection techniques to remove any irrelevant features.

Final Thoughts

Working on this rainfall prediction project as part of a Kaggle competition was a fantastic way to develop my understanding of classification tasks, feature engineering, and model evaluation. It was a great example of how real-world data science problems require not just technical skills but also creativity and problem-solving.

This project has set the foundation for many more data science projects to come. I’m excited to continue exploring more complex problems, applying machine learning techniques, and improving my skills along the way!