Why I Started This

India's electricity grid is one of the most complex in the world. It's not one grid — it's five regional grids (Northern, Southern, Eastern, Western, and Northeastern), together spanning dozens of states and union territories with wildly different demand profiles, climates, and consumption patterns. Gujarat has over 19 MU of mean forecast error at the state level. Nagaland has 0.15 MU. These aren't just different scales — they're different problems.

Accurately forecasting electricity demand in India matters for concrete reasons: grid operators need to schedule generation ahead of time, extreme heat waves cause sudden demand spikes that strain infrastructure, and poor forecasts lead to either wasteful over-generation or rolling blackouts.

There were notebooks. There were dashboards. But there wasn't a tool that combined ML forecasting + weather risk visualization + natural language explanations into something a non-technical analyst could actually use.

That's what I set out to build.


Stage 0: The Data Problem (September 13, 2025)

The first commit was almost nothing — a data_scraping.py file and an empty notebook. Before any modeling could happen, I needed data: historical electricity demand by state, and historical weather data for each state capital.

data_scraping.py was a 118-line script that pulled electricity demand data from India's POSOCO (Power System Operation Corporation) datasets. The script handled pagination, state-wise filtering, and basic cleaning.

Weather data came from ERA5 reanalysis — daily temperature (min/mean/max at 2m), dewpoint temperature, wind components (u and v at 10m), surface solar radiation downwards, total cloud cover, and UTCI (Universal Thermal Climate Index). All of it in Kelvin, as ERA5 provides.

The pairing of demand and weather, state by state, from 2013 to 2022, became the training dataset.


Stage 1: The First Model (September 17, 2025)

Four days after the initial commit, Stage 1 was done. The output: xgb1_model.pkl and a sample_cleaned_dataset.csv.

The model was an XGBoost regressor trained on the combined demand+weather dataset. The feature set at this point was simpler — temperatures, calendar features (day of week, month, year, is_weekend), and state encoding.

But even at Stage 1, I'd already discovered something important: electricity demand in India doesn't follow a simple temperature curve. In northern states like Rajasthan, scorching summers drive air conditioning demand through the roof. In Kerala, the relationship is weaker — the state has a more moderate coastal climate, and the demand pattern is flatter. Any model that treated "India" as one entity would be wrong for everyone.

Stage 1 ended with a working model. No UI. No inference pipeline. Just a trained file and a cleaned dataset.


The Gap: September to January

Nothing was committed for three and a half months.

This is the part of the development timeline that git doesn't capture — the experimentation that didn't make it in. The second notebook pass. The rethinking of the feature set. The decision to move from xgb1 to a retrained xgb2 with log-transformed targets and a richer feature set.

When commits resumed in January 2026, the scope had completely changed.


Stage 1.5 Begins: The Frontend (January 7, 2026)

The January 7th commit added three things that changed everything: frontend/app.py (a 1,059-line Streamlit monolith), requirements.txt, and a PRD document.

The PRD lays out the philosophy of the stage:

"Electricity demand forecasting models often fail in practice because predictions are not interpretable to decision-makers, extreme weather events cause systematic forecast errors, and insights remain locked inside notebooks."

The goal wasn't just to serve predictions. It was to make predictions explainable — to close the gap between what the model knows and what a grid analyst can act on.

The PRD defined three core features:

  • Demand Forecasting Module — State-level daily forecasts up to 30 days out, with actual vs predicted historical view
  • Weather Impact & Risk Indicators — Heatwave detection, Cooling Degree Days visualization, extreme-event markers
  • Explainable AI Layer — RAG-powered natural language answers about forecasts, feature importance, and model confidence

All of this was initially crammed into one file.


The Feature Engineering Problem

Before the UI could work, the inference pipeline had to be built — and this is where things got genuinely hard.

The XGBoost model was trained on a specific feature set with a specific engineering pipeline. At inference time, I'm receiving weather data from an API. The gap between what the API gives me and what the model expects is enormous.

Problem 1: Units. ERA5 training data was in Kelvin. Open-Meteo API returns Celsius. A 273.15 offset that silently destroys your predictions if you miss it. The weather.py module handles this explicitly:

# Open-Meteo returns Celsius; the model was trained on ERA5 data in Kelvin
forecast_df["2m_temperature_max"] = forecast_df["2m_temperature_max"] + 273.15
forecast_df["2m_temperature_min"] = forecast_df["2m_temperature_min"] + 273.15
forecast_df["2m_temperature_mean"] = forecast_df["2m_temperature_mean"] + 273.15

Problem 2: The CDD threshold. Cooling Degree Days are computed as max(0, T_mean - T_base). The base temperature T_base was determined during training as 297.15 K (24°C). If I compute it differently at inference time, the CDD feature no longer means the same thing. The fix: store the threshold in model_metadata.json and read it at inference time.

Problem 3: The extreme heat flag. During training, extreme heat was defined as the 95th percentile of temperature across the full training dataset — a value of 305.08 K (31.9°C). But early code computed this percentile on the forecast window, not the training distribution. A 7-day forecast in mild weather would flag its warmest day as "extreme heat" even if that day was a perfectly normal 28°C afternoon. Fixed by storing extreme_heat_threshold_kelvin: 305.082483 in the metadata JSON.
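Both thresholds can then be read back at inference time. A minimal sketch of that pattern — note that only the `extreme_heat_threshold_kelvin` key is confirmed above; the `cdd_base_kelvin` key name is my assumption:

```python
import json

# Illustrative metadata payload; in the real project this would come from
# model_metadata.json. The cdd_base_kelvin key name is hypothetical.
metadata = json.loads("""{
    "cdd_base_kelvin": 297.15,
    "extreme_heat_threshold_kelvin": 305.082483
}""")

def cooling_degree_days(t_mean_k: float, base_k: float) -> float:
    """CDD = max(0, T_mean - T_base), in Kelvin as used during training."""
    return max(0.0, t_mean_k - base_k)

def is_extreme_heat(t_max_k: float, threshold_k: float) -> bool:
    """Flag a day against the training-set 95th-percentile threshold."""
    return t_max_k >= threshold_k

cdd = cooling_degree_days(300.15, metadata["cdd_base_kelvin"])  # a 27 C day
hot = is_extreme_heat(306.0, metadata["extreme_heat_threshold_kelvin"])
```

The point is that neither function contains a magic number: every training-derived constant flows in from the metadata file.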

Problem 4: The state_30d_baseline. The model uses a 30-day rolling historical average demand for each state as one of its features — and as the denormalization multiplier. This feature doesn't come from any weather API. It has to be computed from historical demand CSVs at inference time, matched to the same calendar dates. features.py handles this:

historical_demand = load_historical_demand(state_name)
if historical_demand is not None:
    df['state_30d_baseline'] = df['date'].apply(
        lambda d: calculate_state_30d_baseline(historical_demand, d)
    )

Problem 5: The lag features. The model was trained with lag-1, lag-3, and lag-7 temperature features. At forecast time, those don't exist. The current implementation approximates all lags with the current day's temperature. This is documented as a known limitation — and one that code review later flagged as introducing systematic bias.
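The approximation itself is one line per lag. A hedged sketch (the feature names are illustrative, not necessarily the real column names):

```python
# Documented limitation: at forecast time true lagged temperatures don't
# exist, so every lag feature is filled with the current day's value.
def fill_lag_features(row):
    t = row["2m_temperature_mean"]
    for lag in (1, 3, 7):
        # Fine in stable weather; systematically biased during rapid swings
        row[f"temp_lag_{lag}"] = t
    return row

row = fill_lag_features({"2m_temperature_mean": 303.15})
```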

The docstring that holds it all together:

def engineer_features(df, state_name, metadata=None):
    """
    Engineer features exactly as done in training.

    This function must match the notebook's feature engineering pipeline exactly.
    """

That docstring is the most important line in the codebase. Feature engineering isn't preprocessing — it's a contract between training and inference.


The Second Model: XGBoost v2

Between January 7th and January 10th, the model was retrained. xgb2_model.pkl replaced xgb1_model.pkl.

The improvements:

  • Log-transformed target. Demand values were log-transformed before training and exponentiated back at inference. This stabilizes training on states with very different absolute demand magnitudes (Nagaland averages ~3 MU; Maharashtra averages ~250 MU).
  • Richer features. Temperature interaction terms, temperature range features, and extended rolling features.
  • Per-state RMSE in metadata. The model metadata now includes state-specific RMSE values. This matters because a single overall RMSE hides enormous variance.
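The log-transform round trip is simple but easy to get wrong in one direction only. A sketch, assuming the `log1p`/`expm1` pair (the exact transform the project uses isn't stated; `log1p` is shown because it stays well-behaved near zero, which matters for states like Nagaland):

```python
import math

# Train on log-transformed demand, invert at inference.
def to_target(demand_mu: float) -> float:
    return math.log1p(demand_mu)   # log(1 + x): safe for small demand values

def from_prediction(pred: float) -> float:
    return math.expm1(pred)        # exact inverse of log1p

round_trip = from_prediction(to_target(250.0))  # a Maharashtra-scale day
```

Whatever transform is used, the critical property is that inference applies the exact inverse — an `exp` paired with a `log1p` would silently shift every prediction.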

The final model metrics:

Metric            Value
R²                0.9951
Overall RMSE      9.87 MU
Overall MAE       4.87 MU
Training period   2013–2022
Features          98
States covered    27

State-level RMSE breakdown (selected):

State               RMSE (MU)
Nagaland            0.15
Manipur             0.20
Arunachal Pradesh   0.19
Assam               1.15
Kerala              1.75
Odisha              5.62
Bihar               6.13
Maharashtra         15.72
Tamil Nadu          15.82
Telangana           19.14
Rajasthan           25.88

Rajasthan's RMSE is 170× Nagaland's. One model, one country, radically different predictability. The northeast states have small, stable demand. The large northern and southern states have high variability driven by extreme temperature swings.


The RAG AI Assistant (January 10, 2026)

The January 10th commit added 2,641 lines of new code in a single push — the biggest change in the project's history.

The commit message: "Add RAG engine, utils, prompts, and update app, notebook, and model."

What it added:

  • utils/embeddings.py — embedding generation via sentence-transformers (local, free) with OpenRouter as optional fallback
  • utils/vector_store.py — FAISS-based vector store with persistence
  • utils/rag_builder.py — document builders that turn model metadata, feature importances, and forecast summaries into indexed text chunks
  • utils/rag_engine.py — query engine with similarity search and response generation
  • prompts/assistant.md — the system prompt for the AI assistant

The RAG system indexes three types of knowledge:

  1. Model metrics — built from model_metadata.json. State-specific RMSE, overall accuracy, training period.
  2. Feature importance — extracted from model.feature_importances_. Which weather variables actually drive demand.
  3. Forecast summary — generated from the live forecast. Peak demand date, average demand, high-risk dates, extreme heat dates — all locked into a structured schema.

The "locked schema" design is the key anti-hallucination mechanism. When the system generates a forecast summary, it creates a dictionary with explicit keys:

summary = {
    'state': state,
    'avg_demand': round(float(results_df['forecasted_demand_MU'].mean()), 1),
    'peak_demand': round(float(results_df['forecasted_demand_MU'].max()), 1),
    'peak_date': results_df.loc[results_df['forecasted_demand_MU'].idxmax(), 'date'].strftime('%Y-%m-%d'),
    'high_risk_dates': high_risk_dates,
    'extreme_heat_dates': extreme_heat_dates,
    'rmse': round(float(state_rmse), 2) if state_rmse is not None else None,
}

The RAG engine can only answer questions from what it finds in the vector store. It can't make up demand values — they came from the model. It can't fabricate RMSE — it's in the metadata. The assistant is grounded.


The Code Review: What Got Found

In early March 2026, the codebase went through a formal code review. Rating: 7.5/10. The review was detailed and found several real problems.

Finding 1: Confidence intervals are statistically wrong.

The app was computing confidence bands as z * RMSE:

from scipy import stats  # needed for norm.ppf

z_score = stats.norm.ppf((1 + confidence_level) / 2)
margin = z_score * state_rmse * widening_factor
upper_bound = results_df['forecasted_demand_MU'] + margin

RMSE is not a standard deviation: it conflates bias and variance into a single number. Using it as σ in a normal-distribution confidence formula produces intervals that look reasonable but are statistically meaningless.
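A more defensible alternative — and the direction the roadmap later takes — is to build intervals from the empirical quantiles of held-out residuals rather than assuming normality. A minimal sketch, with made-up residual values standing in for the real validation-fold residuals:

```python
import statistics

# Hedged sketch: empirical prediction intervals from held-out residuals
# (actual - predicted, in MU). These ten values are invented for illustration.
residuals = [-6.0, -4.0, -2.5, -1.0, 0.0, 0.5, 1.5, 3.0, 4.5, 7.0]

def empirical_interval(point_forecast, residuals, level=0.90):
    """Interval from residual percentiles, e.g. 5th/95th for a 90% band."""
    qs = statistics.quantiles(residuals, n=100, method="inclusive")
    k = round((1.0 - level) / 2.0 * 100)          # 5 for a 90% band
    return point_forecast + qs[k - 1], point_forecast + qs[100 - k - 1]

lower, upper = empirical_interval(100.0, residuals)
```

Unlike z * RMSE, this captures asymmetric error (the model can over-predict more than it under-predicts, or vice versa) and makes no normality assumption.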

Finding 2: The extreme heat flag was computed on the wrong distribution (fixed by the metadata threshold — see above).

Finding 3: Placeholder zeros for historical features fed directly to the model.

The model was trained with real generation_mu, energymet_7d_avg, demand_rolling values. At inference time, all of these were hardcoded to 0.0. Since generation_mu = 0 implies a complete blackout, and the model learned weights for these features, the zeros introduce systematic prediction bias that's invisible to the user.

Finding 4: state_30d_baseline hardcoded to 100.0 as fallback.

For states without historical demand CSVs, the baseline was set to 100.0. For Maharashtra with mean demand ~250 MU, this produces wildly wrong predictions without any warning. The fix: use the state-specific get_fallback_baseline() function already present elsewhere in the codebase.
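The shape of that fix is simple. A hedged sketch — the baseline numbers below are illustrative, taken from the averages quoted earlier in this post, not the real values inside `get_fallback_baseline()`:

```python
# State-aware fallback instead of a single hardcoded 100.0. A flat 100.0 is
# wrong by ~2.5x for Maharashtra and ~30x for Nagaland.
FALLBACK_BASELINES_MU = {
    "Maharashtra": 250.0,  # approximate mean demand quoted in the article
    "Nagaland": 3.0,
}

def get_fallback_baseline(state: str, default: float = 100.0) -> float:
    return FALLBACK_BASELINES_MU.get(state, default)
```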

The overall assessment:

"The ML pipeline is sound (proper time-series CV, log-transform with correct reversal, state-specific RMSE), the RAG architecture is clean and modular, and the locked-schema design is an excellent defence against LLM hallucination. The main issues cluster around brittle feature engineering, silent data quality problems, and minor RAG correctness issues."


The Refactor: From 1,682-Line Monolith to Modules (March 16, 2026)

The most recent commit is the largest structural change in the project.

Before: app.py was 1,682 lines. Everything lived in one file.

After: app.py is 158 lines. It's now just an orchestrator:

app.py              → Page config, sidebar, tab routing (158 lines)
config.py           → Constants: STATE_COORDINATES, ALL_STATES
models.py           → Model loading, metadata, RMSE lookup
data_loading.py     → Historical data, baselines, denormalization
weather.py          → Open-Meteo API + climatology fallback
features.py         → Feature engineering + prediction prep
visualization.py    → All Plotly charts + weather insights
rag.py              → Forecast summary builder + RAG init
ui/forecast_tab.py  → Tab 1: Demand Forecast
ui/weather_tab.py   → Tab 2: Weather Impact Analysis
ui/assistant_tab.py → Tab 3: AI Energy Assistant

The notebook was also split into three separate files:

  • 01_data_preprocessing.ipynb
  • 02_eda.ipynb
  • 03_model_training.ipynb

This separation matters: the training notebook documents what was done. The inference code has to match it exactly. Keeping them separate and clearly labeled makes that contract visible.


The Current Architecture

As of Stage 1.5, the platform works like this:

Forecast Tab:

  • User selects a state and forecast horizon (up to 16 days)
  • weather.py calls Open-Meteo API — free, no API key required
  • If the API fails, it falls back to climatology (historical average for same calendar dates)
  • features.py engineers the full 98-feature input vector, matching training exactly
  • The XGBoost model predicts log-transformed demand, which is then denormalized using the state_30d_baseline
  • Results are displayed with confidence bands and plotted with Plotly
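The climatology fallback in that flow is worth sketching, since it's what keeps the app usable when the API is down. Assuming a simple calendar-keyed history (the real `weather.py` data layout may differ):

```python
from datetime import date

# Hedged sketch: when the live API call fails, fall back to the historical
# average for the same calendar date across past years.
def climatology_fallback(history, target):
    """history: dict mapping (month, day) -> list of past temps (Kelvin)."""
    temps = history.get((target.month, target.day), [])
    return sum(temps) / len(temps) if temps else None

history = {(6, 1): [310.2, 309.8, 311.0]}  # June 1 across three past years
t = climatology_fallback(history, date(2026, 6, 1))
```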

Weather Impact Tab:

  • Dual-axis plots overlaying demand forecast with temperature
  • Cooling Degree Day visualization
  • Extreme heat day markers
  • Feature importance chart from the model

AI Assistant Tab:

  • FAISS vector store, indexed with model metrics + feature importance + live forecast summary
  • sentence-transformers for local embeddings (no paid API)
  • Chat interface using Streamlit's st.chat_input
  • Answers include sources and a confidence score based on retrieval similarity
  • Context-aware: knows which state and horizon the user is currently viewing
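The retrieval-based confidence score can be sketched in a few lines. This is my illustration of the idea, not the real `rag_engine.py` scaling; cosine similarity of the query embedding against the retrieved chunks is one common choice:

```python
import math

# Hedged sketch: convert retrieval similarity into a confidence score.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieval_confidence(query_vec, retrieved_vecs):
    """Average similarity of the retrieved chunks, clamped to [0, 1]."""
    sims = [cosine(query_vec, v) for v in retrieved_vecs]
    return max(0.0, min(1.0, sum(sims) / len(sims)))

conf = retrieval_confidence([1.0, 0.0], [[1.0, 0.0], [0.6, 0.8]])
```

Low average similarity means the vector store holds nothing close to the question, which is exactly when the assistant should signal low confidence rather than improvise.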

What's Real and What's Approximated

What's real:

  • The XGBoost model was genuinely trained on 10 years of Indian state-level electricity demand + weather data
  • R² of 0.9951 reflects real held-out test performance, not training accuracy
  • State-specific RMSE values reflect per-state out-of-sample accuracy
  • The weather data comes from a real API (Open-Meteo), not synthetic data
  • The holiday detection uses the actual holidays library with India as the jurisdiction

What's approximated:

  • Lag temperature features (lag-1, lag-3, lag-7) are set to the current forecast day's temperature. This introduces some bias for multi-day forecasts.
  • Historical generation features (generation_mu, energymet_7d_avg, etc.) are set to 0. The model trained on real values; feeding zeros is a known source of degraded accuracy.
  • Confidence intervals are currently computed as z * RMSE, which is a pragmatic approximation, not a statistically rigorous prediction interval.

The code comments are explicit about all of these. The goal is a tool that's useful despite its limitations, not one that hides them.


What's Next

The roadmap:

  • Better uncertainty quantification — Proper prediction intervals from residual analysis rather than z * RMSE
  • Fix the generation feature gap — Load actual historical generation data at inference time so those features aren't zeroed out
  • Improve lag features — For multi-day forecasts, propagate predicted temperatures forward as lag inputs for subsequent days
  • Stage 2: Household-level analytics — Smart meter data, appliance-level disaggregation
  • Deployment — The project currently runs locally with streamlit run app.py

The Takeaways

Six months of development, 17 commits, one code review, one major refactor, and two model versions.

1. Feature engineering is the hardest part. Not the model. Not the UI. Getting training and inference to match exactly — same units, same thresholds, same state encoding order, same calendar handling — is where the real engineering is. Write it down. Name the contract explicitly in the docstring.

2. Store your thresholds in metadata, not in code. If a threshold was derived from training data (CDD base, extreme heat percentile), it belongs in model_metadata.json, not hardcoded. This is the difference between a model that produces consistent results and one that silently drifts.

3. Be honest about approximations. The lag feature problem is real. Zeroing out generation features is real. These degrade accuracy, and the users of this tool deserve to know. Comments, warnings, and documentation aren't weakness — they're what separates a research demo from a tool someone can actually trust.

4. The modular refactor was worth it. Going from 1,682 lines to 158 in app.py — with clear module boundaries — made the codebase reviewable, testable, and maintainable in a way the monolith never was. Do it early, not late.

5. RAG works best with locked schemas. The AI assistant is grounded because it can only answer from what's in the vector store, and what's in the vector store was generated from deterministic, structured data. The prompt engineering layer then formats the retrieval into natural language. That separation of concerns is what prevents hallucination.


Source: github.com/Riwan000/Electricity-Demand-Forecaster