Why AI training plans usually fail, and when they do not

At some point most runners and cyclists try asking an AI for a training plan. The result is usually disappointing, not because the LLM/AI is stupid, but because it is working blind.

I connected ChatGPT, Claude and Mistral to my training data in Tredict via MCP and asked each to help me plan my training. All three did well at analysing individual sessions or spotting trends in the data. But when it came to actually building a training plan, the differences showed up fast.

ChatGPT lost the thread as soon as it had to plan more than one session at a time. The structure fell apart, sessions started contradicting each other, and the overall direction stopped making sense. Mistral struggled even more.

Claude was the only one that stayed coherent across the full planning process and built sessions that actually followed from each other. Same data, same MCP connection, same prompt.

The difference likely comes down to how much context each setup can hold at once. Training data is dense, and the token limit fills up fast. With a larger context window ChatGPT and Mistral might do just as well.

I am not a Claude fanboy. I would happily use any assistant that gets this right. But at the time of writing, Claude is the only one that does, so that is why this article and blog appear maybe so Claude centric.

Now see for yourself. Same prompt, same MCP connection, same data. First ChatGPT, then Claude. The difference speaks for itself.

Here is what a conversation with ChatGPT looks like. At first glance it is not too bad actually, but if you look closer things fall apart. ChatGPT misses the entire part, almost 3 months, before January, likely due to the small context window. It gets my easy pace completely wrong with 4:30/km, even though it is correctly presented by the MCP tool, and the created plan uses constantly 85% of my HRmax. This is way too high for easy runs! :-D At the end of the week ChatGPT creates two high effort long runs in a row. This does not make sense at all from a training perspective. The resulting plan is definitely not usable!

ChatGPT works blind, most likely due to its smaller context window and tool token call limit.

Open the real conversation in ChatGPT: Analyze training history and create training plan

Please analyze the recent 6 months of my trainings history and create a 2-week training plan based on your assessments in Tredict.

Key Insights from Your Training

1. Strong comeback after interruption: Big gap between January to early March, then structured rebuild. March shows very consistent progression (excellent execution). You followed a textbook return: HR control, cadence, strides, longer runs.

2. Current fitness profile: Easy pace stabilized around ~4:15 to 4:30/km (aerobic). HR mostly controlled (140 to 155 bpm), good aerobic base rebuilding. Longest recent run: ~8 to 9 km, still early base phase.

3. Important limiter, asymmetry: Your note on GCT imbalance (~48.7%) is very relevant. This is NOT fixed yet, must influence plan design.

Conclusion: You are past rehab, but not ready for real intensity yet. Perfect timing for: controlled volume increase, light neuromuscular work (strides), first very soft tempo exposure.

Your 2-Week Strategy (what I built): 4 runs/week, 1 light quality session, 1 long run, 2 recovery days, strength/mobility integrated.

ChatGPT training plan creation showing intensity distribution chart and workout details with a note saying this is not really meaningful — ChatGPT's training plan output. The intensity distribution and relative HRmax is constantly off in moderate range, rendering the plan actually unusable.

Here is the plan itself, created by ChatGPT 5.3 Instant in Tredict. Click on it to get the full picture of the mess. :-)

2-Week Running Rebuild Plan

A model that has actually seen your training

When a capable AI assistant has access to your actual training history through MCP, the situation changes.

Instead of working from your description, it can look at what you have actually been doing. Your recent sessions, your pace at different heart rate zones, your capacity values over time, how your load has developed across weeks and months. It can see gaps, trends and patterns that you might not have noticed yourself.

The plan it builds from that is not a generic template adjusted for your stated fitness level. It is built on what your body has been doing.

I noticed this clearly during my own comeback after six weeks off sick. Claude had access to my full training history through Tredict and could see exactly where I was coming from. The first week of sessions it planned were conservative in exactly the right way. Not because I told it to go easy, but because it could see what my baseline looked like and how much of a gap there was to fill.

And here is what a complex planning with training analysis and plan creation can look like in a meaningful way. The pre-analysis based on my 6 months of training is spot on. And the resulting plan is meaningful, distributes effort evenly with the correct balance, is in the right intensity zone with the correct relative targets. This is a good training plan with good explanations! Claude gets it right without struggling at all.

Open the real conversation in Claude: Claude trains me after being ill again

Please analyze the recent 6 months of my trainings history and create a 2-week training plan based on your assessments in Tredict.

6-Month Training Analysis

Phase 1, Pre-illness base (Oct to Nov 2025): Solid aerobic base. 3 to 4 sessions/week, ranging from 4.7 km to 11.5 km. Effort values in the 47 to 122 range, consistent week-over-week. Paces around 6:30 to 7:30/km with HRs nicely in the 138 to 155 bpm zone. This was your healthiest block in the period.

Phase 2, Apparent illness gap (late Nov to mid Feb 2026): Training dropped to 1 to 2 runs/week in November, then a clear multi-week gap through January/February (only 2 short Porto runs in Jan). The illness hit hard.

Phase 3, Comeback (Mar 2026): Textbook structured return over 5 weeks: 4 re-entry runs (HR-controlled, walk breaks), then cadence-focus week, then first strides on Mar 23, long run Mar 29, Fahrtspiel on Mar 31. Volume climbed from ~4.5 km back to ~8 to 9 km. Your effort notes confirm you managed this with good discipline.

Current state (April 2026): The Apr 2 run (Grundlage + strides, 4.7 km) noted a "slight illness feeling." Effort was moderate (43), volume pulled back, and no session since. This is the right call if you weren't feeling well.

Key metrics: hrMax 197, hrLth 172, FTPa 340 sec/km (= 5:40/km). Aerobic paces currently running around 6:50 to 7:15/km, which is still ~25 to 30 sec/km slower than comeback-peak, normal after the illness interruption.

Plan approach: The next 2 weeks should be a careful "re-stabilisation" block. Confirm health first, rebuild the aerobic base volume, keep one quality session (strides) per week, no new intensity introduced yet. 4 runs/week target, 3 easy + 1 with strides.

Now creating the plan in Tredict.

Claude conversation showing a detailed 2-week post-illness training plan created directly in Tredict with structured phases and relative targets — Claude's training plan. It picks up the full 6-month history, gets the paces and efforts right, and writes the plan directly into Tredict.

Here is the real training plan created by Claude Sonnet 4.6 in Tredict. If you want to get an in-depth picture of a meaningful AI training plan that can be iterated on, click on it.

Re-Stabilisation – 2 Weeks Post-Illness (April 2026)

Generic input, generic output

But the choice of AI assistant is only one part of the story. Another part is what data it has to work with.

When you describe your fitness level in a chat, you are giving the AI a rough sketch. You might say you run three times a week and finished a half marathon last year. The model takes that and generates something plausible based on general training principles. It sounds reasonable. But it has no idea what your actual threshold pace is, how your heart rate behaves on easy runs, whether your aerobic base is solid or hollow, or how you responded to load over the past months.

Generic input produces generic output. The plan fits the description you gave, not the body you have.

This is also why automated fitness assessments built into training platforms often frustrate people. If the underlying data is thin or the algorithm misreads a session, the resulting plan is calibrated wrong from the start. Too hard, too easy, or simply structured in a way that does not match how you actually respond to training stress.

Each session feeds the next one

Adjusting a training plan based on how things are going is nothing new. Tredict has always made that easy. You can see your form curve, compare what you planned against what you actually did, and move sessions around. The data is there, the tools are there. But you still need the willingness to sit down, read the data and decide what to change. Or a coach who does it for you.

What changes with an AI that can read your data is that you get a conversation partner for that process. After a run I would come back to Claude and ask how it went. Claude pulled the session from Tredict, looked at the heart rate curve, compared the actual paces to what was planned, checked where I sat in my zones. Sometimes the verdict was straightforward: the easy run was easy, move on. Other times it spotted something. Heart rate drifting higher than expected at the same pace, or a threshold segment where I held the pace but the effort cost more than it should have.

From that read it adjusted what came next. Not a full rewrite, but small corrections. An extra recovery day, a shorter interval block, a long run pushed back by two days. The plan evolved with the training instead of sitting above it.

What I did not expect is how much I started to enjoy this process. Going back through a session with Claude became a way of actually reflecting on the training, not just logging it and moving on. You notice things in that conversation that you would not catch glancing at a summary screen. During my comeback, my second run felt harder than the first even though pace and distance were almost identical. Claude pulled both sessions and showed me why. The average heart rate had jumped from 142 to 148 bpm at the same speed. The body was not fully recovered yet, working harder for the same output. At one point Claude noticed that comparing ground contact times between runs at different speeds was misleading and proposed normalizing the values to a reference pace. It ran a regression through the data points and showed that the real efficiency difference between two runs was smaller than the raw numbers suggested. These are the kinds of observations that add up over weeks to a much better understanding of where you actually are.

The iterative rhythm also builds something that a one-shot plan cannot: continuity. Each session carries context from the ones before it. By week three, Claude was not just planning from my training history, it was planning from a shared history of conversations, adjustments and decisions we had made together. That is a fundamentally different starting point than asking for a fresh plan every Monday.

Claude-generated scatter plot showing ground contact time normalized to a reference pace, with regression lines comparing two runs — Ground contact time normalized to a reference pace. Claude built this regression unprompted after noticing that raw GCT comparisons between runs at different speeds were misleading.

Relative targets, absolute fit

One thing that makes this work well in practice is how Tredict handles workout targets. When Claude creates a workout, it does not write absolute values like "run at 145 bpm". It uses relative targets. 70 percent of your max heart rate, or 120 percent of your functional threshold pace. These values adjust automatically to your current capacity. Even if your zones are not perfectly calibrated, the relative structure of the workout stays correct. The AI brings the training theory, Tredict handles the math.

Caveats

The AI can misread data, and confident-sounding output is not always correct output. If you have unusual patterns in your training history, the model might interpret them wrong. Through the Tredict MCP Server it already has access to your sleep and HRV data, so it can factor in recovery and stress without you saying anything. But things like nutrition, work stress or how a niggle in your knee actually feels are only in the picture if you mention them. The more context you give it, the better the output gets. But there is a paradox here! More data also fills up the context window, which at some point causes the same problems. ChatGPT and Mistral hit that wall early, Claude hits it later, but all of them hit it eventually. You need a feel for when the conversation has gotten too long and it is time to start a fresh one.