Coding Agents and a Spooky Kaggle Challenge: Benchmarking Lightweight ML Automation

Why revisit spooky authors in 2025

Agentic ML tooling is having a moment. State-of-the-art systems like DeepMind’s MLE-STAR and Meta’s AIRA-dojo frequently place in the top 10% of Kaggle competitions on MLE-bench.¹ They’re powerful but heavy.

Day to day, I mainly use coding agents with a detailed text file for prompting and some shell scripts to set things up. Many of you probably do the same. I wanted to test how far that lightweight stack can go.

To find out, I implemented a “Memento”-style outer loop, where I encourage the coding agent to pass messages to its “future self” (the next invocation of the agent) in an organized way.

A tiny orchestration rig

See my template repo for the set up. Surprisingly, you don’t need much special tooling for it to work well. There is zero Python code at the start, instead just a few boilerplate files:

user_prompt.txt – a one-page spec that instructs the agent to perform a disciplined “outer” loop (because the agent already handles the inner loop): load context, declare a budget, run 2–3 experiments, journal what worked and what didn’t, plan next steps, update the status, and save the best results.
setup_env.sh – stands up a virtualenv, installs pinned deps, and even pre-downloads NLTK packages so the agent never has to ask for credentials.
prompt.sh & run_iterations.sh – thin wrappers that activate the venv, launch the chosen CLI (claude, gemini, codex, cursor), and optionally auto-commit after each loop.

That’s it. The agent sees the repo exactly like a data scientist would: git history, prior experiments, and a scratchpad of ideas. This “minimal stack” was enough for your coding agent to discover multi-seed ensembles and run 60 experiments in under a day.

Feel free to use this agentic template. Instructions are there for trying it out yourself.

The test bed

The classic Spooky Author Identification playground competition. The task is to classify short horror passages by Edgar Allan Poe, Mary Shelley, or H. P. Lovecraft. For example, which author wrote the following passage?

“My host was now leading the way down cellar to his actual studio, and I braced myself for some hellish effects among the unfinished canvases.”

What I learned

I ran four different coding agents (Codex, Claude, Gemini and Cursor²) on the Spooky Author Identification task. Each agent is allowed to run 10 iterations. Each agent ended up spending somewhere from 2.5 hours to 13 hours having fun on my mac mini. Here’s what I learned:

Coding agents are quite capable and trained a strong baseline model. I expected the agents to struggle because I didn’t provide specific instructions or tools. Author attribution from short text snippets can be tricky, requiring models to pick up on stylistic tics and long-tail vocabulary patterns. But watching them work, I realized they figure out quite a few things on their own, helping me discover good ingredients I didn’t think of. All of the agents matched or beat the Kaggle median (and Kaggle participants know a thing or two about data science), some comfortably above the median. None of them have reached the bronze cutoff (top 10%) yet. But, there are lot of room for improving how I use them. My prompt is probably far from optimal. The number of iterations may not be enough. And the coding agent was not on the strongest LLM yet.

Automation moves fast but leaves a mess. All four agents sprawled dozens of experiment entries, cached matrices, and partially refactored modules. They ship improvements, yet still requires a human expert to reconcile redundant scripts, prune dead notebooks, and audit for data leakage before anything is production ready. I expect cleanup will take significant time.

The last mile remains human. Claude pushed CV log loss down to 0.3447 with a TF-IDF + MLP ensemble, better than the others but still a hike away from the bronze cutoff (top 10%). Agentic ML tools democratize experimentation, but careful experiment design, data quality checks, and smart engineering still demand deliberate expertise. I do not see myself and fellow folks out of job any time soon, and I am confident we can all adapt, adopt, and super-charge ourselves.

How the agents actually performed

The main metric is log loss (lower is better). The Kaggle leaderboard median sits at 0.4188, and the bronze cutoff (top 10%) is 0.2938. The table below shows each agent’s score compared to that median: negative numbers mean the agent beat the median.

Agent	Log Loss	vs. Kaggle Median	Summary
Claude	0.37222 (reached 0.3447 but regressed)	-0.047	3-seed TF-IDF + MLP blend. Stacking meta-learner backfired.
Codex	0.35897	-0.060	Multi-seed TF-IDF + logistic ensemble.
Composer	0.38715	-0.032	TF-IDF + logistic with sublinear scaling. Exploring vocab tuning.
Gemini†	0.42398	+0.005	Word+char features. Over-regularized at first (C=0.1 → 0.696).

† Gemini ran on gemini-flash-2.5 because my gemini-pro-2.5 quota was exhausted. Sorry Gemini! Will give the pro version a try another time.

Is there a clear winner among the four agents? Not quite. With one task and one run per agent, these deltas are easily within noise. No overall champion crowned. But three of four agents beat the Kaggle median on their first serious attempt, which suggests the orchestration pattern itself is sound.

What struck me was how differently each agent approached the problem. Codex went wide with multi-seed ensembles. Claude pushed harder on model architecture (adding MLPs). Composer methodically swept hyperparameters. Gemini (2.5-flash) stumbled early with over-regularization but corrected course. Each strategy reflects the underlying model’s tendencies, but all converged on TF-IDF as the feature foundation.

Representative session logs (expand for highlights from each agent’s run)

Codex full sweep: five-model ensemble where stylometric probabilities took 50% of the final weight
Composer mid-run: Oct 29 session documented a 4.4% log-loss drop after raising max_features to 25k
Claude iteration: stacking meta-learner chewed 138 minutes only to land 10.4% worse than simple blend
Gemini early run: over-regularization with C=0.1 spiked to 0.696 log loss before correcting course

What the agents leave behind

Looking at the work products, I realize the hardest part is not the setup or the experimentation. It is the cleanup.

Automation litters the repo. Codex’s run now tracks five variants of the same logistic pipeline, complete with separate OOF dumps, weight search scripts, and registry YAMLs. Composer’s train.py mixes LightGBM, logistic regression, handcrafted features, and a sentence-transformer branch inside a single file that keeps toggling SKIP_EXISTING_EXPERIMENTS. The agents do not delete anything; they prototype, leave artifacts behind, and move on.

There is valuable signal inside the noise. The streaming JSON logs and the journal files double as a lab notebook: one Codex session diagnosed short texts (21–81 characters) as the chief failure mode (76.8% accuracy and 0.566 log loss vs. 93.7% / 0.204 for long passages) and immediately reprioritized feature work around that gap. That’s valuable signal buried in transcript.

The messy bits are the code paths, not the telemetry. Before shipping any of this to production, you need to:

Restore clean, reproducible code – to make your life easier.
Audit data usage – to be sure the metrics are real, check if there is data leakage, faulty experiental design, and stress test the model on a hold out dataset.
Normalize experiment logging – ensure experiments.csv retains consistent schemas so future analysis can reason about which parameters actually mattered.

Even if cleanup requires a lot of time, I still view agentic ML as a net win (they generated hypotheses and ran experiments faster than I could manually), but it’s not as simple as pressing button to get the solution.

Detailed experiment progression (expand for iteration-by-iteration findings)

Codex

Iteration 1: Word+char TF-IDF logistic slashed log loss from 0.4660 to 0.3875 and surfaced author-specific tokens (Poe’s “of the/upon”, Lovecraft’s “though/west”, Shelley’s character cues), validating the baseline. Journal
Iterations 2–4: Cross-seed checks and char-vocabulary diagnostics showed the 0.3819 gain was variance-prone. Not every bet landed: 256-component SVD exploded to 0.59 log loss. Journal 1 · Journal 2
Iteration 5: Built out OOF persistence, averaged three min_df=2 seeds plus a min_df=3 variant, trimming log loss to 0.3764 while keeping training strictly on folds. Journal
Iteration 6: Diagnostics catalogued LightGBM’s collapse (≥0.49 log loss) and the HPL→EAP confusion hotspot (585 errors), steering work toward Lovecraft-specific features. Journal
Iterations 7–9: Further C tuning, repeated-CV variance checks, and Lovecraft token whitelists, concluding remaining gains require smarter feature curation rather than more seeds. Journal 1 · Journal 2
Clock time: ~4 h 52 m between first and last logs (idle gaps included).

Composer

Kickoff: TF-IDF + logistic regression established a 0.4811 baseline; LightGBM and quick ensembles underperformed, proving the task is mostly linear. Journal
Early experiments: Sentence-transformer embeddings bombed at 0.6715, underscoring that semantics alone can’t beat stylistic n-grams for authorship. Logistic tuning (C≈5) slashed log loss to 0.4522. Journal 1 · Journal 2
Mid-run: Joint tuning of C and vocabulary width delivered the biggest gains, with most lift from expanding max_features to 10k and 25k, dropping to 0.4275. Journal
Late game: Coordinated tuning of C, n-gram ranges, and vocabulary size landed at 0.3984; adding sublinear_tf=True with C=4.5 finished at 0.3943 and shipped the submission. Along the way, a “stylometric booster” cratered performance to 0.680. Journal 1 · Journal 2
Clock time: ~6 h 59 m across Composer’s logged iterations.

Claude

Baseline: Logistic regression beat LightGBM (0.43 vs 0.59) out of the gate, confirming sparse TF-IDF prefers linear models. Early error analysis quantified the short-text tax (24.7% error under 10 words). Journal
Early tuning: Lower regularization (C≈10) tightened CV log loss to 0.3814, revealing dominant confusions (MWS→EAP 10.7%, HPL→EAP 10.1%). Journal 1 · Journal 2
Breakthrough: A modest MLP (256→128) hit 0.3656. Learning-rate tuning dropped it to 0.3519, and blending it 30/70 with logistic yielded a robust 0.3495. Journal 1 · Journal 2
Peak: Averaging three seeds for each model produced the 0.3447 best-in-class ensemble without destabilizing variance. Journal
Cautionary tail: Stacking meta-learner ground for 2.3 hours, overshooting budget, yet finished 10.4% worse than the simple blend. Five seeds and batch norm also backfired (2.72% and 36% worse). Journal 1 · Journal 2
Clock time: ~13 h 55 m between Claude’s earliest and latest logs (stacking iterations account for much of it).

Gemini (2025-10-30)

Baseline: Initial TF-IDF logistic run clocked 0.557 log loss, giving the agent a concrete hill to climb. An overly strong regularization run (C=0.1) erupted to 0.696 before the agent corrected course. Journal
Feature mix: Adding char n-grams and text-length features while relaxing C dropped CV log loss into the 0.43 range. Journal
Current best: Tuning character sublinear_tf/use_idf combinations landed at 0.4299 CV and 0.42398 on the grader, still above the Kaggle median but short of medal territory. Journal
Clock time: ~2 h 50 m for the logged Gemini run (on gemini-flash-2.5).

Run-time estimates derive from file modification times of the earliest and latest logs/*.log entries; they include idle gaps between iterations.

Appendix: Full repository links

All code, experiment artifacts, and raw stream-json transcripts are public:

Codex: Status · Setup script · Session log
Composer: Status · Early session · Final session
Claude: Status · Session log
Gemini: Status · Session log

Footnotes

MLE-bench is a benchmark comprising 75 Kaggle competitions for evaluating ML agents (GitHub repo). MLE-bench lite is a curated subset of 22 tasks. Both MLE-STAR and AIRA-dojo use bespoke multi-agent frameworks: MLE-STAR combines web search for model discovery with targeted refinement guided by ablation studies (Nam et al., 2025), achieving medals in 64% of MLE-bench lite competitions. AIRA-dojo provides specialized operators and multiple search policies (greedy, MCTS, evolutionary) to explore solution spaces (research paper), achieving 47.7% medal rate on MLE-bench lite.↩︎
Cursor with the new composer-1 model.↩︎