What I learned
I ran four different coding agents (Codex, Claude, Gemini, and Cursor²) on the Spooky Author Identification task. Each agent was allowed 10 iterations, and each ended up spending anywhere from 2.5 to 13 hours having fun on my Mac mini. Here’s what I learned:
Coding agents are quite capable and trained a strong baseline model. I expected the agents to struggle because I didn’t provide specific instructions or tools. Author attribution from short text snippets can be tricky, requiring models to pick up on stylistic tics and long-tail vocabulary patterns. But watching them work, I realized they figure out quite a few things on their own, helping me discover good ingredients I didn’t think of. All of the agents matched or beat the Kaggle median (and Kaggle participants know a thing or two about data science), some comfortably above it. None of them reached the bronze cutoff (top 10%) yet. But there is a lot of room to improve how I use them: my prompt is probably far from optimal, the iteration budget may be too small, and none of the agents were running on the strongest available LLM.
Automation moves fast but leaves a mess. All four agents left behind dozens of experiment entries, cached matrices, and partially refactored modules. They ship improvements, yet it still takes a human expert to reconcile redundant scripts, prune dead notebooks, and audit for data leakage before anything is production-ready. I expect cleanup to take significant time.
The last mile remains human. Claude pushed CV log loss down to 0.3447 with a TF-IDF + MLP ensemble, better than the others but still a hike away from the bronze cutoff (top 10%). Agentic ML tools democratize experimentation, but careful experiment design, data quality checks, and smart engineering still demand deliberate expertise. I do not see myself and my fellow practitioners out of a job any time soon, and I am confident we can all adapt, adopt, and supercharge ourselves.
How the agents actually performed
The main metric is log loss (lower is better). The Kaggle leaderboard median sits at 0.4188, and the bronze cutoff (top 10%) is 0.2938. The table below shows each agent’s score compared to that median: negative numbers mean the agent beat the median.
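To make the metric concrete, here is a minimal sketch of multi-class log loss in plain Python (the clipping constant is illustrative; Kaggle's grader uses a similar clip to avoid infinite penalties):

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss: mean negative log of the probability
    assigned to the true class (lower is better)."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) on overconfident wrong predictions.
        total += -math.log(min(max(p[label], eps), 1 - eps))
    return total / len(y_true)

# A uniform guess over the 3 authors scores ln(3) ≈ 1.0986,
# which puts the 0.4188 leaderboard median in perspective.
uniform = log_loss([0, 1, 2], [[1/3, 1/3, 1/3]] * 3)
```

Because the loss is unbounded for confident wrong answers, most of the agents' gains came from calibration as much as accuracy.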
| Agent | Log Loss | vs. Kaggle Median | Summary |
|---|---|---|---|
| Claude | 0.37222 (reached 0.3447 but regressed) | -0.047 | 3-seed TF-IDF + MLP blend. Stacking meta-learner backfired. |
| Codex | 0.35897 | -0.060 | Multi-seed TF-IDF + logistic ensemble. |
| Composer | 0.38715 | -0.032 | TF-IDF + logistic with sublinear scaling. Exploring vocab tuning. |
| Gemini† | 0.42398 | +0.005 | Word+char features. Over-regularized at first (C=0.1 → 0.696). |
† Gemini ran on gemini-flash-2.5 because my gemini-pro-2.5 quota was exhausted. Sorry Gemini! Will give the pro version a try another time.
Is there a clear winner among the four agents? Not quite. With one task and one run per agent, these deltas are easily within noise, so no overall champion is crowned. But three of the four agents beat the Kaggle median on their first serious attempt, which suggests the orchestration pattern itself is sound.
What struck me was how differently each agent approached the problem. Codex went wide with multi-seed ensembles. Claude pushed harder on model architecture (adding MLPs). Composer methodically swept hyperparameters. Gemini (2.5-flash) stumbled early with over-regularization but corrected course. Each strategy reflects the underlying model’s tendencies, but all converged on TF-IDF as the feature foundation.
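The shared recipe all four converged on can be sketched in a few lines of scikit-learn. This is my own minimal reconstruction, not any agent's actual script; the toy sentences and hyperparameters are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

# Toy stand-ins for the Spooky Author snippets and labels.
texts = ["Once upon a midnight dreary, while I pondered, weak and weary.",
         "The oldest and strongest emotion of mankind is fear.",
         "I beheld the wretch, the miserable monster whom I had created."]
labels = ["EAP", "HPL", "MWS"]

# Word and character n-gram TF-IDF, the feature foundation every
# agent landed on; char_wb n-grams capture stylistic tics.
features = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2), sublinear_tf=True),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
)
model = make_pipeline(features, LogisticRegression(C=5.0, max_iter=1000))
model.fit(texts, labels)
probs = model.predict_proba(texts)  # one row per snippet, 3 author columns
```

The differences between agents were mostly in what they bolted on top of this core: extra seeds, an MLP, or hyperparameter sweeps.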
Representative session logs — highlights from each agent’s run
- Codex full sweep: five-model ensemble where stylometric probabilities took 50% of the final weight
- Composer mid-run: Oct 29 session documented a 4.4% log-loss drop after raising `max_features` to 25k
- Claude iteration: stacking meta-learner chewed 138 minutes only to land 10.4% worse than the simple blend
- Gemini early run: over-regularization with C=0.1 spiked to 0.696 log loss before correcting course
What the agents leave behind
Looking at the work products, I realize the hardest part is not the setup or the experimentation. It is the cleanup.
Automation litters the repo. Codex’s run now tracks five variants of the same logistic pipeline, complete with separate OOF dumps, weight search scripts, and registry YAMLs. Composer’s `train.py` mixes LightGBM, logistic regression, handcrafted features, and a sentence-transformer branch inside a single file that keeps toggling `SKIP_EXISTING_EXPERIMENTS`. The agents do not delete anything; they prototype, leave artifacts behind, and move on.
There is valuable signal inside the noise. The streaming JSON logs and the journal files double as a lab notebook: one Codex session diagnosed short texts (21–81 characters) as the chief failure mode (76.8% accuracy and 0.566 log loss vs. 93.7% / 0.204 for long passages) and immediately reprioritized feature work around that gap. That’s valuable signal buried in the transcripts.
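That kind of length-bucketed diagnosis is easy to reproduce. A minimal sketch (my own, not Codex's code; the bucket edges here are illustrative, not the agent's 21–81 split):

```python
import math
from collections import defaultdict

def loss_by_length(texts, y_true, probs, edges=(0, 80, 200, 10_000)):
    """Average per-example log loss bucketed by character length,
    to surface failure modes like the short-text gap."""
    buckets = defaultdict(list)
    for text, label, p in zip(texts, y_true, probs):
        n = len(text)
        for lo, hi in zip(edges, edges[1:]):
            if lo <= n < hi:
                buckets[(lo, hi)].append(-math.log(max(p[label], 1e-15)))
                break
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```

Run over OOF predictions, a table like this is exactly the “short texts are the chief failure mode” finding, minus the narrative.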
The messy bits are the code paths, not the telemetry. Before shipping any of this to production, you need to:
- Restore clean, reproducible code – so the winning pipeline can be re-run end to end without archaeology.
- Audit data usage – to be sure the metrics are real: check for data leakage and faulty experimental design, and stress-test the model on a held-out dataset.
- Normalize experiment logging – ensure `experiments.csv` retains a consistent schema so future analysis can reason about which parameters actually mattered.
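The leakage audit mostly comes down to one discipline: fit the vectorizer inside each CV fold, never on the full corpus before splitting. A minimal sketch of the leakage-safe pattern (scikit-learn, toy data; not taken from any agent's repo):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["dreary raven midnight", "eldritch cyclopean horror",
         "wretch monster creation"] * 10   # toy corpus
labels = ["EAP", "HPL", "MWS"] * 10

# Putting TfidfVectorizer inside the pipeline means it is refit on
# each training fold only; fitting it on the full corpus first would
# leak test-fold vocabulary and IDF statistics into training.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="neg_log_loss")
```

Any agent script that builds the TF-IDF matrix once and then cross-validates on top of it fails this audit, even if its CV numbers look great.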
Even if cleanup takes a lot of time, I still view agentic ML as a net win (they generated hypotheses and ran experiments faster than I could manually), but it’s not as simple as pressing a button and getting a solution.
Detailed experiment progression — iteration-by-iteration findings
Codex
- Iteration 1: Word+char TF-IDF logistic slashed log loss from 0.4660 to 0.3875 and surfaced author-specific tokens (Poe’s “of the/upon”, Lovecraft’s “though/west”, Shelley’s character cues), validating the baseline. Journal
- Iterations 2–4: Cross-seed checks and char-vocabulary diagnostics showed the 0.3819 gain was variance-prone. Not every bet landed: 256-component SVD exploded to 0.59 log loss. Journal 1 · Journal 2
- Iteration 5: Built out OOF persistence, averaged three min_df=2 seeds plus a min_df=3 variant, trimming log loss to 0.3764 while keeping training strictly on folds. Journal
- Iteration 6: Diagnostics catalogued LightGBM’s collapse (≥0.49 log loss) and the HPL→EAP confusion hotspot (585 errors), steering work toward Lovecraft-specific features. Journal
- Iterations 7–9: Further C tuning, repeated-CV variance checks, and Lovecraft token whitelists, concluding remaining gains require smarter feature curation rather than more seeds. Journal 1 · Journal 2
- Clock time: ~4 h 52 m between first and last logs (idle gaps included).
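Codex's iteration 5 — multi-seed out-of-fold averaging with training kept strictly on folds — is a reusable pattern. A sketch of the idea (my own reconstruction on toy data, not Codex's script; it varied `min_df` across runs, which I omit here):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def oof_probs(texts, y, seed, n_splits=5):
    """Out-of-fold probabilities: every row is predicted by a model
    that never saw that row during training."""
    texts, y = np.asarray(texts), np.asarray(y)
    out = np.zeros((len(texts), len(np.unique(y))))
    cv = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
    for tr, te in cv.split(texts, y):
        pipe = make_pipeline(TfidfVectorizer(),
                             LogisticRegression(max_iter=1000))
        pipe.fit(texts[tr], y[tr])
        out[te] = pipe.predict_proba(texts[te])
    return out

texts = ["dreary raven", "cyclopean horror", "wretch monster"] * 10
y = [0, 1, 2] * 10
# Averaging OOF predictions across seeds smooths fold-assignment noise,
# which is what made Codex's 0.3764 more trustworthy than a single split.
avg = np.mean([oof_probs(texts, y, seed) for seed in (0, 1, 2)], axis=0)
```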
Composer
- Kickoff: TF-IDF + logistic regression established a 0.4811 baseline; LightGBM and quick ensembles underperformed, proving the task is mostly linear. Journal
- Early experiments: Sentence-transformer embeddings bombed at 0.6715, underscoring that semantics alone can’t beat stylistic n-grams for authorship. Logistic tuning (C≈5) slashed log loss to 0.4522. Journal 1 · Journal 2
- Mid-run: Joint tuning of `C` and vocabulary width delivered the biggest gains, with most lift from expanding `max_features` to 10k and 25k, dropping to 0.4275. Journal
- Late game: Coordinated tuning of `C`, n-gram ranges, and vocabulary size landed at 0.3984; adding `sublinear_tf=True` with `C=4.5` finished at 0.3943 and shipped the submission. Along the way, a “stylometric booster” cratered performance to 0.680. Journal 1 · Journal 2
- Clock time: ~6 h 59 m across Composer’s logged iterations.
Claude
- Baseline: Logistic regression beat LightGBM (0.43 vs 0.59) out of the gate, confirming sparse TF-IDF prefers linear models. Early error analysis quantified the short-text tax (24.7% error under 10 words). Journal
- Early tuning: Lower regularization (C≈10) tightened CV log loss to 0.3814, revealing dominant confusions (MWS→EAP 10.7%, HPL→EAP 10.1%). Journal 1 · Journal 2
- Breakthrough: A modest MLP (256→128) hit 0.3656. Learning-rate tuning dropped it to 0.3519, and blending it 30/70 with logistic yielded a robust 0.3495. Journal 1 · Journal 2
- Peak: Averaging three seeds for each model produced the 0.3447 best-in-class ensemble without destabilizing variance. Journal
- Cautionary tale: Stacking meta-learner ground for 2.3 hours, overshooting the budget, yet finished 10.4% worse than the simple blend. Five seeds and batch norm also backfired (2.72% and 36% worse, respectively). Journal 1 · Journal 2
- Clock time: ~13 h 55 m between Claude’s earliest and latest logs (stacking iterations account for much of it).
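Claude's winning move — the 30/70 MLP + logistic blend — is just a weighted average of probability matrices. A minimal sketch (the numbers below are made up; only the 0.3/0.7 weighting comes from the run):

```python
import numpy as np

def blend(prob_list, weights):
    """Weighted average of per-model class-probability matrices,
    renormalized so each row sums to 1. Claude's blend is this
    with weights (0.3, 0.7) over (MLP, logistic) predictions."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    out = sum(w * np.asarray(p) for w, p in zip(weights, prob_list))
    return out / out.sum(axis=1, keepdims=True)

mlp = np.array([[0.6, 0.3, 0.1]])     # hypothetical MLP probabilities
logit = np.array([[0.4, 0.4, 0.2]])   # hypothetical logistic probabilities
combo = blend([mlp, logit], (0.3, 0.7))  # → [[0.46, 0.37, 0.17]]
```

The contrast with the failed stacking run is instructive: a fixed blend has two tunable numbers, while a meta-learner adds a whole model that can overfit the OOF predictions it trains on.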
Gemini (2025-10-30)
- Baseline: Initial TF-IDF logistic run clocked 0.557 log loss, giving the agent a concrete hill to climb. An overly strong regularization run (C=0.1) erupted to 0.696 before the agent corrected course. Journal
- Feature mix: Adding char n-grams and text-length features while relaxing `C` dropped CV log loss into the 0.43 range. Journal
- Current best: Tuning character `sublinear_tf`/`use_idf` combinations landed at 0.4299 CV and 0.42398 on the grader, still above the Kaggle median but short of medal territory. Journal
- Clock time: ~2 h 50 m for the logged Gemini run (on `gemini-flash-2.5`).
Run-time estimates derive from file modification times of the earliest and latest `logs/*.log` entries; they include idle gaps between iterations.
Appendix: Full repository links
All code, experiment artifacts, and raw stream-json transcripts are public:
- Codex: Status · Setup script · Session log
- Composer: Status · Early session · Final session
- Claude: Status · Session log
- Gemini: Status · Session log
Footnotes
MLE-bench is a benchmark comprising 75 Kaggle competitions for evaluating ML agents (GitHub repo). MLE-bench lite is a curated subset of 22 tasks. Both MLE-STAR and AIRA-dojo use bespoke multi-agent frameworks: MLE-STAR combines web search for model discovery with targeted refinement guided by ablation studies (Nam et al., 2025), achieving medals in 64% of MLE-bench lite competitions. AIRA-dojo provides specialized operators and multiple search policies (greedy, MCTS, evolutionary) to explore solution spaces (research paper), achieving 47.7% medal rate on MLE-bench lite.↩︎
Cursor with the new composer-1 model.↩︎