arxiv.org publishes ~200 ML papers per day. You can’t read them all; you need to choose carefully and read efficiently. This section is the methodology: the three-pass approach, the questions to ask at each pass, the red flags to spot, and the “replicate from description” test. Reading well is its own skill — most of the value comes from the 5% of papers you read deeply, identified by triaging the other 95%. This is what frontier-lab researchers actually do.
The three-pass reading method
Keshav 2007 (“How to Read a Paper”) outlined what’s now the standard methodology. Adapted for ML:
Pass 1 — Skim (5-10 minutes per paper):
Goal: decide whether to read further.
Read:
- Title + abstract + figure 1 (the "headline").
- Introduction (skim).
- Section headers + conclusion.
- Glance at references — what's the field?
Output:
- One sentence: what is this paper claiming?
- One sentence: what's the evidence?
- Stop reading if: (a) outside your interest, (b) clearly weak,
(c) you already know the result.
Pass 2 — Careful (1-2 hours per paper):
Goal: understand the contribution.
Read:
- All sections, including methodology.
- All figures and tables in detail.
- Skim related work to understand positioning.
- Note experimental setup: datasets, baselines, hyperparameters.
Output:
- Clear understanding of what was done.
- Identification of key claims and supporting evidence.
- List of questions / things you don't understand.
Pass 3 — Critical (4-8 hours per paper, only for important papers):
Goal: be able to defend or attack the paper.
Read:
- All sections, including math derivations.
- Re-implement parts mentally or in code.
- Read the original references for surprising claims.
- Find a friend who has read it; discuss disagreements.
Output:
- You could give a 30-minute talk on this paper.
- You can identify which parts of the methodology are strong vs hand-wavy.
- You have an opinion: is this a contribution or polish?
Most papers stop at Pass 1 (“not relevant” or “weak”). A few dozen per year reach Pass 2. Only the seminal papers (FlashAttention, Chinchilla, DPO, LoRA) deserve Pass 3.
What to look for in each pass
Pass 1 checklist (the skim):
- Is the title precise or marketing?
"Efficient X" vs "X Improved By 3.7% On Y" — the latter is more honest.
- Does figure 1 tell a story?
Good papers have a clear figure 1 that captures the contribution.
- Are the comparisons fair?
Same training budget? Same data? Same hyperparameter sweep?
Pass 2 checklist (the careful read):
- What's the claim? Make it specific. "Improves performance" isn't a claim;
"achieves 92.3% accuracy on MMLU vs 90.1% for baseline" is.
- What's the evidence? Tables and figures should back up the claim.
- Are the baselines strong? Compare to the actual SOTA, not a weakened version.
- Are the ablations comprehensive? Removing each component should show its contribution.
- Are the hyperparameters justified? Often "we swept LR" hides "we ran 500 configs."
- Is the code released? Released code is the strongest signal that a result can be checked.
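The sweep question above matters because reporting the best of many runs inflates the headline number even when no configuration is truly better. A minimal simulation sketch, assuming every configuration has the same true accuracy and that run-to-run noise is Gaussian. Both assumptions are illustrative, and `expected_best`, the 0.90 accuracy, and the 0.005 noise level are made up for the example:

```python
import random
import statistics


def expected_best(n_configs, true_acc=0.90, noise_std=0.005, trials=500, seed=0):
    """Average best-of-n score when every config has the same true accuracy.

    All numbers are hypothetical; noise is assumed Gaussian for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best_scores = []
    for _ in range(trials):
        runs = [rng.gauss(true_acc, noise_std) for _ in range(n_configs)]
        best_scores.append(max(runs))  # what "we report our best run" keeps
    return statistics.mean(best_scores)


if __name__ == "__main__":
    for n in (1, 10, 500):
        print(f"best of {n:>3} configs: {expected_best(n):.4f}")
```

The reported best grows with the number of configurations tried even though nothing improved, which is why the size of the search matters when the baseline got fewer tries.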
Pass 3 checklist (the critical read):
- Could you replicate from the description alone?
- Are the equations dimensionally consistent?
- Do the proofs (if any) actually prove what they claim?
- Are there unstated assumptions that limit generality?
- Does the conclusion match the evidence, or does it overstate?
Red flags
Overclaim (paper red flag): a paper's claim that significantly exceeds what its evidence supports. Examples: 'state-of-the-art' when the baseline is weak; 'general' when only one dataset is tested; 'works across model sizes' when only one size is shown. It is common in ML, where the pressure to publish frontier results encourages stretching. Detection: read the abstract's strongest claim, then verify it in the experimental tables; the gap is the overclaim.
Overclaiming is the most common issue:
Red flags to watch for:
1. STRONG ABSTRACT vs WEAK TABLE.
Abstract: "achieves SOTA on language modeling."
Table: "MMLU 0.5% better than baseline that's 2 years old."
→ Overclaim. The improvement is small and the baseline is weak.
2. CHERRY-PICKED COMPARISONS.
Method comparison table shows the proposed method beating baseline A, B, C.
But the paper doesn't compare to D, which is well-known and beats all three.
→ Cherry picking by omission.
3. UNFAIR BUDGETS.
Proposed method runs for 100 hours; baseline runs for 10 hours.
The result is comparable. "Our method is better" — but is it?
4. NO ABLATION.
The proposed method is "components X + Y + Z" but no test removes any.
You don't know which component does the work.
5. CONVENIENT BENCHMARKS.
Custom benchmark designed to favor the proposed method.
The paper avoids evaluation on standard benchmarks (HellaSwag, MMLU, etc.).
6. NO CODE.
A paper without released code is harder to trust.
Especially for results that depend on careful hyperparameter tuning.
7. PROMOTIONAL LANGUAGE.
"Revolutionary", "novel", "first", "unprecedented".
Real contributions speak softer: "we propose", "we observe".
8. TINY EFFECT SIZES.
"1.2% better than baseline" with no statistical significance test.
For typical benchmarks, the noise floor is ~0.5-1.0%.
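One way to check red flag 8 is to compare the gap against seed-to-seed variation. A minimal sketch using Welch's t-statistic. The per-seed accuracies are invented for illustration, and treating |t| around 2 as the borderline is only a rough rule of thumb, not a substitute for a proper test:

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical accuracies (%) from five seeds each; not taken from any paper.
proposed = [71.9, 73.4, 72.1, 73.8, 72.3]
baseline = [71.0, 72.4, 70.7, 72.2, 71.2]

gap = statistics.mean(proposed) - statistics.mean(baseline)
t_stat = welch_t(proposed, baseline)
print(f"mean gap = {gap:.2f} points, Welch t = {t_stat:.2f}")
```

With only five seeds per arm, a gap of this size sits near the edge of what the noise allows. A paper that reports a single run gives you no way to make even this check.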
Exercise: you have 100 new papers to triage this week. How do you spend the time?
Day 1 strategy:
You can do ~30-50 Pass 1 skims in a day if focused. That’s 100 papers in 2-3 days.
For each paper:
Read title + abstract (1 minute).
Look at figure 1 + conclusion (2 minutes).
Decide: relevant? worth Pass 2? — 1 minute.
Out of 100 papers: ~70 will be filtered out at Pass 1 (off-topic, clearly weak, already-known result). ~25 will reach “maybe Pass 2.” ~5 will be “definitely Pass 2.”
Day 2-3 strategy:
Pass 2 the ~5 definite candidates (5-10 hours total). Be thorough — these are your week’s deep reading.
Re-skim the ~25 “maybes”. Most will turn out not to deserve Pass 2 either; a few will surprise you.
Of the ~30 candidates, only a handful get a detailed Pass 2 this week, and maybe 2-3 of those deserve Pass 3 (re-reading with code or mental simulation). Save those for the following week.
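The arithmetic behind this schedule can be sketched in a few lines. The function name and defaults are illustrative; the numbers come from the per-paper timings above (about 4 minutes per skim, 1-2 hours per Pass 2):

```python
def weekly_reading_budget(n_papers=100, pass1_minutes=4, n_definite=5,
                          pass2_hours=(1, 2)):
    """Rough time budget for the triage funnel described above.

    pass1_minutes = 1 (title + abstract) + 2 (figure 1 + conclusion) + 1 (decision).
    """
    pass1_hours = n_papers * pass1_minutes / 60
    pass2_low = n_definite * pass2_hours[0]
    pass2_high = n_definite * pass2_hours[1]
    return {
        "pass1_hours": pass1_hours,
        "pass2_hours": (pass2_low, pass2_high),
        "total_hours": (pass1_hours + pass2_low, pass1_hours + pass2_high),
    }


budget = weekly_reading_budget()
print(budget)
```

Under these numbers the skims take under seven hours and the five Pass 2 reads take 5-10 hours, so the week's reading comes to roughly 12-17 hours.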
How to find the candidates:
arxiv-sanity-preserver, arxiv.org listings filtered by your subfields.
Twitter/X following 30-50 researchers whose taste you trust.
You’re learning to predict which papers will matter in 6 months. This skill builds slowly. Some heuristics:
Papers from teams with track records (Anthropic, OpenAI, DeepMind, etc.) often deserve attention.
Papers proposing a simpler version of an existing complex method often matter more than they look.
Papers with surprising negative results (e.g., “PPO doesn’t help as much as we thought”) matter as much as positive results.
Papers in obscure venues that get rapidly cited deserve attention.
The 90/10 rule: 10% of papers contain 90% of the long-term value. The reading skill is identifying that 10% quickly.
↳ §27.1 + Keshav 2007
The “replicate from description” test
Replicate-from-description (critical reading method): a test of how well a paper describes its method. After reading, try to implement the method mentally (or in code) using ONLY the paper's description. If you can't, the paper is missing key details. Important details are often hidden in footnotes, appendices, or 'we use standard hyperparameters' (which usually aren't standard at all). A paper that fails this test should be read with skepticism; either crucial details are missing or the result depends on tribal knowledge not in the paper.
This is the toughest test:
The test:
After reading the paper, can you go to a fresh codebase and implement
the method using ONLY what's in the paper?
This requires understanding:
- Exact data preprocessing.
- Exact model architecture (sizes, normalisations, activations).
- Exact training recipe (optimizer, LR schedule, regularisation).
- Exact evaluation protocol.
If you can replicate: the paper is well-described, which is a good (though not conclusive) sign that it is trustworthy.
If you can't: there are gaps — could be in the paper, or could be that
you'd need to consult the released code (which is fine if it exists).
But if the code is NOT released, you should treat the paper with caution.
Common gaps in ML papers:
- "we use the standard hyperparameters" (which?)
- "trained until convergence" (define convergence)
- "we observed similar results on multiple seeds" (provide them)
- "we tuned hyperparameters" (the search space)
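The test above can be turned into a small checklist that flags missing or vague details in your notes. This is a sketch only: the field names, the `find_gaps` helper, and the example notes are hypothetical, and the vague-phrase list simply mirrors the common gaps listed above.

```python
# Details the test says you need, grouped as in the list above.
REQUIRED_DETAILS = {
    "data preprocessing": ["preprocessing"],
    "model architecture": ["sizes", "normalisation", "activation"],
    "training recipe": ["optimizer", "lr_schedule", "regularisation"],
    "evaluation protocol": ["protocol"],
}
# Phrases that sound like a specification but are not one.
VAGUE_PHRASES = ("standard hyperparameters", "until convergence",
                 "similar results", "we tuned")


def find_gaps(notes):
    """List every required detail that is absent or only vaguely described."""
    gaps = []
    for group, fields in REQUIRED_DETAILS.items():
        for name in fields:
            value = notes.get(name)
            if value is None:
                gaps.append(f"{group}: {name} not stated")
            elif any(phrase in value.lower() for phrase in VAGUE_PHRASES):
                gaps.append(f"{group}: {name} is vague ({value!r})")
    return gaps


# Hypothetical notes taken while reading a paper.
paper_notes = {
    "preprocessing": "lowercased, BPE with 32k merges",
    "sizes": "12 layers, hidden size 768",
    "normalisation": "pre-LayerNorm",
    "optimizer": "AdamW",
    "lr_schedule": "we use the standard hyperparameters",
    "regularisation": "dropout 0.1",
    "protocol": "trained until convergence, then evaluated on the test set",
}
gaps = find_gaps(paper_notes)
for gap in gaps:
    print(gap)
```

Each flagged item is something to look for in the appendix or the released code before trusting the headline number.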
Exercise: a paper claims a “3.5× speedup” over baseline X. What do you check before believing it?
What to check:
1. What’s measured?
- Wall-clock time? Throughput? Or theoretical FLOPs?
- Per-token, per-batch, or per-step?
- On what hardware? CPU? GPU? Specific model?
“3.5× speedup” without specifying the metric is meaningless.
2. Compared to what baseline?
- Is baseline X the STRONGEST baseline, or a deliberately-weakened one?
- Did the baseline get the same tuning effort?
- Is it state-of-the-art, or 2 years old?
3. Under what conditions?
- Same input size? Same precision?
- Same hardware (not “our method on A100, baseline on V100”)?
- Same software stack?
- Cold-start vs warm? Single query vs batch?
4. How was it measured?
- Multiple runs with statistical significance?
- Run on standard benchmarks (MLPerf, etc.)?
- Or a custom benchmark that may not generalise?
5. What’s the failure mode?
- Does the method break at large scales? Small batches? Long sequences?
- Does it require specific data distributions?
- Does it lose quality (recall, accuracy) in exchange for speed?
6. Implementation quality?
- Is the proposed method hand-optimised while the baseline is naive?
- E.g., custom CUDA kernel for proposed method vs PyTorch eager for baseline.
- This is the most common form of apparent “speedup” that doesn’t replicate.
7. Energy and cost, not just time?
- Does the method use 3.5× more memory for the speedup? More energy?
- Speedup at 10× memory cost may not be a net win.
If most of these check out: the claim is probably real.
If several are missing or evasive: the claim is exaggerated. The actual speedup might be 1.2× or even no speedup at all when fairly measured.
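Checks 1, 3 and 4 (what is measured, under what conditions, how often) can be made concrete with a small timing harness. A minimal CPU-only sketch: `time_fn`, `measured_speedup`, and the two toy functions are illustrative stand-ins, not anyone's benchmark. On a GPU you would also need to synchronise the device before reading the clock, since kernels launch asynchronously; that is omitted here.

```python
import statistics
import time


def time_fn(fn, *, warmup=3, repeats=10):
    """Return (median, stdev) of wall-clock seconds over `repeats` timed calls."""
    for _ in range(warmup):  # untimed calls so cold-start costs are excluded
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), statistics.stdev(samples)


def measured_speedup(baseline_fn, proposed_fn, **timing_kwargs):
    """Ratio of median baseline time to median proposed time (>1 means faster)."""
    baseline_median, _ = time_fn(baseline_fn, **timing_kwargs)
    proposed_median, _ = time_fn(proposed_fn, **timing_kwargs)
    return baseline_median / proposed_median


# Toy stand-ins for "baseline" and "proposed method"; both compute the same sum.
def loop_sum():
    total = 0
    for i in range(100_000):
        total += i
    return total


def builtin_sum():
    return sum(range(100_000))


assert loop_sum() == builtin_sum()  # same output, so the comparison is like-for-like
print(f"speedup: {measured_speedup(loop_sum, builtin_sum):.2f}x")
```

The toy comparison also illustrates check 6: the "speedup" here comes entirely from one side running in optimised C and the other in interpreted Python, not from a better algorithm.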
The dirty secret:
Many ML “speedup” claims reflect a genuine 1.5-3× improvement but are reported as 5-10× through careful framing. As a rough rule of thumb, treat a reported 5× as closer to 2× until it has been measured fairly. Calibrate accordingly.
↳ §27.1 + critical reading
Practical tips
Things that help:
1. KEEP NOTES. For each paper Pass 2'd, write 1 paragraph summary:
claim, evidence, your assessment, key formulas/figures.
2. RE-READ AT LEAST 1 PAPER PER MONTH. Returning to seminal papers
after a year reveals depth you missed. Example: re-read Vaswani et al. 2017
after writing your first transformer; you'll see new things.
3. SET UP READING CIRCLES. A weekly paper discussion with 3-5 people
forces deeper reading. Different people catch different things.
4. READ THE CODE. For important papers, the released code is the
ground truth. Reading both paper and code reveals what the paper
hides and what the code does differently.
5. DON'T BE INTIMIDATED BY MATH. The math in most ML papers is
simpler than the introduction makes it look. Work through it
step by step. Most "advanced" math is calculus + linear algebra
with notation.
6. SKIP THE RELATED WORK SECTION ON FIRST READING. It's often
a list of "see also." Focus on the contribution, not the bibliography.
7. FOCUS ON THE METHOD AND EXPERIMENTS. The intro and conclusion
can be skimmed; the substance is in §3 (method) and §4-5 (experiments).
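Tip 1 is easier to keep up if every note has the same shape. A minimal sketch of such a template; the `PaperNote` class and the example entry are hypothetical, with fields taken from the four items tip 1 lists:

```python
from dataclasses import dataclass, field


@dataclass
class PaperNote:
    """One Pass 2 note; field names mirror tip 1 and are otherwise arbitrary."""
    title: str
    claim: str
    evidence: str
    assessment: str
    key_items: list = field(default_factory=list)  # formulas / figures worth revisiting

    def to_paragraph(self):
        """Render the note as the one-paragraph summary suggested in tip 1."""
        items = ", ".join(self.key_items) if self.key_items else "none recorded"
        return (f"{self.title}. Claim: {self.claim} Evidence: {self.evidence} "
                f"Assessment: {self.assessment} Key items: {items}.")


# Invented example entry, not a real paper.
note = PaperNote(
    title="Hypothetical paper on method X",
    claim="X improves accuracy on benchmark Y by 2 points.",
    evidence="Table 2, three seeds, one model size.",
    assessment="Plausible but untested at other scales.",
    key_items=["Eq. 4", "Figure 3"],
)
print(note.to_paragraph())
```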
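Exercise: you re-implemented a paper's method and your result is 3% off the reported number. How do you diagnose the gap?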
Diagnosis steps:
1. Re-read the paper. Are there hyperparameters not in the main text but in the appendix? Footnotes? “We use the standard recipe” usually masks crucial details.
2. Compare to released code (if available). Does your implementation match the released code line-by-line? Common discrepancies:
Different random seed (try multiple seeds).
Different optimizer (AdamW vs Adam vs SGD).
Different LR schedule (cosine vs linear vs constant).
Different data preprocessing (tokenisation, padding, masking).
Different evaluation protocol (batch size, length, sampling).
3. Reach out to authors. Be polite and specific. “I followed the description in section 3.2 but got X instead of Y. Can you point to what I might be missing?”
4. Look for community replications. Has anyone else replicated this? GitHub issues on the paper repository often have similar replication discussions.
5. Test your assumption stack. What “facts” are you treating as given that might be wrong? Examples:
You’re using bf16, paper might have used fp32.
You’re using a 7B model, paper might have used 7.5B or 8B (different architecture details matter).
You’re evaluating on test set, paper might have used dev set.
6. The 3% gap could be:
Unstated hyperparameter difference (most common).
Lucky run by the original authors (they reported best of N).
Tuning effort difference (you tried 10 configs, they tried 1000).
7. Signs that the reported result itself may be unreliable:
Reported number is suspiciously round (e.g., 0.85 exactly).
Paper claims “consistently” but only shows one run.
Authors don’t respond to detailed inquiries.
Multiple replications fail.
8. Most common reality:
The 3% gap is REAL and represents “tuning effort” or “lucky seed.” The authors did substantial tuning that they didn’t fully document. Your implementation is correct; the difference is in the search budget.
What to do: report the gap. “Replicated within 3% of paper’s reported result; gap likely due to undocumented hyperparameter tuning.” This is honest and informative.
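For step 2, a mechanical diff of settings is often quicker than reading two codebases side by side. A minimal sketch, assuming both your run and the released code can be summarised as flat dictionaries; the keys, values, and `diff_configs` helper are hypothetical:

```python
MISSING = "<not set>"  # marker for a setting that appears on only one side


def diff_configs(mine, released):
    """Return {key: (my_value, released_value)} for every setting that differs."""
    diffs = {}
    for key in sorted(set(mine) | set(released)):
        my_value = mine.get(key, MISSING)
        released_value = released.get(key, MISSING)
        if my_value != released_value:
            diffs[key] = (my_value, released_value)
    return diffs


# Invented settings for illustration only.
my_run = {"optimizer": "Adam", "lr_schedule": "cosine",
          "precision": "bf16", "seed": 0}
released_code = {"optimizer": "AdamW", "lr_schedule": "cosine",
                 "precision": "fp32", "seed": 0, "eval_split": "dev"}

for key, (mine, theirs) in diff_configs(my_run, released_code).items():
    print(f"{key}: mine={mine!r} released={theirs!r}")
```

Settings that appear on only one side are often the most informative, because they are the ones the paper never mentioned.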
The bigger lesson:
The “replication crisis” in ML is real. Many “robust” results don’t replicate cleanly when reproduced from scratch by independent teams. As a reader, treat single-paper claims with appropriate skepticism. Trust the claim more when:
Multiple independent teams have replicated.
Code is released and matches paper claims.
The result has held up over multiple subsequent papers.
The methodology is well-documented.
The papers that survive replication are the ones that go on to be useful. The rest are interesting but don’t change practice.
↳ §27.1 + replication crisis
Next: §27.2 — What frontier-lab work actually looks like day-to-day. The reality of research at OpenAI, Anthropic, Google, etc. — what’s in the papers vs what isn’t.