Case Studies.

Real-world walkthroughs of onboarding production-grade codebases into slop-mop Maintenance Mode. Each one runs sm refit to identify, categorize, and remediate legacy shortcuts, typing gaps, and configuration drift.

Why · refit, then maintenance

slop-mop runs in two modes that do very different jobs. Knowing which one a case study is about is half of reading it honestly.

Refit · one-time	Onboarding. The single goal is a seaworthy ship: a repo with no known issues and a committed baseline. You run it once, and this case study is about that run.
Maintenance · every watch	`sm swab` → `sm scour` → `sm buff`. The steady state, and where the lion’s share of the real code-quality work gets exercised. The buff cycle — turning CI results and review feedback into the next fix — is the throughput multiplier that makes the whole thing pay off.

Why bother with refit at all? Because you cannot swab and scour mechanically every watch and trust the result unless you start from a known baseline. Refit draws that line. After it, everything maintenance flags is new slop — from the change in front of you — not years of accumulated legacy noise. That line is what makes the daily loop fast and worth trusting.

The ideal refit is as frictionless as possible. Fixing obvious slop is not friction — auto-fix handles that for free. The friction worth naming is slop-mop’s own: this is a young tool with limited support, and onboarding a 322k-line codebase surfaced real rough edges in the tool itself. We log those as barnacles rather than hide them.

And “refit didn’t fix much” cuts both ways. It can mean the repo already had strong practices — or that slop-mop disabled gates and lowered thresholds just to go green, which is exactly what we don’t want. This onboarding was a mix, leaning toward baselining (see the honest breakdown below). The upshot is still a good one: OpenHands is now mostly onboarded. Ideally the coverage threshold climbs from 49% over time — but the team can already run the maintenance loop without fighting refit churn.

Case Study #1 · Onboarding OpenHands

OpenHands is a large developer assistant platform. Onboarding its primary repository (roughly 322,000 lines of Python and TypeScript) to slop-mop meant running sm refit to review every quality gate, establish baseline configurations, and remediate the code step by step.

Repository	OpenHands/OpenHands
Codebase Size	~322,000 lines
Language	Python, TypeScript
Initial State	Unvalidated (Onboarding)
Final State	`scour_clean` (Maintenance Mode Active)
Remediation PR	PR #14719
Barnacles filed (against ourselves)	#263, #264

Onboarding Findings & Remediation

Running sm refit --start analyzed the project across all active checks and reported findings in 7 quality gates. Remediation combined targeted refactoring with scoped configuration that locks in the current state of the codebase.

Gate 1 · Deceptiveness: Bogus Tests

“Python test files starting with test_ contained mock fixtures named test_client with no assertions. They were flagged as false-positive passing tests since they asserted no behavior.”

Remediation: Renamed helper fixtures from test_client to client across all unit test suites, and added explicit assertions to helper validation routines.
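To make the finding concrete, here is a minimal sketch of the kind of heuristic that would flag a fixture named test_client. It is an assumption for illustration, not slop-mop's actual gate: the function name find_assertionless_tests and the sample sources are hypothetical, and the check only looks for plain assert statements (it would miss pytest.raises and similar helpers).

```python
import ast


def find_assertionless_tests(source: str) -> list[str]:
    """Return names of test_* functions whose bodies contain no assert.

    Hypothetical heuristic for illustration; the real gate may differ.
    """
    flagged = []
    for node in ast.walk(ast.parse(source)):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name.startswith("test_"):
            has_assert = any(isinstance(child, ast.Assert) for child in ast.walk(node))
            if not has_assert:
                flagged.append(node.name)
    return flagged


# A fixture whose name starts with test_ looks like a test that asserts nothing.
BEFORE = '''
import pytest

@pytest.fixture
def test_client():
    return object()

def test_health(test_client):
    assert test_client is not None
'''

# The rename described above: the fixture no longer matches the test_ prefix.
AFTER = BEFORE.replace("test_client", "client")

print(find_assertionless_tests(BEFORE))
print(find_assertionless_tests(AFTER))
```

The source is only parsed, never imported, so the sketch runs without pytest installed. After the rename, the only test_* function left is one that asserts real behavior.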

Gate 2 · Overconfidence: Missing Annotations

“Strict type checking flagged thousands of missing parameter and return annotations across legacy abstract factory classes.”

Remediation: Added targeted # type: ignore comments on dynamic abstract imports and set strict_typing: false, scoped to the core directories, in .sb_config.json. This cuts the legacy noise while keeping the existing non-strict type checks in force.

Gate 3 · Myopia: Action Hygiene

“A manifest builder job in _build-image.yml called actions/checkout without explicit contents: read permissions.”

Remediation: Patched the workflow's permissions block to grant the token explicit read-only access (contents: read), so the checkout step runs with least privilege.

Gate 4 · Overconfidence: Type Blindness

“Strict pyright checking identified optional member access warnings (potential None attribute access) on serialization models and LiteLLM imports.”

Remediation: Added targeted # type: ignore[reportOptionalMemberAccess] comments on optional fields, simplified import blocks, and resolved the import-untyped findings.

Gate 5 · Myopia: String Duplication & Code Sprawl

“FastAPI query docstrings, large mock databases, and massive legacy service modules (some up to 1,700 lines) tripped length thresholds.”

Remediation: Configured .sb_config.json to exclude test suites, scripts, and frontend directories from sprawl checks, and raised the length thresholds to match the legacy baseline.

Gate 6 · Overconfidence: Coverage Gaps

“Baseline project coverage check failed because the legacy codebase is at 49% coverage, falling short of the default 80% limit. Two sandbox port mapping tests were also failing.”

Remediation: Set threshold: 49 in the configuration to freeze the baseline and prevent future regressions. Corrected the sandbox unit tests to assert string mappings for Docker ports rather than integer ports.
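The corrected tests can be pictured roughly as follows. This is a sketch under stated assumptions: build_port_map, sandbox_workdir, and the port numbers are hypothetical stand-ins, not OpenHands code. It shows only the two ideas the fix relies on, string keys in the port map and a temp directory taken from tempfile.gettempdir() instead of a hardcoded /tmp.

```python
import os
import tempfile


def build_port_map(container_ports: list[int], host_base: int = 30000) -> dict[str, int]:
    """Hypothetical helper: map container ports to host ports, keyed by string."""
    return {str(port): host_base + offset for offset, port in enumerate(container_ports)}


def sandbox_workdir(name: str) -> str:
    """Build a working directory under the platform temp dir, not a hardcoded /tmp."""
    return os.path.join(tempfile.gettempdir(), name)


def test_port_map_uses_string_keys() -> None:
    port_map = build_port_map([8000, 8001])
    # Keys are strings, so an integer lookup must not succeed.
    assert port_map == {"8000": 30000, "8001": 30001}
    assert 8000 not in port_map


def test_workdir_is_under_system_temp() -> None:
    assert sandbox_workdir("sandbox").startswith(tempfile.gettempdir())


if __name__ == "__main__":
    test_port_map_uses_string_keys()
    test_workdir_is_under_system_temp()
```

A test that compared against integer keys would fail here for the right reason: the mapping is correct, and the expectation was wrong.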

What was actually fixed — versus baselined

The findings above use the word “remediation” generously. Here is the honest split of a 400+-file PR whose bulk was reformatting: a small amount of genuine fixing, and a lot of “accept the existing state and guard against regressions from here.” Onboarding a 322k-line legacy codebase is mostly the latter — refit earns a clean starting line, not a rewrite of history — and we won’t dress baselining up as a deep clean.

Genuinely fixed	GitHub Actions least-privilege (`contents: read`); a Docker sandbox test corrected (port-map keys `int` → `str`, and a hardcoded `/tmp` path swapped for `tempfile.gettempdir()`); and duplicate pagination `Query()` titles extracted to shared constants across 11 routers (see the second pass below). A small, real set of changes — not a 322k-line clean.
Suppressed	61 `# type: ignore` / `# noqa` markers added — the type and lint gates were silenced, not satisfied.
Baselined (Python)	Coverage frozen at the existing `49%`; `strict_typing: false` and `strict: false` downgrade the strict-typing and type-blindness gates.
Scoped out	The TypeScript front-end gate suite (formatting, dead-code, type-checking, bogus-tests) and three meta-gates were disabled in the first-pass config. Two of those meta-gates, `gate-dodging` and `silenced-gates` (the gates that flag suppression), were re-enabled in the second pass below. The green board below is the Python core, not the whole repo.

Read the green board that follows with that in mind: it is real, but it is the score for a deliberately scoped, baselined Python surface — not a claim that 322k lines got cleaned.

OpenHands — sm status

$ sm status

🪣 sm status — Project Status Check
🔀 Project: OpenHands
🔧 State: scour_clean

✨ MAINTENANCE MODE ACTIVE · 20/20 checks passed

   ✅ myopia:dependency-risk.py  (passed)
   ✅ myopia:github-actions-hygiene  (passed)
   ✅ myopia:ambiguity-mines.py  (passed)
   ✅ deceptiveness:bogus-tests.py  (passed)
   ✅ laziness:complexity-creep.py  (passed)
   ✅ laziness:dead-code.py  (passed)
   ✅ laziness:debugger-artifacts  (passed)
   ✅ overconfidence:coverage-gaps.py  (passed)
   ✅ overconfidence:missing-annotations.py  (passed)
   ✅ overconfidence:type-blindness.py  (passed)
   ✅ deceptiveness:gate-dodging  (passed)
   ✅ laziness:silenced-gates  (passed)

Executing sm status on OpenHands reports a clean slop-mop board: slop-mop's own gate suite, now with the gate-dodging and silenced-gates meta-gates re-enabled in the second pass. That is a separate scoreboard from OpenHands' native CI. Getting the native CI green took the reconciliation described in the ledger below. As of this writing it is green: the PR passes every check and is blocked only on a maintainer's review.

A second pass · did we leave real slop on the table?

A fair question, so we went back and looked harder for genuine slop hiding behind the baseline. The honest result: very little turned up — and that scarcity is itself the signal.

The Python core is clean. Zero committed debugger artifacts; the lightweight, host-independent gates find essentially nothing to fix.
Two meta-gates went back on. gate-dodging and silenced-gates both pass on this repo, so the baseline config isn't gaming the gates and no Python gate is muted. They are enabled again, asserting exactly that on every run.
One genuine finding, fixed and verified. Tightening string-duplication surfaced the same FastAPI pagination Query() titles copy-pasted ~13× each across 11 routers — real duplication. We installed OpenHands’ toolchain, extracted them to shared constants in the existing paging_utils.py, and confirmed no behavior change (1,172 app-server and OpenAPI-schema tests still pass). With the duplication gone, the gate’s threshold drops from the baselined 20 back to 4 — a real tightening, earned by a real fix, not a config dial.
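The duplication finding and its fix can be sketched as below. Everything here is an illustrative assumption: duplicated_strings is a hypothetical counter rather than slop-mop's gate, the title text and router sources are invented sample data, and the threshold is read as "the maximum number of occurrences tolerated", which may not match the tool's own definition.

```python
import ast
from collections import Counter


def duplicated_strings(sources: dict[str, str], threshold: int, min_length: int = 8) -> dict[str, int]:
    """Count string literals across files and return those seen more than `threshold` times."""
    counts = Counter()
    for text in sources.values():
        for node in ast.walk(ast.parse(text)):
            is_string = isinstance(node, ast.Constant) and isinstance(node.value, str)
            if is_string and len(node.value) >= min_length:
                counts[node.value] += 1
    return {literal: seen for literal, seen in counts.items() if seen > threshold}


# Hypothetical sample data: the same Query() title pasted into every router.
TITLE = "Optional next_page_id from the previous page"
ROUTER_BEFORE = f'def list_items(page_id=Query(None, title="{TITLE}")):\n    return page_id\n'

# After the fix: one shared constant, and routers refer to it by name.
SHARED = f'PAGE_ID_TITLE = "{TITLE}"\n'
ROUTER_AFTER = "def list_items(page_id=Query(None, title=PAGE_ID_TITLE)):\n    return page_id\n"

before = {f"router_{i}.py": ROUTER_BEFORE for i in range(5)}
after = {"paging_utils.py": SHARED, **{f"router_{i}.py": ROUTER_AFTER for i in range(5)}}

print(duplicated_strings(before, threshold=20))  # the loose, baselined setting
print(duplicated_strings(before, threshold=4))   # the tightened setting
print(duplicated_strings(after, threshold=4))    # after extracting the constant
```

The point of the sketch is the order of operations: the loose threshold hides the duplication, the tight one exposes it, and only the extraction makes the tight threshold pass.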

That answers the earlier ambiguity directly: this baseline reflects good existing practice, not slop swept under a rug. OpenHands runs its own eslint, tsc, ruff, and mypy — all green — so most of what slop-mop “downgraded” was either redundant with the host's CI or simply stricter than it. The honest next step belongs to the OpenHands team, not us: raise the 49% coverage floor over time. Onboarding earned the clean starting line; the deep cleaning, where any remains, happens in maintenance.

The honest ledger · where slop-mop got it wrong

We are not selling a green checkmark. The board above is slop-mop's own gates. Getting OpenHands' native CI to pass — its ruff lint and its migration checks — took real work, and slop-mop made genuine mistakes getting there. We hold ourselves to the same standard we ask of our users: when the tool creates friction, file a barnacle. Here are the ones this case study produced, against our own repo.

Misstep · It rewrote immutable migrations

refit ran its formatters over 94 applied database migrations in enterprise/migrations/versions/. Applied migrations are historical records — they must never be reformatted. The change was simply wrong, and it tripped OpenHands' migration-check CI on top of it.

Cost & fix: all 94 files were reverted to upstream. Filed against ourselves as barnacle #263 — refit must exclude migrations/ and other immutable/generated trees from formatting.
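One way to picture the exclusion that barnacle #263 asks for is a path filter applied before any formatter runs. This is a sketch only, not refit's implementation: the function names, the directory list, and the sample file names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical list of directory names treated as immutable or generated.
IMMUTABLE_DIR_NAMES = frozenset({"migrations", "node_modules", "generated"})


def is_formattable(path: str) -> bool:
    """Return False when any parent directory of `path` is an immutable tree."""
    parent_parts = PurePosixPath(path).parts[:-1]  # ignore the file name itself
    return not any(part in IMMUTABLE_DIR_NAMES for part in parent_parts)


def select_for_formatting(paths: list[str]) -> list[str]:
    """Keep only the paths a formatter is allowed to touch, preserving order."""
    return [path for path in paths if is_formattable(path)]


candidates = [
    "enterprise/migrations/versions/001_init.py",  # applied migration: skip
    "app/server.py",                               # ordinary source: format
    "migrations.py",                               # a file, not a directory: format
]
print(select_for_formatting(candidates))
```

Matching on directory names rather than substrings keeps a file such as migrations.py formattable while still protecting everything under a migrations/ tree.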

Wasted work · It double-formatted a repo that already had a formatter

OpenHands already pins ruff (v0.12.5 / v0.4.1). refit reformatted 417 files with its own formatters anyway — and ruff promptly disagreed, demanding 139 fixes and 32 reformats. The work had to be redone with the host's ruff, partially undoing slop-mop's. Net result: churn, spent tokens, and a red lint board until it was reconciled by hand.

Cost & fix: OpenHands' own ruff hooks were run and committed. Filed as barnacle #264 — when a repo pins a formatter, refit should defer to it, not fight it.
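The deferral that barnacle #264 describes could start with a check like the one below. It is a sketch under assumptions: it supposes the pin lives in a pre-commit config that references the ruff-pre-commit hook repository, the function names are hypothetical, and the sample config text is illustrative rather than copied from OpenHands.

```python
import re
from typing import Optional

# Matches a ruff-pre-commit repo entry followed directly by its pinned rev.
_RUFF_PIN = re.compile(r"repo:\s*\S*ruff-pre-commit\S*\s*\n\s*rev:\s*['\"]?([^\s'\"]+)")


def pinned_ruff_rev(precommit_text: str) -> Optional[str]:
    """Return the pinned ruff hook revision, or None when no pin is found."""
    match = _RUFF_PIN.search(precommit_text)
    return match.group(1) if match else None


def choose_formatter(precommit_text: str) -> str:
    """Prefer the host's pinned formatter over the onboarding tool's defaults."""
    rev = pinned_ruff_rev(precommit_text)
    return f"host ruff {rev}" if rev else "refit default formatters"


# Sample pre-commit config text (illustrative).
SAMPLE = """\
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.1
    hooks:
      - id: ruff
      - id: ruff-format
"""

print(choose_formatter(SAMPLE))
print(choose_formatter("repos: []\n"))
```

A regex keeps the sketch free of a YAML dependency. A real implementation would more likely parse the file properly and also look for formatter settings in pyproject.toml.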

Framing · "Maintenance Mode" is not "your CI is green"

A 20/20 slop-mop board means slop-mop's gates pass. It does not mean OpenHands' native pipeline passed. That needed the two fixes above, plus keeping the branch synced with a fast-moving main. Onboarding earns you a baseline; it does not hand you a green host pipeline for free.

Takeaway: the slop-mop board and the host's CI are two separate scoreboards. PR #14719 shows both, warts included.

Lessons learned: Onboarding a large codebase is not about rewriting everything. It is about identifying structural risks (like bogus tests), locking in a baseline (like coverage thresholds), and scoping rules so future PRs are validated without legacy noise.

It is also about being honest when the tool gets it wrong. Two of the findings above were slop-mop's own mistakes, and both became barnacles (#263, #264) — the same friction-reporting loop we ask every user to run. We would rather show you the warts and the fixes than a flawless screenshot.