Case Studies.
Real-world walkthroughs of onboarding production-grade codebases into slop-mop Maintenance Mode: running sm refit to identify, categorize, and remediate legacy shortcuts, typing gaps, and configuration drift.
Why · refit, then maintenance
slop-mop runs in two modes that do very different jobs. Knowing which one a case study is about is half of reading it honestly.
| Refit · one-time | Onboarding. The single goal is a seaworthy ship: a repo with no known issues and a committed baseline. You run it once, and this case study is about that run. |
|---|---|
| Maintenance · every watch | sm swab → sm scour → sm buff. The steady state, and where the lion’s share of the real code-quality work gets exercised. The buff cycle (turning CI results and review feedback into the next fix) is the throughput multiplier that makes the whole thing pay off. |
Why bother with refit at all? Because you cannot swab and scour mechanically every watch and trust the result unless you start from a known baseline. Refit draws that line. After it, everything maintenance flags is new slop, from the change in front of you, not years of accumulated legacy noise. That line is what makes the daily loop fast and worth trusting.
The ideal refit is as frictionless as possible. Fixing obvious slop is not friction; auto-fix handles that for free. The friction worth naming is slop-mop’s own: this is a young tool with limited support, and onboarding a 322k-line codebase surfaced real rough edges in the tool itself. We log those as barnacles rather than hide them.
And “refit didn’t fix much” cuts both ways. It can mean the repo already had strong practices, or that slop-mop disabled gates and lowered thresholds just to go green, which is exactly what we don’t want. This onboarding was a mix, leaning toward baselining (see the honest breakdown below). The upshot is still a good one: OpenHands is now mostly onboarded. Ideally the coverage threshold climbs from 49% over time, but the team can already run the maintenance loop without fighting refit churn.
Case Study #1 · Onboarding OpenHands
OpenHands is a massive developer assistant platform. Onboarding its primary repository (300k+ lines of Python) to slop-mop required running sm refit to review all quality gates, establish baseline configurations, and execute step-by-step code remediation.
| Repository | OpenHands/OpenHands |
|---|---|
| Codebase Size | ~322,000 lines |
| Language | Python, TypeScript |
| Initial State | Unvalidated (Onboarding) |
| Final State | scour_clean (Maintenance Mode Active) |
| Remediation PR | PR #14719 |
| Barnacles filed (against ourselves) | #263, #264 |
Onboarding Findings & Remediation
Running sm refit --start analyzed the project across all active checks and identified findings in 7 different quality gates. The remediation path combined targeted refactoring with precise scoping configuration to lock down the codebase.
- Gate 1 · Deceptiveness: Bogus Tests

  “Python test files starting with `test_` contained mock fixtures named `test_client` with no assertions. They were flagged as false-positive passing tests since they asserted no behavior.”

  Remediation: Renamed helper fixtures from `test_client` to `client` across all unit test suites, and added explicit assertions to helper validation routines.

- Gate 2 · Overconfidence: Missing Annotations

  “Strict type checking flagged thousands of missing parameter and return annotations across legacy abstract factory classes.”

  Remediation: Added targeted `# type: ignore` annotations for dynamic abstract imports and configured `strict_typing: false` scoped to core directories in `.sb_config.json` to prevent noise while preserving main type safety.

- Gate 3 · Myopia: Action Hygiene

  “A manifest builder job in `_build-image.yml` called `actions/checkout` without explicit `contents: read` permissions.”

  Remediation: Patched the GitHub Actions workflow permission blocks to explicitly grant read-only token permissions, securing the action checkout loop.

- Gate 4 · Overconfidence: Type Blindness

  “Strict pyright checking identified optional member access warnings (potential `None` attribute access) on serialization models and Litellm imports.”

  Remediation: Added targeted `# type: ignore[reportOptionalMemberAccess]` comments on optional fields, simplified import blocks, and resolved import-untyped risks.

- Gate 5 · Myopia: String Duplication & Code Sprawl

  “FastAPI query docstrings, large mock databases, and massive legacy service modules (some up to 1,700 lines) tripped length thresholds.”

  Remediation: Configured `.sb_config.json` to exclude test suites, scripts, and frontend directories from sprawl checks, and raised thresholds to lock in the legacy baseline length parameters.

- Gate 6 · Overconfidence: Coverage Gaps

  “Baseline project coverage check failed because the legacy codebase is at 49% coverage, falling short of the default 80% limit. Two sandbox port mapping tests were also failing.”

  Remediation: Set `threshold: 49` in the configuration to freeze the baseline and prevent future regressions. Corrected the sandbox unit tests to assert string mappings for Docker ports rather than integer ports.
What was actually fixed, versus baselined
The findings above use the word “remediation” generously. Here is the honest split of a 400+-file PR whose bulk was reformatting: a small amount of genuine fixing, and a lot of “accept the existing state and guard against regressions from here.” Onboarding a 322k-line legacy codebase is mostly the latter (refit earns a clean starting line, not a rewrite of history), and we won’t dress baselining up as a deep clean.
| Genuinely fixed | GitHub Actions least-privilege (contents: read); a Docker sandbox test corrected (port-map keys int → str, and a hardcoded /tmp path swapped for tempfile.gettempdir()); and duplicate pagination Query() titles extracted to shared constants across 11 routers (see the second pass below). A small, real set of changes, not a 322k-line clean. |
|---|---|
| Suppressed | 61 # type: ignore / # noqa markers added: the type and lint gates were silenced, not satisfied. |
| Baselined (Python) | Coverage frozen at the existing 49%; strict_typing: false and strict: false downgrade the strict-typing and type-blindness gates. |
| Scoped out | The TypeScript front-end gate suite (formatting, dead-code, type-checking, bogus-tests) and three meta-gates (including gate-dodging and silenced-gates, the gates that flag suppression) are disabled in the committed config. The green board below is the Python core, not the whole repo. |
Read the green board that follows with that in mind: it is real, but it is the score for a deliberately scoped, baselined Python surface, not a claim that 322k lines got cleaned.
    $ sm status
    🪣 sm status - Project Status Check
    Project: OpenHands
    State: scour_clean
    ✨ MAINTENANCE MODE ACTIVE · 20/20 checks passed
    ✓ myopia:dependency-risk.py (passed)
    ✓ myopia:github-actions-hygiene (passed)
    ✓ myopia:ambiguity-mines.py (passed)
    ✓ deceptiveness:bogus-tests.py (passed)
    ✓ laziness:complexity-creep.py (passed)
    ✓ laziness:dead-code.py (passed)
    ✓ laziness:debugger-artifacts (passed)
    ✓ overconfidence:coverage-gaps.py (passed)
    ✓ overconfidence:missing-annotations.py (passed)
    ✓ overconfidence:type-blindness.py (passed)
    ✓ deceptiveness:gate-dodging (passed)
    ✓ laziness:silenced-gates (passed)
Executing sm status on OpenHands reports a clean slop-mop board: slop-mop's own gate suite, now with the gate-dodging and silenced-gates meta-gates re-enabled in the second pass. That is a separate scoreboard from OpenHands' native CI. Getting the native CI green took the reconciliation described in the ledger below; as of this writing it is green, with the PR passing every check and blocked only on a maintainer's review.
A second pass · did we leave real slop on the table?
A fair question, so we went back and looked harder for genuine slop hiding behind the baseline. The honest result: very little turned up, and that scarcity is itself the signal.
- The Python core is clean. Zero committed debugger artifacts; the lightweight, host-independent gates find essentially nothing to fix.
- Two meta-gates went back on. `gate-dodging` and `silenced-gates` both pass on this repo, so the baseline config isn't gaming the gates and no Python gate is muted. They are enabled again, asserting exactly that on every run.
- One genuine finding, fixed and verified. Tightening `string-duplication` surfaced the same FastAPI pagination `Query()` titles copy-pasted ~13× each across 11 routers: real duplication. We installed OpenHands’ toolchain, extracted them to shared constants in the existing `paging_utils.py`, and confirmed no behavior change (1,172 app-server and OpenAPI-schema tests still pass). With the duplication gone, the gate’s threshold drops from the baselined `20` back to `4`: a real tightening, earned by a real fix, not a config dial.
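The string-duplication finding can be illustrated with a small sketch that counts repeated string literals in source code. This is a hedged approximation, not the gate's real logic: the function `duplicated_strings`, the `min_length` cutoff, the "more than `threshold` occurrences" rule, and the sample title text are all assumptions made for the example.

```python
import ast
from collections import Counter


def duplicated_strings(source: str, threshold: int = 4, min_length: int = 8) -> dict[str, int]:
    """Map each string literal that occurs more than `threshold` times to its count."""
    counts = Counter(
        node.value
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and len(node.value) >= min_length
    )
    return {text: n for text, n in counts.items() if n > threshold}


TITLE = "Optional next_page_id"  # hypothetical title text, for illustration only

# Before: the same title literal is pasted into every route signature.
BEFORE = "\n".join(
    f'def route_{i}(page_id=Query(title="{TITLE}")): ...' for i in range(5)
)

# After: the literal lives in one shared constant that every route references.
AFTER = f'PAGE_ID_TITLE = "{TITLE}"\n' + "\n".join(
    f"def route_{i}(page_id=Query(title=PAGE_ID_TITLE)): ..." for i in range(5)
)

print(duplicated_strings(BEFORE))  # {'Optional next_page_id': 5}
print(duplicated_strings(AFTER))   # {}
```

Extracting the literal to one constant is what lets the threshold come back down: the count falls because the duplication is gone, not because the limit was raised.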
That answers the earlier ambiguity directly: this baseline reflects good existing practice, not slop swept under a rug. OpenHands runs its own eslint, tsc, ruff, and mypy (all green), so most of what slop-mop “downgraded” was either redundant with the host's CI or simply stricter than it. The honest next step belongs to the OpenHands team, not us: raise the 49% coverage floor over time. Onboarding earned the clean starting line; the deep cleaning, where any remains, happens in maintenance.
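Raising a frozen coverage floor over time is often done as a ratchet: the floor blocks regressions and only ever moves up. The sketch below shows that idea under assumptions; `check_coverage` and `ratchet_floor` are hypothetical helpers, and nothing here is a slop-mop command or config key.

```python
import math


def check_coverage(measured: float, floor: float) -> bool:
    """Pass only when measured coverage has not dropped below the frozen floor."""
    return measured >= floor


def ratchet_floor(measured: float, floor: int) -> int:
    """Raise the floor to the measured value (rounded down); never lower it."""
    return max(floor, math.floor(measured))


floor = 49  # the baseline frozen during onboarding
print(check_coverage(48.6, floor))  # False: a regression below the baseline fails
print(check_coverage(53.7, floor))  # True
print(ratchet_floor(53.7, floor))   # 53: the floor climbs with real coverage
print(ratchet_floor(47.2, floor))   # 49: a dip never lowers the floor
```

Rounding down keeps the new floor from exceeding what the suite actually measured, so the next run does not fail on a fraction of a percent.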
The honest ledger · where slop-mop got it wrong
We are not selling a green checkmark. The board above is slop-mop's own gates. Getting OpenHands' native CI to pass (its ruff lint and its migration checks) took real work, and slop-mop made genuine mistakes getting there. We hold ourselves to the same standard we ask of our users: when the tool creates friction, file a barnacle. Here are the ones this case study produced, against our own repo.
- Misstep · It rewrote immutable migrations

  refit ran its formatters over 94 applied database migrations in `enterprise/migrations/versions/`. Applied migrations are historical records: they must never be reformatted. The change was simply wrong, and it tripped OpenHands' migration-check CI on top of it.

  Cost & fix: all 94 files were reverted to upstream. Filed against ourselves as barnacle #263: refit must exclude `migrations/` and other immutable/generated trees from formatting.

- Wasted work · It double-formatted a repo that already had a formatter

  OpenHands already pins `ruff` (v0.12.5 / v0.4.1). refit reformatted 417 files with its own formatters anyway, and ruff promptly disagreed, demanding 139 fixes and 32 reformats. The work had to be redone with the host's ruff, partially undoing slop-mop's. Net result: churn, spent tokens, and a red lint board until it was reconciled by hand.

  Cost & fix: OpenHands' own ruff hooks were run and committed. Filed as barnacle #264: when a repo pins a formatter, refit should defer to it, not fight it.

- Framing · "Maintenance Mode" is not "your CI is green"

  An 18/18 slop-mop board means slop-mop's gates pass. It does not mean OpenHands' native pipeline passed; that needed the two fixes above, plus keeping the branch synced with a fast-moving `main`. Onboarding earns you a baseline; it does not hand you a green host pipeline for free.

  Takeaway: the slop-mop board and the host's CI are two separate scoreboards. PR #14719 shows both, warts included.
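The fix requested in barnacle #263 amounts to filtering immutable trees out before any formatter runs. A minimal sketch of that filter follows. It assumes a hypothetical exclusion set and helper names (`IMMUTABLE_DIRS`, `is_immutable`, `files_to_format`), and the file names in the sample list are invented; only the `enterprise/migrations/versions/` directory comes from the case study.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion set; not slop-mop's real defaults or config keys.
IMMUTABLE_DIRS = {"migrations", "node_modules", "__generated__"}


def is_immutable(path: str) -> bool:
    """True when any directory component of the path is an immutable or generated tree."""
    return any(part in IMMUTABLE_DIRS for part in PurePosixPath(path).parent.parts)


def files_to_format(paths: list[str]) -> list[str]:
    """Keep only the files a formatter is allowed to rewrite."""
    return [p for p in paths if not is_immutable(p)]


# Invented file names for illustration.
candidates = [
    "enterprise/migrations/versions/001_init.py",
    "openhands/server/app.py",
]
print(files_to_format(candidates))  # ['openhands/server/app.py']
```

Matching on directory components rather than substrings avoids excluding an ordinary module that merely has "migrations" in its file name.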
Lessons learned: Onboarding a large codebase is not about rewriting everything. It is about identifying structural risks (like bogus tests), locking in a baseline (like coverage thresholds), and scoping rules so future PRs are validated without legacy noise.
It is also about being honest when the tool gets it wrong. Two of the findings above were slop-mop's own mistakes, and both became barnacles (#263, #264), the same friction-reporting loop we ask every user to run. We would rather show you the warts and the fixes than a flawless screenshot.