The CRM hygiene cron that runs nightly.

Dedupe contacts. Normalize emails. Fix country codes. Archive stale leads. Idempotent and resumable. Runs at 3am, finishes by 4am.

9 min read

The pain

Why your CRM gets messy

Six months in, HubSpot has 50,000 contacts. 3,000 are duplicates. 800 have invalid email formats. 1,200 have country codes that don't match an ISO standard. 12,000 are leads that haven't responded in 18 months.

The four reasons hygiene scripts get abandoned:

They take too long. A naive script processes 50k contacts serially. It runs for 2 hours, hits a rate limit at hour 1.5, and dies.
They're not resumable. When the script dies, you start over. You re-process the first 30,000 contacts every time.
They corrupt data. A bug in the dedup logic merges two real customers because the email matched. The audit trail is missing. Recovery requires support tickets.
They're scary to run. Engineer wrote it once, ran it once, never ran it again because the cost of a mistake is too high.

The architecture

What we're building

A nightly cron that processes the CRM in batches, with full audit logging and dry-run support.

Batched. Process 100 contacts per step. Memoize after each batch. A failure mid-job resumes from the last completed batch.
Idempotent. Running the cron twice in one night is safe. Re-normalizing an already-normalized email is a no-op. Re-archiving an already-archived lead is a no-op.
Audit logged. Every change writes a row to a crm_audit table. If hygiene merges the wrong contacts, you can see exactly what happened and reverse it.
Dry-run by default. First run mode is --dry-run. The job logs what it would change without making the change. Engineer reviews. Then runs for real.

workflow({
  id: "crm-hygiene-nightly",
  trigger: { schedule: "0 3 * * *" },     // every night at 3am
  concurrency: { limit: 1 },               // never run two at once
  steps: async ({ step, logger }) => {
    const dryRun = process.env.HYGIENE_DRY_RUN === "true";

    // Phase 1: dedupe
    await runInBatches(step, "dedupe", () => findDuplicates(), 100, async (batch) => {
      for (const dup of batch) {
        await step.run(`dedupe-${dup.merged_into}-${dup.duplicate_id}`, () =>
          dryRun
            ? logger.info("would merge", dup)
            : mergeContacts(dup.duplicate_id, dup.merged_into)
        );
      }
    });

    // Phase 2: normalize emails
    await runInBatches(step, "normalize-emails", () => findUnnormalizedEmails(), 100, async (batch) => {
      for (const c of batch) {
        await step.run(`norm-email-${c.id}`, () => normalizeEmail(c, dryRun));
      }
    });

    // Phase 3: fix country codes
    await runInBatches(step, "fix-countries", () => findBadCountryCodes(), 100, async (batch) => {
      for (const c of batch) {
        await step.run(`fix-country-${c.id}`, () => fixCountryCode(c, dryRun));
      }
    });

    // Phase 4: archive stale leads
    await runInBatches(step, "archive-stale", () => findStaleLeads(), 100, async (batch) => {
      for (const c of batch) {
        await step.run(`archive-${c.id}`, () => archiveLead(c, dryRun));
      }
    });
  },
});

The dedup logic

Get this right or do nothing

Dedup is the highest-stakes phase. A wrong merge corrupts customer data permanently. The rule: merge only when at least two high-confidence signals match.

function findDuplicates(): Promise<Array<{merged_into: string, duplicate_id: string}>> {
  return db.query(`
    SELECT
      a.id AS duplicate_id,
      b.id AS merged_into
    FROM contacts a, contacts b
    WHERE a.id < b.id
      AND (
        -- Two of three must match
        (LOWER(a.email) = LOWER(b.email)) +
        (LOWER(a.phone) = LOWER(b.phone) AND a.phone IS NOT NULL) +
        (
          LOWER(a.first_name) = LOWER(b.first_name)
          AND LOWER(a.last_name) = LOWER(b.last_name)
          AND a.company_id = b.company_id
        )
      ) >= 2
      AND b.created_at < a.created_at  -- keep the older one
  `);
}

Email alone is not enough. Phone alone is not enough. Two of three identity signals must agree, and the older record wins. The audit row records both before and after states so you can reverse a merge if needed.

Edge cases

What goes wrong, and how to handle it

HubSpot rate limit at hour 1.5. Each batch is a step. Rotor backs off and retries the failed batch. Successful batches are memoized — they don't re-run.

Enrichment vendor changes their API mid-run. Wrap the third-party call in a step that catches schema errors and writes them to a quarantine table. Don't fail the workflow.

Wrong merge happens. The audit table has the before-state for every change. Run a reverse cron from a Linear ticket — read the audit row, re-create the merged-out record, un-link.

Hygiene starts archiving active leads. The "stale" definition needs at least three signals: no email open in 18mo, no form submission in 12mo, no Salesforce activity in 12mo. If any are missing, alert and skip — don't archive.

The math

What this costs on Rotor

50,000 contacts processed nightly. ~2,000 actual changes per night across all four phases. Each change = 1 step-run. ~2,000 step-runs/night = 60,000/month.

That fits Rotor Pro ($99/mo, 100k included). On alternatives:

Building it on a cron + script: free in compute, but every failure becomes your problem at 3am. The audit table and dry-run mode are nice-to-haves you'll never write.
Zapier: 60k tasks/month is in the $200-400/mo range. The batching and resumability are not native primitives.
n8n self-hosted: free in licensing, but you maintain the queue, the state, and the recovery logic.

Fork this playbook on Rotor.

$9 to start. 30-day money back. Hard caps protect you from runaway bills.

Start shipping