Try the live demo →

How I replaced a vendor moderation task with an LLM-powered pipeline, cutting costs by 99.4%.


Background

Since 2019, my team had relied on a vendor moderation task to help measure location extraction accuracy. Human reviewers examined job postings and verified whether the extracted location was correct. This data fed into precision metrics tracking our geocoding quality over time.

The task answered a few key questions:

  • How accurate is the location data we’re collecting from employers?
  • Are we capturing the most precise location possible (street address vs. city)?
  • What percentage of jobs are single-location vs. remote or multi-location?

The limit: budget constraints capped the task at ~$5,000/month, restricting the volume of data we could collect and our market coverage. We could only afford to process jobs in three markets, and even then at a reduced scale.

With LLM capabilities maturing, this seemed like a good opportunity to transform our vendor task.

I think I was mostly excited because, rather than adding “AI” for its own sake, this was a real case where recent improvements in LLM reliability (model pinning, temperature and frequency-penalty settings that make responses more deterministic and stable), combined with lower costs, made it pragmatic to test the new tech and significantly improve our process.


Results

Metric            Before (Vendor)      After (AI Pipeline)
Monthly Cost      ~$5,000              ~$30
Cost Reduction                         99.4%
Annual Savings                         ~$89,000
Daily Volume      Limited by budget    4,200+ jobs/day
Markets           3 (constrained)      3 (expandable)

The pipeline uses GPT-4o-mini at $0.0002 per job — roughly $1 per 5,000 jobs processed.


Accuracy Validation

Before replacing the vendor process, I needed to prove the model could match our vendor’s accuracy. I ran the pipeline against historical data where we had “ground truth” from the vendor labels:

Market    AI Accuracy    Sample Size
US        95.0%          843 jobs
CA        98.0%          1,550 jobs
GB        96.9%          2,188 jobs

These numbers matched or exceeded the vendor’s calibrated accuracy of ~95%. It took a few attempts at crafting a strong enough prompt to handle edge cases, but the model performed surprisingly well. Even the earliest iterations were >80% accurate out of the gate.
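The validation itself is simple to reproduce: compare the AI output to the vendor label per market and count matches. A minimal sketch, assuming illustrative record fields (`country`, `ai_location`, `vendor_location`) rather than the actual schema:

```python
# Sketch: per-market accuracy of AI extractions vs. historical vendor labels.
# Field names are illustrative, not the real warehouse schema.
from collections import defaultdict

def accuracy_by_market(records):
    """Return {country: (accuracy, sample_size)} comparing AI vs. vendor labels."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["country"]] += 1
        # Case/whitespace-insensitive comparison of the two labels.
        if rec["ai_location"].strip().casefold() == rec["vendor_location"].strip().casefold():
            matches[rec["country"]] += 1
    return {c: (matches[c] / totals[c], totals[c]) for c in totals}

records = [
    {"country": "US", "ai_location": "Austin, TX", "vendor_location": "Austin, TX"},
    {"country": "US", "ai_location": "Boston, MA", "vendor_location": "Cambridge, MA"},
    {"country": "CA", "ai_location": "Toronto, ON", "vendor_location": "toronto, on"},
]
print(accuracy_by_market(records))
# {'US': (0.5, 2), 'CA': (1.0, 1)}
```

In practice you would also want to track *why* mismatches occur (granularity differences vs. outright wrong locations), since that drives the prompt iteration described below.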


Pipeline Architecture

I designed the system as a library + scheduled job pattern, separating reusable logic from deployment infrastructure:

Component             Purpose
Pipeline Library      Reusable extractors, classifiers, validation logic
Production Cronjob    Deployment wrapper with infrastructure integration

This separation enables:

  1. Reusability — The library can be imported by other projects or used for local testing
  2. Independent versioning — Library updates can be validated before production deployment
  3. Clean CI/CD — Each component has focused build/deploy pipelines
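The cronjob side of the pattern stays thin: it wires configuration and scheduling around a single library call. A minimal sketch of such an entrypoint, with hypothetical module and function names (`location_pipeline`, `run_extraction`) standing in for the real library:

```python
# Sketch of a thin cronjob entrypoint around the reusable library.
# Names are hypothetical; the point is that all extraction logic lives
# in the imported library, not in the deployment wrapper.
import logging
from datetime import date, timedelta

# In the real setup this would be: from location_pipeline import run_extraction
def run_extraction(country: str, start_date: date) -> int:
    """Stand-in for the library call; returns the number of jobs processed."""
    logging.info("extracting %s since %s", country, start_date)
    return 0

def main() -> None:
    logging.basicConfig(level=logging.INFO)
    start = date.today() - timedelta(days=1)
    for country in ("US", "CA", "GB"):
        processed = run_extraction(country, start)
        logging.info("%s: %d jobs", country, processed)

if __name__ == "__main__":
    main()
```

Because the wrapper only handles scheduling and logging, swapping the infrastructure (cron, Airflow, a container job) never touches the extraction code.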

Data Flow

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Data Warehouse │ ──▶ │  LLM Extraction  │ ──▶ │   Classification    │
│  (Job Sample)   │     │  (GPT-4o-mini)   │     │  (Remote/Multi-loc) │
└─────────────────┘     └──────────────────┘     └─────────────────────┘
                                                           │
                                                           ▼
                                                 ┌─────────────────────┐
                                                 │   Analytics Index   │
                                                 │   (47 fields)       │
                                                 └─────────────────────┘

Stage 1: Data Ingestion

The pipeline samples jobs from our data warehouse, joining job metadata with full descriptions:

SELECT 
    job_id,
    feed_id,
    job_title,
    company,
    city,
    state,
    postal_code,
    job_description
FROM job_data
WHERE country = '{country}'
  AND created_date >= '{start_date}'
  AND feed_id NOT IN ({excluded_feeds})
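The `{country}`-style placeholders above suggest string formatting, but in the actual pipeline code it is safer to bind parameters through the driver. A sketch of that approach, assuming a DB-API-style client (the table and parameter shapes mirror the query above):

```python
# Sketch: building the sampling query with bound parameters instead of
# raw string formatting. The warehouse client is an assumption; adapt
# the placeholder style (%(name)s here) to your driver.
QUERY = """
SELECT job_id, feed_id, job_title, company,
       city, state, postal_code, job_description
FROM job_data
WHERE country = %(country)s
  AND created_date >= %(start_date)s
  AND feed_id NOT IN %(excluded_feeds)s
"""

def build_sample_query(country, start_date, excluded_feeds):
    """Return (sql, params) ready for a DB-API cursor.execute call."""
    params = {
        "country": country,
        "start_date": start_date,
        "excluded_feeds": tuple(excluded_feeds),
    }
    return QUERY, params

sql, params = build_sample_query("US", "2024-01-01", [12, 34])
```

Parameter binding avoids both injection risk and quoting bugs when feed exclusion lists grow.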

Stage 2: LLM Extraction

Each job description is sent to GPT-4o-mini with a structured prompt requesting:

  • Primary work location (city, state, postal code)
  • Location confidence score
  • Whether location appears explicitly in the job text

The prompt engineering required some iteration. Early versions struggled with:

  • Jobs listing multiple office locations
  • Remote positions with residency requirements
  • HR locations

I refined the prompts based on error analysis until accuracy stabilized above 95%. I also tracked the prompts themselves in git so we can test prompt variations over time if needed.

Example Prompt

You are an information extraction assistant. Extract the most specific primary work location explicitly mentioned in the job posting. Only copy text that exists in the context; do not infer, normalize, or reformat.

Context:
{{INSERT_CONTEXT}}

Question:
{{INSERT_QUESTION}}

Extraction rules (strict):
- Prefer the most specific location by this order:
  1) full street address including suite/building and postal code
  2) city, state, postal code
  3) city, state
  4) city only
  5) state only
  6) country only
- If a full street address appears anywhere in the posting, return that entire address (including suite/floor/building, punctuation, abbreviations, and ZIP/postal code) exactly as written.
  - When the location line uses a label + dash pattern (e.g., "Location: [site name] - [address] (qualifier)"), return only the address after the dash and exclude any trailing parenthetical qualifiers like "(Hybrid)".
  - When a street address is present, capture the contiguous span from the street number through the postal code, inclusive. Do not drop leading street name/number segments or suite/building designators.
  - If a venue/site/facility/organization name appears adjacent to an address (e.g., "[Venue] - 123 Main St..." or "[Venue], 123 Main St..."), exclude the venue name and return only the address portion beginning at the first street number.
  - Treat separators such as "-", "–", "—", ":", and "," as potential dividers between a venue name and the address; always capture from the first street number through the postal code.
  - Terminate the span at the end of the postal/ZIP code; if a country name immediately follows, include the country name, but exclude any parenthetical country codes (e.g., "(US)").
  - Do not consider venue-only or unit-only mentions without a street name (e.g. "#5 Woodfield Shopping Center") as a full street address; when no street name is present, fall back to the next granularity (e.g., city, state, postal).
  - Prefer locations explicitly labeled with "Location:", "Work Location", "Primary Location", "Office", "Site" near the role header over addresses in footer/contact sections.
  - If multiple candidates exist at the same granularity, choose the one most clearly tied to the job's work location (by label or proximity to the job title/overview).
  - Do not invent missing parts. If no higher-granularity location exists, fall back to the next level in the order above.
  - Exclude emails, phone numbers, URLs, and HR or corporate mailing blocks not tied to the job location.

Response format (JSON only):
- Return JSON only (no prose, no markdown).
- Use exactly this object and do not add extra keys:
{
  "explanation": string,
  "granularity": "full_street" | "city_state_postal" | "city_state" | "city" | "state" | "country" | "none",
  "answer": string
}
- If no location is present, set "granularity" to "none" and "answer" to "UNKNOWN".
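Since the model is instructed to return JSON only, the pipeline still needs to validate the response before trusting it. A sketch of the call-and-parse step; the validation mirrors the response format above, while the commented-out API call is a plausible shape using the OpenAI client, not the exact production code:

```python
# Sketch: validating the model's JSON response against the schema above,
# falling back to UNKNOWN on any malformed output.
import json

VALID_GRANULARITIES = {
    "full_street", "city_state_postal", "city_state",
    "city", "state", "country", "none",
}

def parse_extraction(raw: str) -> dict:
    """Parse and validate the model response; never raise on bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"explanation": "unparseable", "granularity": "none", "answer": "UNKNOWN"}
    if data.get("granularity") not in VALID_GRANULARITIES or "answer" not in data:
        return {"explanation": "invalid schema", "granularity": "none", "answer": "UNKNOWN"}
    return data

# Illustrative API call (requires the openai package and an API key):
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     temperature=0,
#     response_format={"type": "json_object"},
#     messages=[{"role": "user", "content": prompt}],
# )
# result = parse_extraction(resp.choices[0].message.content)
```

Defensive parsing matters at this volume: even a small rate of malformed responses would otherwise break a 4,200-job/day batch.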

Stage 3: Classification

Beyond location extraction, the pipeline classifies jobs into categories:

Remote/Hybrid/Onsite Classification

  • 1,487 pattern-matching rules for common remote indicators
  • A custom TF-IDF model for ambiguous cases
  • Confidence scoring for downstream filtering

Non-Single Location Detection

  • Identifies transient jobs (delivery, trucking, sales territories)
  • Flags multi-site positions where “location” is ambiguous

These were past side projects of mine that I incorporated into the pipeline to enrich the final data product.
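The rules-first, model-fallback shape of the remote/hybrid/onsite classifier can be sketched as follows. The patterns shown are a tiny illustrative subset of the ~1,500 production rules, and the TF-IDF fallback is stubbed out:

```python
# Sketch: pattern rules handle clear cases at high confidence; ambiguous
# text would fall through to the TF-IDF model (stubbed here).
import re

REMOTE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bfully remote\b", r"\bwork from home\b", r"\bremote[- ]first\b")
]
ONSITE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bon[- ]site\b", r"\bin[- ]office\b")
]

def classify_work_mode(text: str) -> tuple[str, float]:
    """Return (label, confidence); rule hits are high-confidence."""
    if any(p.search(text) for p in REMOTE_PATTERNS):
        return "remote", 0.95
    if any(p.search(text) for p in ONSITE_PATTERNS):
        return "onsite", 0.95
    # Ambiguous: in production this path calls the TF-IDF model instead.
    return "onsite", 0.5

print(classify_work_mode("This role is fully remote within the US."))
# ('remote', 0.95)
```

The confidence score travels with the label so downstream consumers can filter out low-confidence classifications.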

Stage 4: Index Upload

Results are uploaded to an analytics index with 47 documented fields, enabling ad-hoc queries and monitoring for the team and stakeholders:

-- What percentage of jobs have precise location data?
SELECT 
    country,
    location_granularity,
    COUNT(*) as jobs,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY country) as pct
FROM ai_location_match
WHERE process_date >= CURRENT_DATE - 30
GROUP BY country, location_granularity

Cost Analysis Methodology

To help justify the project, I ran the numbers:

Vendor Model (Before)

  • Per-task pricing from our moderation vendor
  • Our volume capped by monthly budget allocation
  • Additional costs/timesink for quality training and syncs

AI Model (After)

  • LLM API cost: $0.15 per 1M input tokens, $0.60 per 1M output tokens
  • Average job description: ~800 tokens
  • Average extraction: ~200 tokens
  • Per-job cost: $0.0002

At 4,200 jobs/day × 30 days = 126,000 jobs/month:

  • Vendor cost: ~$5,000
  • AI cost: ~$25-30
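The per-job and monthly figures above fall out directly from the token prices:

```python
# Reproducing the cost estimate from the GPT-4o-mini token prices above.
INPUT_PRICE = 0.15 / 1_000_000   # $ per input token
OUTPUT_PRICE = 0.60 / 1_000_000  # $ per output token

per_job = 800 * INPUT_PRICE + 200 * OUTPUT_PRICE   # ~800 in, ~200 out
monthly = per_job * 4_200 * 30                     # 126,000 jobs/month

print(f"per job: ${per_job:.4f}")    # per job: $0.0002
print(f"monthly: ${monthly:.2f}")    # monthly: $30.24
```

The estimate deliberately ignores retries and failed calls, which is why the stated range is ~$25-30 rather than a single figure.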

The 99.4% cost reduction meant the project paid for development time within the first month of operation.


Lessons Learned

Start with accuracy benchmarks. Having historical vendor data to compare against made model validation straightforward. Without it, proving model accuracy would have required building a golden dataset from scratch, and that validation would still have been essential to producing reliable output.

Iterate on prompts. The first prompt hit about 80% accuracy. Getting to 95%+ required analyzing the edge cases where the model tripped up and refining the instructions with few-shot examples to draw from.

Separate library from deployment runner. The library + cronjob pattern simplified testing and enabled other teams to reuse the extraction logic for their own use cases. I wasn’t sure if this was overkill at first, but the library is already being used by other projects.

Model the costs up front. Showing ~$90K in annual savings made leadership approval easier.

Design for self-serve analytics. The 47-field output schema was designed around the questions stakeholders actually asked of our prior task. The index now powers a dashboard that helps us monitor location extraction accuracy.


Takeaways

Overall this project was a big win for our team. It demonstrated what’s possible when you incorporate LLMs intelligently into problem spaces where they excel. We achieved real cost savings and majorly upgraded one of our most important data products.

I definitely learned a lot from this work, and I’m excited to already be leveraging this workflow for another project: sentiment analysis of location feedback at scale for our Product teams.