Platform Engineer, MLOps
Company: writer.com
Location: San Francisco
Posted on: June 1, 2025
Job Description:
About Writer
Writer is the full-stack generative AI platform
delivering transformative ROI for the world's leading enterprises.
Named one of the top 50 companies in AI by Forbes and one of the
best places to work by Inc. Magazine, Writer empowers hundreds of
customers like Accenture, Intuit, L'Oreal, Mars, Salesforce, and
Vanguard to transform the way they work.
Writer's fully integrated
solution makes it easy to deploy secure and reliable AI
applications and agents that solve mission-critical business
challenges. Our suite of development tools is powered by Palmyra -
Writer's state-of-the-art family of LLMs - alongside our
industry-leading graph-based RAG and customizable AI
guardrails.
Founded in 2020 with office hubs in San Francisco, New
York City, Austin, Chicago, and London, our team of over 250
employees thinks big and moves fast, and we're looking for smart,
hardworking builders and scalers to join us on our journey to
create a better future of work.
About this role
As a Platform
Engineer, MLOps, you will be critical to deploying and managing
cutting-edge infrastructure crucial for AI/ML operations, and you
will collaborate with AI/ML engineers and researchers to develop a
robust CI/CD pipeline that supports safe and reproducible
experiments. Your expertise will also extend to setting up and
maintaining monitoring, logging, and alerting systems to oversee
extensive training runs and client-facing APIs. You will ensure
that training environments are optimally available and efficiently
managed across multiple clusters, enhancing our containerization
and orchestration systems with advanced tools like Docker and
Kubernetes.
This role demands a proactive approach to maintaining
large Kubernetes clusters, optimizing system performance, and
providing operational support for our suite of software solutions.
If you are driven by challenges and motivated by the continuous
pursuit of innovation, this role offers the opportunity to make a
significant impact in a dynamic, fast-paced environment.
Your responsibilities:
- Work closely with AI/ML engineers and researchers to design and
deploy a CI/CD pipeline that ensures safe and reproducible
experiments.
- Set up and manage monitoring, logging, and alerting systems for
extensive training runs and client-facing APIs.
- Ensure training environments are consistently available and
prepared across multiple clusters.
- Develop and manage containerization and orchestration systems
utilizing tools such as Docker and Kubernetes.
- Operate and oversee large Kubernetes clusters with GPU
workloads.
- Improve reliability, quality, and time-to-market of our suite
of software solutions.
- Measure and optimize system performance, with an eye toward
pushing our capabilities forward, getting ahead of customer needs,
and innovating for continual improvement.
- Provide primary operational support and engineering for
multiple large-scale distributed software applications.
Is this you?
- You have professional experience with:
  - Model training
  - Hugging Face Transformers
  - PyTorch
  - vLLM
  - TensorRT
  - Infrastructure-as-code tools like Terraform
  - Scripting languages such as Python or Bash
  - Cloud platforms such as Google Cloud, AWS, or Azure
  - Git and GitHub workflows
  - Tracing and monitoring
- You are familiar with high-performance, large-scale ML systems.
- You have a knack for troubleshooting complex systems and enjoy
solving challenging problems.
- You are proactive in identifying problems, performance bottlenecks, and
areas for improvement.
- You take pride in building and operating scalable, reliable, secure
systems.
- You are familiar with monitoring tools such as Prometheus, Grafana, or
similar.
- You are comfortable with ambiguity and rapid change.
Preferred skills and experience:
- 5+ years building core infrastructure.
- Experience running inference clusters at scale.
- Experience operating orchestration systems such as Kubernetes
at scale.
Curious to learn more about who we are and how we operate?
Benefits & perks
- Generous PTO, plus company holidays.
- Medical, dental, and vision coverage for you and your
family.
- Paid parental leave for all parents (12 weeks).
- Fertility and family planning support.
- Early-detection cancer testing.
- Flexible spending account and dependent FSA options.
- Health savings account for eligible plans with company
contribution.
- Annual work-life stipends for:
  - Home office setup, cell phone, internet.
  - Wellness stipend for gym, massage/chiropractor, personal
    training, etc.
  - Learning and development stipend.
- Company-wide off-sites and team off-sites.
- Competitive compensation, company stock options, and 401(k).
Writer
is an equal-opportunity employer and is committed to diversity. We
don't make hiring or employment decisions based on race, color,
religion, creed, gender, national origin, age, disability, veteran
status, marital status, pregnancy, sex, gender expression or
identity, sexual orientation, citizenship, or any other basis
protected by applicable local, state or federal law. Under the San
Francisco Fair Chance Ordinance, we will consider for employment
qualified applicants with arrest and conviction records.
By
submitting your application on the application page, you
acknowledge and agree to .