Development

Senior DevOps - Site Reliability Engineer

Preferred Location(s): Mumbai, India
Work Type: Full Time

We’re looking for a hands-on, self-directed Senior DevOps Engineer to join our fast-paced startup. You’ll be the first line of defense for production issues, architect robust observability systems, and improve deployment and testing practices. If you thrive in startup environments, enjoy taking ownership, and are comfortable in modern JS/TS stacks, we’d love to meet you.


Top Outcomes – First 3 Months

Implement a reliable observability stack: Leverage Grafana, CloudWatch, and OpenTelemetry within our Node.js and TypeScript codebase.

Stay on top of alerts and issues: Monitor, triage, and fix or escalate production issues with traceability and follow-up.

Reduce system noise: Begin reducing the frequency and volume of unexpected errors.


Top Outcomes – First 12 Months

Improve test coverage: Ensure better code quality and proactively catch regressions.

Own DevOps workflows: Deploy, debug, and maintain infrastructure health autonomously.

Become a core team member: Handle incidents independently and support the evolution of our infra/dev culture.


Key Performance Indicators (KPIs)

Leading Indicators:

Number of alerts and incidents triaged

Trace IDs investigated and logged

Bugs found early and resolved

Tickets opened/closed efficiently

Reduced volume of unhandled or duplicate errors

Lagging Indicators:

Production uptime and stability

Percentage of fixes resolved without handoff

Number of tests added

Reduction in recurring or duplicate issues


Core Responsibilities

Observability & Alerting

Maintain and enhance Grafana dashboards

Integrate and manage CloudWatch alarms and OpenTelemetry traces

Ensure traceability across all systems (CRM, APIs, webhooks, workflows)

Issue Response & Triage

Act as first responder for production issues during working hours

Troubleshoot, escalate with full context, and coordinate incident response

Infrastructure Maintenance

Improve deployment workflows and monitor resource usage

Maintain the health of critical subsystems (queues, sync jobs, memory/CPU)

Testing & QA

Add and improve test coverage once baseline reliability is achieved

Build confidence in deployments through automated testing and regression checks


Candidate Profile

Strong experience with Node.js, TypeScript, and React

Deep knowledge of AWS, Grafana, OpenTelemetry, and CloudWatch

Prior startup experience preferred

Clear, proactive communicator with a bias toward ownership

Available 1:30 AM to 10:30 PM IST, 5 days/week, for on-call responsibilities

Bonus: Experience reviewing pull requests and deploying code regularly


Immediate Tasks

Review an internal RFC for observability and implement it in phases

Refine and own Grafana dashboards; implement meaningful alerts

Ensure consistent trace ID usage throughout the codebase

Improve logging and tracing to increase debuggability

Monitor and respond to production errors daily

Investigate, fix, or escalate recurring system issues


