Lead Site Reliability Engineer

Bengaluru | SRE | Full-time

MoEngage is an intelligent customer engagement platform, built for customer-obsessed marketers and product owners. We enable hyper-personalization at scale across multiple channels like mobile push, email, in-app, web push, on-site messages, and SMS. With AI-powered automation and optimization, brands can analyze audience behavior and engage consumers with personalized communication at every touchpoint across their lifecycle.

Fortune 500 brands and enterprises across 35 countries, such as Deutsche Telekom, Samsung, Ally Financial, Vodafone, and McAfee, along with internet-first brands such as Flipkart, Ola, OYO, Bigbasket, and Tokopedia, use MoEngage to orchestrate their cross-channel campaigns and engage efficiently with their customers, sending 50 billion messages to 500 million consumers every month!

Our vision is to build the world’s most trusted customer engagement platform for the mobile-first world.

We promise to care about your customers as much as you do. That commitment is reflected in our top ratings for service and support in Gartner Magic Quadrant, Gartner Peer Insights, and G2 Summer Reports. We have also been recognized as one of the 25 Highest Rated Private Cloud Computing Companies To Work For in a list released by Battery Ventures, a global investment firm. The list is based on employee feedback on Glassdoor, where employees reported the highest levels of satisfaction at work during the first six months of the pandemic.

Here are some of the challenging areas you can expect to work on as part of the SRE team:

  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system reliability.
  • Work closely with team members to ensure best practices and strategic goals are incorporated into development work.
  • Collaborate with other engineering teams to identify and anticipate changing requirements and opportunities to improve the development environment.
  • Monitor at scale with VictoriaMetrics and similar tools.
  • Orchestrate and manage workloads with Kubernetes (K8s) and similar tools.
  • Implement best practices, challenge the status quo, and keep tabs on industry and technical trends, changes, and developments to ensure the team is always striving for best-in-class work.
  • Manage capacity, build security into every layer, and reduce cost
  • Implement secure networking, key management, user management, access management, process management, and image management.
  • Effectively lead and manage team deliverables (short- and long-term), including project planning and coaching, quarterly reviews, participation in the selection process for new hires, and technical and non-technical guidance to the team.

Skill Requirements:

  • Proven experience in handling large infrastructure and distributed systems like YARN, Kubernetes, Elasticsearch, and Kafka.
  • Familiarity with Python-related technologies and frameworks like Falcon, Django, or Pyramid.
  • Experience with Unix/Linux operating system internals and administration (e.g. filesystems, inodes, system calls, etc.) or networking (e.g. TCP/IP, routing, network topologies and hardware, SDN, etc.).
  • Familiarity with cloud computing infrastructure, preferably Azure.
  • Familiarity with task queue frameworks like Celery or Pika is a plus.
  • Experience with source code management and implementation of security best practices.
  • Deep understanding of modern software architectures, including load-balancing, queueing, caching, distributed systems failure modes generally, microservices, and big data technologies
  • Know-how in gathering metrics across distributed systems (instances/containers) and generating automated notifications and reports.
  • Prowess in analyzing application bottlenecks and performance degradation, and in implementing automated processes/tools to detect such anomalies.
  • Good understanding of, and implementation experience with, 12-factor app principles.

Mandatory Skills:

  • 8-10 years of experience on the AWS/Azure platform.
  • Excellent programming (Python, Go, Ruby, or preferred scripting languages) and automation skills
  • Deep understanding of container orchestration technologies, particularly Kubernetes.
  • Prior experience in migrating high-throughput services to Kubernetes.
  • Expertise in CI/CD tools and in build, artifact, packaging, and service discovery management tools. GitOps preferred.
  • Expertise in centralized logging systems, metrics, and tooling frameworks such as ELK, Prometheus/VictoriaMetrics, and Grafana.
  • Great communication, interpersonal, and teamwork skills.
  • Experience with AWS/Azure cost explorer, billing analysis, and various cost optimization techniques.
  • Awareness of cloud security concepts.
  • Awareness of information security concepts and best practices.

Good to have:

  • AWS/Azure cloud certification preferred
  • Certified Kubernetes Administrator (CKA) certification.
  • Certified Kubernetes Application Developer (CKAD) certification.
  • Experience with configuration management tools and strong code analysis skills in Python
  • Experience in working with APM tools like New Relic.

We handle more than a billion messages every day. Rest assured, you will be surrounded by really smart and passionate people as we continue to scale and build a world-class technology team.