Senior Site Reliability Engineer - FedRAMP
At Talkdesk, we are courageous innovators focused on redefining the customer experience, making the impossible possible for companies globally. We champion an inclusive and diverse culture representative of the communities in which we live and serve. And, we give back to our community by volunteering our time, supporting non-profits, and minimizing our global footprint. Each day, thousands of employees, customers, and partners all over the world trust Talkdesk to deliver a better way to great experiences.
We are recognized as a cloud contact center leader by many of the most influential research organizations, including Gartner and Forrester. With $498 million in total funding, a valuation of more than $10 billion, and a ranking of #8 on the Forbes Cloud 100 list, now is the time to become part of the Talkdesk legacy and help accelerate our success in a new decade of transformational growth.
At Talkdesk, we embrace FAST, our fundamental operating principles that define who we are as an organization. These principles drive us to make the impossible possible. FAST: Focus + Accountability + Speed = Talkdesker.
- Focus: Focus time, energy, and attention on what is most impactful for the business, and be thoughtful about how and when to partner with others.
- Accountability: Hold self and others accountable to meet commitments and drive results. Accept responsibility for successes and failures.
- Speed: Execute with agility and urgency. Act promptly, decisively, and without delay. Make good and timely decisions that keep the organization moving forward.
- Talkdesker: YOU!
Our Engineering team follows a microservice architecture to build the next generation of Talkdesk, with vertical teams responsible for all decisions concerning their services.
We are looking for Senior Site Reliability Engineers (SREs) who can lead the design, build, and maintenance of high-performance, scalable, and reliable services that serve as the infrastructure foundation for the rest of Talkdesk. The objective is to minimize manual intervention while ensuring high availability and reliability of those components.
We believe in a DevOps philosophy in which every engineering team at Talkdesk is responsible for the software it builds and deploys. SREs play a critical role in ensuring that teams have the tools, practices, and expertise to make that happen in a blame-free culture.
Responsibilities:
- Design, build, harden, and maintain the core infrastructure used by all of Talkdesk’s engineering teams
- Automate every aspect of our infrastructure to remove as much human intervention as possible
- Participate in design reviews and production reviews for new features, products, or pieces of infrastructure
- Help keep existing base infrastructure running smoothly
- Develop effective tooling, alerts, and response procedures to identify and address reliability risks
- Drive and promote protocols on production readiness and operational excellence
- Participate in on-call rotation alongside other engineering teams (opt-in)
- Partner with product engineering teams to debug production outages and carry out action items to improve reliability of those systems
- Plan for evolution and growth of Talkdesk’s infrastructure
Requirements:
- 3+ years of experience working with AWS
- 3+ years of experience working with Linux/Unix systems
- 3+ years of experience with at least one of the following: Bash, Python, Java, or any other JVM-based language (e.g., Kotlin)
- 2+ years of experience with Terraform
- 2+ years of experience with Ansible
- 2+ years of experience with Kubernetes
- 3+ years of experience with at least 3 of the following: messaging systems such as RabbitMQ or Kafka, data stores such as MongoDB, Postgres, MySQL, MariaDB, Redis, Cassandra, or Elasticsearch
- Experience with monitoring tools such as Datadog, New Relic, Grafana, or similar
- Understanding of the importance of observability, and good intuitions about what to measure and how
- Ability to identify time-consuming and error-prone manual tasks and then build tooling to automate them
- Ability to analyze and understand large-scale, complex systems from a reliability and availability perspective
- Ability to identify root causes of instability in a large-scale distributed system, across stacks
- Hold yourself and those around you to higher standards when working with production
- Bring a developer mindset and apply it to infrastructure
- Solution-focused
Nice to haves / Pluses:
- Experience with technologies such as Docker, Consul, Vault, Jenkins, Concourse, Prometheus, Nexus
- Experience with encryption technologies such as GoPass, ACM, KMS, and hashing
- Experience with PaaS-like solutions such as Heroku, Kubernetes, Docker Swarm, Mesos, or OpenStack
- Experience with designing and operating IP networks
Work Environment and Physical Requirements:
Primarily office-environment work with extended periods of sitting or standing and computer-based work. Limited lifting; equipment usage is limited to computer-related equipment (keyboard, mouse, etc.).