As a Big Data Systems Operations Engineer, you will build, operate and support our diverse Data Pipeline Platform, which consists of large Hadoop, HBase and Kafka clusters in an all-Linux environment. The platform currently ingests 300 TB of new data and runs 20,000 ETL jobs every day across 8 Hadoop, 5 HBase and 6 Kafka clusters.
About the Team:
This growing team consists of curious, passionate, talented technologists who enjoy working on complex, large-scale distributed file and messaging systems.
Our motto is to move fast and sustain optimal uptime. Our team members thrive in a learn-and-teach environment. Each team member is encouraged to explore solutions and efficiencies to support, optimize and maintain our systems. We are enthusiastic about automation and optimization.
The team manages over 2,000 Linux servers through extensive automation tooling.
About the Job:
- Support a complex Data Pipeline Platform by monitoring, maintaining, provisioning and upgrading Hadoop, HBase, Kafka, Graph and ETL systems using proprietary automation tools.
- Develop new tools to automate routine day-to-day tasks, such as security patching, software upgrades and hardware allocation. Use automated system monitoring tools to verify the integrity and availability of all hardware, server resources and critical processes (a minimal sketch of this kind of check follows this list).
- Troubleshoot and analyze hardware and software failures and drive recovery. Identify and resolve faults, inconsistencies and systemic issues.
- Collaborate with engineering team partners to resolve complex system performance issues.
- Support other teams during incidents or planned maintenance. Coordinate and communicate with affected stakeholders. Drive and participate in postmortems to prevent repeat incidents.
- Participate in on-call rotation, responding to alerts and system issues.
- Modify, enhance and create new standard operating procedures for the team.
- Identify and implement operational best practices and process improvements.
- Develop and foster positive relationships with team members and engineering partners, encouraging knowledge sharing.
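To give a flavor of the day-to-day automation described above, here is a minimal, illustrative Python sketch of a routine health check; the threshold, process name and alerting behavior are hypothetical placeholders, not our actual tooling.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: the kind of routine health check this role
automates. The threshold and process name below are hypothetical."""
import shutil
import subprocess

DISK_USAGE_LIMIT = 0.85     # hypothetical alert threshold
CRITICAL_PROCESS = "kafka"  # hypothetical process to verify

def disk_ok(path: str = "/") -> bool:
    """Return True if disk usage at `path` is under the limit."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < DISK_USAGE_LIMIT

def process_running(name: str) -> bool:
    """Return True if pgrep finds a process matching `name`."""
    result = subprocess.run(["pgrep", "-f", name], capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    problems = []
    if not disk_ok():
        problems.append("disk usage over threshold")
    if not process_running(CRITICAL_PROCESS):
        problems.append(f"critical process '{CRITICAL_PROCESS}' not found")
    # Real tooling would page through the team's monitoring stack instead.
    print("OK" if not problems else "ALERT: " + "; ".join(problems))
```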
About Your Skills:
- 4+ years of relevant experience implementing, troubleshooting and supporting the Unix/Linux operating system, with solid knowledge of system administration and internals
- 3+ years of relevant experience writing and modifying code for monitoring, deployment and automation in Python, Shell or a comparable language
- 2+ years of relevant experience with any of the following technologies: Hadoop HDFS, YARN/MapReduce, HBase, Kafka
- 2+ years of relevant experience with Puppet or an equivalent configuration management tool
- Familiarity with TCP/IP networking and common protocols such as DNS, DHCP and HTTP
- Strong written and oral communication skills with the ability to interface with technical and non-technical stakeholders at various levels of the organization
- Beneficial skills and experience (if you don’t have all of them, you can learn them at Xandr):
- Experience with JVM and garbage collection (GC) tuning
- Regular expression fluency (see the brief sketch after this list)
- Experience with Nagios or similar monitoring tools
- Experience with data collection/graphing tools like Graphite and Grafana
- Experience with tcpdump, Wireshark (formerly Ethereal), tshark and other packet capture and analysis tools
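As a small illustration of the regular-expression fluency mentioned above, the sketch below parses a log line with Python's `re` module; the log format is invented for the example and does not reflect any specific system.

```python
import re

# Hypothetical log format, invented for illustration; real formats vary.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"(?P<msg>.*)$"
)

line = "2024-01-15 03:42:07 ERROR RegionServer heartbeat missed"
match = LOG_PATTERN.match(line)
if match:
    # Named groups keep the extracted fields self-documenting.
    print(match.group("ts"), match.group("level"), match.group("msg"))
```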
More About You:
- You are passionate about a culture of learning and teaching. You love challenging yourself to constantly improve and sharing your knowledge to empower others.
- You like to take risks when looking for novel solutions to complex problems. When faced with roadblocks, you continue to reach higher to make greatness happen.
- You care about solving big, systemic problems. You look beyond the surface to understand root causes so that you can build long-term solutions for the whole ecosystem.
- You believe in not only serving customers, but also empowering them by providing knowledge and tools.