Description:
As a Big Data Systems Engineer on the Big Data Operations (BDO) Team, you will be responsible for building, operating and supporting our diverse Data Pipeline Platform, which consists of large Hadoop, HBase and Kafka clusters in an all-Linux environment. The platform currently ingests 250TB of new data and runs 20,000 ETL jobs every day across 8 Hadoop, 4 HBase and 6 Kafka clusters.
About the Team:
This growing team consists of curious, passionate, talented technologists who enjoy working on complex, large-scale distributed file and messaging systems.
Our motto is to move fast and sustain optimal uptime. Our team members thrive in a learn-and-teach environment, and each is encouraged to explore solutions and efficiencies that support, optimize and maintain our systems. We are enthusiastic about troubleshooting, automation and optimization.
The team manages over 2,000 Linux servers through extensive automation tooling.
About the Job:
• Monitor, maintain, provision and upgrade Hadoop, HBase and Kafka systems to support a complex Data Pipeline Platform.
• Participate in an on-call rotation, responding to alerts and system issues for Hadoop, HBase, Kafka and more.
• Troubleshoot, repair and recover from hardware or software failures. Identify and resolve faults, inconsistencies and systemic issues. Coordinate and communicate with impacted constituencies.
• Manage user access and resource allocations for the Data Pipeline Platform.
• Develop tools to automate routine day-to-day tasks such as security patching, software upgrades and hardware allocation. Utilize automated system monitoring tools to verify the integrity and availability of all hardware, server resources and critical processes.
• Create new standard operating procedures for the team and focus on updating existing documentation for the team.
• Engage other teams during outages or planned maintenance.
• Administer development, test, QA and production servers.
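The automation of routine tasks described above often starts with small, testable building blocks. As a purely illustrative sketch (the function name, fleet data and 30-day patch-window policy below are hypothetical examples, not Xandr's actual tooling), a helper that selects hosts due for security patching might look like:

```python
from datetime import date, timedelta

# Hypothetical policy: a host is due for patching once its last
# security patch is older than the patch window (assumed 30 days).
PATCH_WINDOW = timedelta(days=30)

def hosts_due_for_patching(last_patched, today=None):
    """Return host names whose last patch date exceeds PATCH_WINDOW.

    last_patched: dict mapping host name -> date of last security patch.
    today: override for deterministic testing; defaults to date.today().
    """
    today = today or date.today()
    return sorted(
        host for host, patched in last_patched.items()
        if today - patched > PATCH_WINDOW
    )

# Example with a fixed "today": only the stale host is selected.
fleet = {
    "hadoop-dn-01": date(2020, 1, 2),    # 59 days stale
    "kafka-broker-03": date(2020, 2, 20),  # 10 days stale
}
print(hosts_due_for_patching(fleet, today=date(2020, 3, 1)))  # → ['hadoop-dn-01']
```

In practice a helper like this would feed a larger orchestration workflow (for example, batching the returned hosts so that only a safe fraction of a Hadoop or Kafka cluster is taken down for patching at once).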
Qualifications:
• 4+ years of relevant experience implementing, troubleshooting and supporting the Unix/Linux operating system, with solid knowledge of system administration and internals
• 3+ years of relevant experience in scripting/writing/modifying code for monitoring/deployment/automation in one of the following (or comparable): Python, Shell, Go, Perl, Java, C
• 2+ years of relevant experience with any of the following technologies: Hadoop-HDFS, Yarn-MapReduce, HBase, Kafka
• 2+ years of relevant experience with any of the following technologies: Puppet, Chef, Ansible or equivalent configuration management tool
• Familiarity with TCP/IP networking and protocols such as DNS, DHCP and HTTP
• Strong written and oral communication skills with the ability to interface with technical and non-technical stakeholders at various levels of the organization
• Beneficial skills and experience (if you don’t have all of them, you can learn them at Xandr):
• Experience with JVM and garbage collection (GC) tuning
• Regular expression fluency
• Experience with Nagios or similar monitoring tools
• Experience with data collection/graphing tools like Cacti, Ganglia, Graphite and Grafana
• Experience with tcpdump, tshark, Wireshark (formerly Ethereal) and other packet capture and analysis tools
More About You:
• You are passionate about a culture of learning and teaching. You love challenging yourself to constantly improve, and sharing your knowledge to empower others
• You like to take risks when looking for novel solutions to complex problems. If faced with roadblocks, you continue to reach higher to make greatness happen
• You care about solving big, systemic problems. You look beyond the surface to understand root causes so that you can build long-term solutions for the whole ecosystem
• You believe in not only serving customers, but also empowering them by providing knowledge and tools
#XandrLife means we’re creating an incredible experience for our people, too. Let our employees show you what it’s really like to work here.