The new AT&T advertising technology business is on a unique mission with the launch of its newest business, Xandr. This organization, born in August 2018 and built on a foundation of 140 years of AT&T excellence and multiple acquisitions of top assets and people, is driving the advertising industry forward by creating new options for advertisers and publishers to find and reach specific audiences at scale in trusted, premium content environments. A fundamental enabler of our strategy is technology, and we have launched a new technology center of excellence in Bangalore that is searching for candidates who embody the Xandr principles: create with curiosity and courage, believe in better, use data responsibly, pursue differences, reflect and imagine, and teach and learn.
Big Data Systems Operations Engineer I (Hadoop, HBase, Kafka, Big Data Operations & Support)
As a Big Data Systems Operations Engineer, you will operate and support our diverse Data Pipeline Platform, which consists of large Hadoop, HBase and Kafka clusters in an all-Linux environment. The platform currently ingests 300TB of new data and runs 20,000 ETL jobs every day across 8 Hadoop, 5 HBase and 6 Kafka clusters.
About the Team:
This growing team consists of curious, passionate, talented technologists who enjoy working on complex, large scale distributed file and messaging systems.
Our motto is to move fast and sustain optimal uptime. Our team members thrive in a learn and teach environment. Each team member is encouraged to explore solutions and efficiencies to support, optimize and maintain our systems. We are enthusiastic about automation and optimization.
The team manages more than 2,000 Linux servers through extensive automation tooling.
About the Job:
• Support a complex Data Pipeline Platform by monitoring, maintaining, provisioning and upgrading Hadoop, HBase, Kafka, Graph and ETL systems using proprietary automation tools.
• Develop new tools to automate routine day-to-day tasks, such as security patching, software upgrades and hardware allocation. Utilize automated system monitoring tools to verify the integrity and availability of all hardware, server resources and critical processes.
• Troubleshoot and analyze hardware or software failures and provide solutions for recovery. Identify and resolve faults, inconsistencies and systemic issues.
• Collaborate with engineering team partners to resolve complex system performance issues.
• Participate in an on-call rotation, responding to alerts and system issues.
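To give a flavor of the routine-task automation described above, here is a minimal sketch in Python (one of the scripting languages listed below) that parses `df`-style output and flags filesystems above a usage threshold. The sample data and threshold are illustrative assumptions, not details of Xandr's proprietary tooling.

```python
def flag_full_filesystems(df_output: str, threshold: int = 90) -> list:
    """Return mount points whose Use% meets or exceeds the threshold.

    Expects POSIX df-style output: header row, then one line per
    filesystem with Use% in column 5 and the mount point in column 6.
    """
    flagged = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        use_pct = int(fields[4].rstrip("%"))  # e.g. "93%" -> 93
        mount = fields[5]
        if use_pct >= threshold:
            flagged.append(mount)
    return flagged


# Illustrative sample output (made up for this sketch):
sample = """Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 103081248 95866560 7214688 93% /
/dev/sdb1 515928320 123822796 392105524 25% /data"""

print(flag_full_filesystems(sample))  # only "/" is at or above 90%
```

In a real deployment a script like this would feed an alerting hook (e.g. a Nagios check's exit code) rather than printing, but the shape, parsing system output, applying a threshold, emitting a result, is representative of the day-to-day work.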
Qualifications:
• 3+ years of relevant experience implementing, troubleshooting and supporting the Unix/Linux operating system, with solid knowledge of system administration and internals
• 1+ years of relevant experience writing or modifying code for monitoring, deployment and automation in Python, Shell or a comparable language
• 1+ years of relevant experience with any of the following technologies: Hadoop HDFS, YARN/MapReduce, HBase, Kafka
• 1+ years of relevant experience with Puppet or an equivalent configuration management tool
• Familiarity with TCP/IP networking and common protocols (DNS, DHCP, HTTP, etc.)
• Good written and oral communication skills
• Some experience with Nagios or similar monitoring tools
• Some experience with data collection/graphing tools like Graphite and Grafana
• Flexibility to work in 24x7 shifts
More About You:
• You are passionate about a culture of learning and teaching. You love challenging yourself to constantly improve, and sharing your knowledge to empower others.
• You like to take risks when looking for novel solutions to complex problems. If faced with roadblocks, you continue to reach higher to make greatness happen.
• You care about solving big, systemic problems. You look beyond the surface to understand root causes so that you can build long-term solutions for the whole ecosystem.
• You believe in not only serving customers, but also empowering them by providing knowledge and tools.