Site Reliability Engineer – Data Operations

1 June 2024

Job Description

JOB SUMMARY:

You will work closely with our software engineering and data teams to implement and maintain robust data pipelines and infrastructure. Your expertise in Google Cloud Platform (GCP) or Azure, container technologies such as Kubernetes and Docker, and Apache Airflow workflows will be crucial in driving our success.
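As a loose, self-contained sketch of the kind of dependency ordering an Airflow data pipeline provides (the task names below are hypothetical, and Python's standard-library `graphlib` stands in for the Airflow scheduler):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of upstream tasks it
# depends on, mirroring how an Airflow DAG orders work
# (extract >> transform >> load >> report).
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def run_order(pipeline):
    """Return a dependency-respecting execution order for the tasks."""
    return list(TopologicalSorter(pipeline).static_order())
```

In a real Airflow deployment the same ordering would be expressed as a DAG of operators; this sketch only illustrates the scheduling concept.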

PRINCIPAL DUTIES & RESPONSIBILITIES:

  •     Troubleshoot and resolve issues in live production environments, and implement remediation strategies that minimize manual effort.
  •     Manage applications through automation.
  •     Support and monitor new and existing services, platforms, and application stacks.
  •     Engage in improving the lifecycle of services deployment, operations, and refinement.
  •     Provide technical expertise during service impacting events.
  •     Collaborate with other engineers on code reviews, internal infrastructure improvements and process enhancements.
  •     Use scalability testing to measure, tune and optimize system performance.
  •     Participate in periodic 24×7 on-call duties.
  •     Take accountability for resolving outages via a workaround or permanent fix.
  •     Ensure all administration and reports are maintained and kept up to date, including contact information, technical diagrams, and post-major-incident reviews.
  •     Communicate with various stakeholders.
  •     Implement the Incident, Change, and Problem Management processes effectively, and conduct the associated reporting procedures.
  •     Monitor the incidents to ensure that the Service Level Agreement is respected.
  •     Identify, initiate, and conduct incident triage.
  •     Ensure the closure of all resolved and end-user confirmed Incident records.
  •     Establish continuous process-improvement cycles in which process performance, activities, roles and responsibilities, policies, procedures, and supporting technology are reviewed and enhanced where applicable.
  •     Knowledge of application and data monitoring fundamentals (Splunk, OpenTelemetry, Dynatrace, Airflow DAGs).
  •     Knowledge of log parsing, complex Splunk searches, including external table lookups, Splunk data flow, components, features, and product capability. 
  •     Ability to set up alerts from machine-generated data.
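The log-parsing and alerting skills above can be sketched in plain Python. This is a minimal, self-contained illustration (the log format, service names, and threshold are assumptions, not a description of any specific production setup):

```python
import re
from collections import Counter

# Assumed log format: lines containing "ERROR <service>: <message>".
ERROR_PATTERN = re.compile(r"\bERROR\b\s+(?P<service>\w+):")

def count_errors_by_service(log_lines):
    """Tally ERROR lines per service from raw log text."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group("service")] += 1
    return counts

def services_over_threshold(log_lines, threshold=3):
    """Return services whose error count meets or exceeds the alert threshold."""
    counts = count_errors_by_service(log_lines)
    return sorted(s for s, n in counts.items() if n >= threshold)
```

In practice the same idea is expressed as a scheduled Splunk search with an alert action; the snippet only shows the underlying parse-count-threshold pattern.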

REQUIREMENTS:

  •     Education: Bachelor’s Degree or Equivalent.
  •     5+ years of experience in Software Engineering. 
  •     3+ years of experience in Site Reliability. 
  •     Experience with one or more Cloud Platforms (GCP, Azure, AWS).
  •     Experience working with a data workflow management platform such as Apache Airflow. 
  •     Experience with Container technologies: Kubernetes, Docker, PKS.
  •     Experience setting up monitoring for applications and databases.
  •     Experience in third party services and third-party vendor management. 
  •     Excellent verbal, written, and interpersonal communication skills. 
  •     Experience in ServiceNow preferred. 
  •     Experience working with financial data (Metro2, 2052a, etc.) is preferred.