Site Reliability Engineer – Data Operations

1 June 2024

Job Description

JOB SUMMARY:

You will work closely with our software engineering and data teams to implement and maintain robust data pipelines and infrastructure. Your expertise in Google Cloud Platform (GCP) or Azure, container technologies such as Kubernetes and Docker, and Apache Airflow workflows will be crucial in driving our success.
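As a loose, self-contained sketch of the kind of dependency ordering an Airflow data pipeline provides (the task names below are hypothetical, and Python's standard-library `graphlib` stands in for the Airflow scheduler):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of upstream tasks it
# depends on, mirroring how an Airflow DAG orders work
# (extract >> transform >> load >> report).
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def run_order(pipeline):
    """Return a dependency-respecting execution order for the tasks."""
    return list(TopologicalSorter(pipeline).static_order())
```

In a real Airflow deployment the same ordering would be expressed as a DAG of operators; this sketch only illustrates the scheduling concept.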

PRINCIPAL DUTIES & RESPONSIBILITIES:

  •     Troubleshoot and resolve issues in live production environments, and implement remediation strategies that minimize manual effort.
  •     Manage applications through automation.
  •     Support and monitor new and existing services, platforms, and application stacks.
  •     Engage in improving the lifecycle of services deployment, operations, and refinement.
  •     Provide technical expertise during service impacting events.
  •     Collaborate with other engineers on code reviews, internal infrastructure improvements and process enhancements.
  •     Use scalability testing to measure, tune and optimize system performance.
  •     Participate in periodic 24×7 on-call duties.
  •     Take accountability for resolving outages via a workaround or permanent fix.
  •     Ensure all administration and reports are maintained and kept up to date, including contact information, technical diagrams, and post-major-incident reviews.
  •     Communicate with various stakeholders.
  •     Implement the Incident, Change, and Problem Management processes effectively, and conduct the associated reporting procedures.
  •     Monitor the incidents to ensure that the Service Level Agreement is respected.
  •     Identify, initiate, and conduct incident triage.
  •     Ensure the closure of all resolved and end-user confirmed Incident records.
  •     Establish continuous process-improvement cycles in which process performance, activities, roles and responsibilities, policies, procedures, and supporting technology are reviewed and enhanced where applicable.
  •     Knowledge of application and data monitoring fundamentals (Splunk, OpenTelemetry, Dynatrace, Airflow DAGs).
  •     Knowledge of log parsing, complex Splunk searches, including external table lookups, Splunk data flow, components, features, and product capability. 
  •     Ability to set up alerts from machine-generated data.
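The log-parsing and alerting skills above can be sketched in plain Python. This is a minimal, self-contained illustration (the log format, service names, and threshold are assumptions, not a description of any specific production setup):

```python
import re
from collections import Counter

# Assumed log format: lines containing "ERROR <service>: <message>".
ERROR_PATTERN = re.compile(r"\bERROR\b\s+(?P<service>\w+):")

def count_errors_by_service(log_lines):
    """Tally ERROR lines per service from raw log text."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group("service")] += 1
    return counts

def services_over_threshold(log_lines, threshold=3):
    """Return services whose error count meets or exceeds the alert threshold."""
    counts = count_errors_by_service(log_lines)
    return sorted(s for s, n in counts.items() if n >= threshold)
```

In practice the same idea is expressed as a scheduled Splunk search with an alert action; the snippet only shows the underlying parse-count-threshold pattern.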

REQUIREMENTS:

  •     Education: Bachelor’s Degree or Equivalent.
  •     5+ years of experience in Software Engineering. 
  •     3+ years of experience in Site Reliability. 
  •     Experience with one or more Cloud Platforms (GCP, Azure, AWS).
  •     Experience working with a data workflow management platform such as Apache Airflow. 
  •     Experience with Container technologies: Kubernetes, Docker, PKS.
  •     Experience setting up monitoring for applications and databases.
  •     Experience in third party services and third-party vendor management. 
  •     Excellent verbal, written, and interpersonal communication skills. 
  •     Experience in ServiceNow preferred. 
  •     Experience working with financial data (Metro2, 2052a, etc.) is preferred.