John Keells Holdings PLC
Senior Platform Engineer – Pricing Platform
Job Description
Your day to day
– Translate incident management and root cause analysis recommendations into improvement points for the platform.
– Work with the latest (automation) tooling with a strong focus on performance, reliability, observability and security.
– Define platform lifecycle management, resilience patterns, architecture and roadmaps together with solution, domain and enterprise architects.
– Align expected platform changes with stakeholders and financial controllers, and report on platform volumes to the area lead.
– Present platform and automation best practices to the team and at internal and external engineering events.
– Report on the state of IT risk and security controls on the platform as per the ING Information Security Management Policy.
– Apply CI/CD using Azure DevOps as well as remote operations on the platform.
– Through Agile/Scrum, collaborate with the other engineers to bring new sprint releases live to Acceptance and Production every 2-4 weeks.
– You are committed to staying up to date with the latest developments in HPC and cloud technology; participating in relevant workshops, conferences, and training programs is part of your nature.
– You meet frequently with product managers, analysts and researchers to gather and incorporate stakeholder feedback to improve HPC services.
What you’ll bring to the team
Experience: 5+ years of software engineering / operations experience
Tech stack / knowledge:
Mandatory:
– IT Operations/Support experience combined with analytical skills to identify root causes in incidents (data, technical, functional).
– Strong understanding of high performance computing environments, including HPC (GPU) clusters, parallel computing principles, and distributed computing principles and techniques.
– Strong understanding of using GPU technology as a computational accelerator and proficiency in cluster management and job scheduling systems (e.g., DataSynapse, Slurm, PBS, LSF).
– Knowledge of GPU architectures and technologies (e.g., NVIDIA CUDA, AMD ROCm).
– Experience with deploying and maintaining parallel programming models and libraries (e.g., MPI, OpenMP, CUDA), middleware and supporting software.
– Ability to identify and contribute to resolving performance bottlenecks in HPC applications via monitoring / observability practices.
– Knowledge of CI/CD, experience with Git, Python, Ansible, Shell scripting and working experience with monitoring practices and alerting tools.
– Strong Linux (RHEL 8 or 9) and Azure DevOps experience, pipeline and Ansible skills, and experience working with certificates / encryption technology.
– Strong experience in translating computational requirements to IT concepts like system sizing.
– Experience with Grafana and alerting tools like Prometheus, as well as a strong understanding of complex subsystem monitoring and alerting.
– Good understanding of the ELK Stack and how to interact with it.
– Experience in mentoring junior engineers and providing technical guidance.
– Thoroughness in testing and validating configurations, optimizations, system reliability and performance.
– Education at Master level with a strong analytical background in Computer/Data Science, Cybernetics, Software Engineering, Financial Engineering or equivalent.
– Due to the cross-border nature of IT teams at ING, we ask that English (advanced) is part of your skillset.
Nice to have:
– Familiarity with Oracle 12c/19c with PL/SQL.
– Experience with cloud-based HPC solutions (e.g., AWS, Azure) and understanding of hybrid HPC environments and cloud integration is a plus.
– Affinity with (GPU) programming languages and frameworks (e.g., CUDA, OpenCL, PyTorch) is a plus.
– Familiarity with in-memory caching tools (Apache Ignite (GridGain), Redis et al.).
– Familiarity with shared storage configuration and design.
– Programming skills in languages such as C, C++, and Java are a plus.
– Familiarity with Docker and container orchestration (Kubernetes, OpenShift et al.).
– Good Linux networking skills.