Senior Site Reliability Engineer – Storage
Location: Dallas, TX
Do you want to tackle the biggest questions in finance with near-infinite compute power at your fingertips?
G-Research is a leading quantitative research and technology firm, with offices in London and Dallas. We are proud to employ some of the best people in their field and to nurture their talent in a dynamic, flexible and highly stimulating culture where world-beating ideas are cultivated and rewarded.
This is a hybrid role based in our new Dallas infrastructure hub where we work on the latest technologies in a cutting-edge environment.
The Big Data and Storage Engineering teams currently sit within the Platform as a Service (PaaS) Function at G-Research.
Both teams manage a variety of technologies within our ecosystem, ranging from VAST, Dell Isilon and ECS storage appliances to Big Data platforms such as Hadoop HDFS and Airflow, and distributed computing frameworks such as Spark, YARN and Trino.
We are seeking an experienced Senior Site Reliability Engineer (SRE) to join our PaaS Function. You will have a proven track record in managing and optimizing complex Big Data platforms and/or cutting-edge storage technologies.
We want someone who excels in ensuring the robustness, scalability, and fault tolerance of large-scale data infrastructure. You will have a comprehensive understanding of the intricacies involved in architecting, deploying, and maintaining high-performance storage solutions, coupled with a track record of implementing and enhancing reliability measures within Big Data ecosystems.
This role demands hands-on experience in orchestrating resilient systems, fine-tuning storage performance, and implementing proactive strategies to mitigate potential downtime and disruption. The successful candidate will play a pivotal role in driving the reliability, efficiency, and scalability of our Big Data and Storage systems through innovative solutions and best-in-class practices.
In return, you will gain exposure to the latest hardware and software technologies in a forward-thinking company that values innovation, personal development and training.
Key responsibilities of the role include:
- Leading efforts to enhance existing practices across both teams, fostering collaboration and synchronization to optimize system reliability and scalability
- Driving strategies for enhancing systems performance, leveraging innovative approaches to improve efficiency and streamline processes
- Implementing best practices for system reliability, fault tolerance, and scalability, ensuring alignment with evolving industry standards
- Cultivating a culture of continuous improvement, encouraging regular reviews and iterative enhancements to tools, methodologies, and processes
- Enhancing incident response processes by conducting comprehensive reviews, implementing improvements, and integrating lessons learned into future strategies
- Leading efforts to optimize capacity planning strategies, ensuring systems are prepared for future scaling while maximizing resource utilization
- Collaborating with security teams to fortify and enhance security measures within systems, ensuring compliance with evolving policies and standards
- Collaborating effectively with other SREs within PaaS and colleagues in different time zones (Dallas and London)
Who are we looking for?
You will be an experienced Platform Reliability Engineer who is enthusiastic about contributing to an automated, scalable, reliable and high-performing Big Data and Storage Platform.
The ideal candidate will have the following skills and experience:
- A strong desire to continually learn about new technologies, approaches, and systems, along with the agility to work across multiple teams
- Familiarity with large-scale storage systems, including distributed file systems such as HDFS, object storage such as Amazon S3, and file storage systems
- A self-starter with excellent problem-solving skills
- Proficiency in Python and other programming languages, such as Java, Scala, or Go, for automation and development tasks
- Extensive Linux, networking and infrastructure knowledge
- Experience with CI/CD (preferably Jenkins and ArgoCD) and Configuration Management tools, such as Ansible and Terraform
- Experience deploying and running applications on Docker and Kubernetes, including the creation of Helm charts
- Familiarity with monitoring tools like Prometheus, Grafana and the ELK stack (Elasticsearch, Logstash, Kibana) or similar
- Understanding of core SRE concepts and their implementation in platform engineering
Beneficial experience would include:
- Proficiency in working with various Big Data and Storage technologies, such as Hadoop, Spark, or similar distributed computing frameworks, and VAST, Isilon or similar storage appliances
- Working with Big Data and Storage solutions on cloud platforms such as AWS, Azure, or GCP
Why should you apply?
Market-leading compensation plus annual discretionary bonus
Informal dress code and excellent work/life balance
Excellent paid time off allowance of 25 days
Sick days, military leave, and family and medical leave
Generous 401(k) plan
16 weeks' fully paid parental leave
Medical and Prescription, Dental, and Vision insurance
Life and Accidental Death & Dismemberment (AD&D) insurance
Employee Assistance and Wellness programs
Generous relocation allowance and support
Great selection of office snacks, and hot and cold drinks
On-site gym and car parking
G-Research is committed to cultivating and preserving an inclusive work environment. We are an ideas-driven business and we place great value on diversity of experience and opinions.
We want to ensure that applicants receive a recruitment experience that enables them to perform at their best. If you have a disability or special need that requires accommodation, please let us know in the relevant section.