Do you want to tackle the biggest questions in finance with near-infinite compute power at your fingertips?
G-Research is a leading quantitative research and technology firm, with offices in London and Dallas.
We are proud to employ some of the best people in their field and to nurture their talent in a dynamic, flexible and highly stimulating culture where world-beating ideas are cultivated and rewarded.
This is a hybrid role based in our new Dallas infrastructure hub where we work on the latest technologies in a cutting-edge environment.
The role
The Observability Platform team manages the doors – both entry and exit – to the telemetry services that are managed by the Platform Reliability and Observability Team. We ensure that engineers can effectively produce and consume telemetry for their services. This involves working with the Observability Engineering team to build robust pipelines to ingest and route data in predictable, composable ways as well as visualising that data after the fact to drive insight and action.
Under the umbrella of the Platform Engineering department, our group also has responsibility to mature the reliability of our full HPC stack, from networks and storage up to the compute and application platform layers.
We are seeking a Manager with deep expertise in observability stacks. You will understand the unique problems that come with moving cloud-scale volumes of telemetry data, and be excited at the prospect of ensuring our customers have eyes into the same underlying telemetry data to run their services as efficiently as possible.
Knowledge of and experience running observability platforms at scale, serving a wide variety of customers with varying degrees of access, is a strong requirement. Knowledge of core SRE principles is highly beneficial.
Key responsibilities of the role include:
Helping to lead the development of our observability and reliability engineering strategy
Defining and driving the roadmap for observability tooling, ensuring alignment with business goals and scalability requirements
Working with telemetry data at enormous scale, ingesting data from industry-leading GPU clusters
Acting as the lead for all observability efforts related to AWS services, ensuring seamless integration with the observability platform
Collaborating with engineering leadership to establish observability as a core function of the development lifecycle
Working closely with application teams to ensure observability systems are fully integrated and providing the necessary insights
Enabling SRE frameworks, promoting SLAs, SLOs and SLIs, and working closely with platform teams to ensure reliability is constantly improving
Growing, adapting and investing in your team, fostering a culture of continuous learning and improvement, encouraging adoption of new observability tools and techniques
Who are we looking for?
The ideal candidate will have the following skills and experience:
Proven experience leading observability or SRE teams in a cloud-native or hybrid-cloud environment, running platforms in production and at scale
Well versed in reliability engineering concepts, including different types of testing, progressive deployments, error budgets, fault-tolerant design and the role observability plays
Hands-on experience with modern observability tools and frameworks such as Prometheus, OpenTelemetry (OTel) and Grafana, and with enterprise SaaS observability platforms such as Datadog and Dynatrace
Expertise in designing, building and scaling observability solutions for distributed systems
Customer-focused, with an enthusiasm for providing infrastructure as a service and defaulting to a product lens when evaluating platform-scale problems
Excellent communication skills and the ability to collaborate with cross-functional teams
Leadership experience with demonstrated success in mentoring and developing technical talent
Experience with cloud platforms, such as AWS, Azure or Google Cloud
Familiarity with microservices architecture and containerized environments, such as Kubernetes and Docker
Knowledge of infrastructure as code (IaC) and automation tools, such as Terraform and Ansible
Why should you apply?
Market-leading compensation plus annual discretionary bonus
Lunch provided in the office (via GrubHub)
Informal dress code and excellent work/life balance
Excellent paid time off allowance of 25 days
Sick days, military leave, and family and medical leave
Generous 401(k) plan
16 weeks’ fully paid parental leave
Medical and prescription, dental, and vision insurance
Life and Accidental Death & Dismemberment (AD&D) insurance
Employee Assistance and Wellness programs
Generous relocation allowance and support
Great selection of office snacks, and hot and cold drinks
Free on-site gym and car parking
This role is employed through our US affiliate.