A day in the life of a QPO Engineer
Edmund Heyes is a QPO Engineer here at G-Research, and he shared with us what a typical day is like in his role.
I’m usually in early and leave early on Thursdays, as I like to take advantage of G-Research’s flexible hours. Today it looks like I’ve beaten the cleaners in, which is a first!
The first thing I do is check the Adhoc SIM I kicked off last night. It’s used up all the space in the Adhoc share, which is obviously no longer fit for purpose as these SIMs are so big. But wait – the SIMs are big, but how come the share is so small?! I’m aware of a weekend issue after which our shares were recreated – this one with significantly less space! After a quick chat over Slack, our InfraOps lead confirms they will resize the share again if I send in a ticket. With that sorted, I’m able to give the developer an update, and luckily he’s not unhappy. It turns out he has further fixes he wants to get into the Adhoc SIM.
I then check the status of our jobs in the job scheduler. Both sets of the more problematic jobs have completed, and it’s a nice feeling to be able to close those tickets. I decide to tell Sahil his recent patch seems to have done the job and we want it checked into the master branch.
I close any related outstanding alerts in the alert manager – not all alerts close by themselves. For the daily jobs that failed yesterday (due to general breakage after an unsuccessful network change), my Python script seems to have simulated all the missing requests, as the jobs relying on the resulting copied files have all completed. I load up my Jupyter notebook and rerun the same script to get files from the failover sites – I have a canned DB query that will show me when they appear and don’t anticipate this taking long.
I take a quick pause and a breath, before scanning the current set of tickets, closing those I finished yesterday. As for today’s jobs, I note down that there are several failures, all but one of which were expected given Andy’s explanation during yesterday’s standup meeting. The exception is particularly intriguing and requires further investigation. I grab some notes from the team playbook, and query the DB to see what’s going on under the hood.
Presumably this is partly a result of the unsuccessful network change – this job seems to have broken ranks and fired before its previous day’s counterpart had successfully run. David, one of the Devs in the team, is in similarly early and spends a good 20 minutes taking me through the code. Together we work out what happened: I had ‘cancelled’ a SIM via the GUI after the network problem, but hadn’t ensured all associated jobs were stopped before clearing the results from the DB. One of these jobs subsequently updated the DB again, leaving it in an inconsistent state and fouling my rerun.
Having solved the issue, I tidy things up and rerun the job.
Status all good across the board – time for breakfast. It’s not long now before the standups – I’ll take the opportunity to check on the status of a few longer-running ‘on hold’ tickets I have.
QPO standup meeting. This is a gathering of the larger DevOps team I’m part of.
Attribution standup. This is a gathering of the Development/Reporting team I’m placed within. I let everyone know that operationally we’re in a good state.
I spend a while thinking about how to properly allocate memory for a new type of Adhoc job I’m running, such that I don’t also bump the memory for other similar jobs unnecessarily. The farm is constantly in use, and working out how best to balance the resource needs of individual jobs with the throughput requirement of the different workflows is a constantly evolving problem. Starve a job of memory and it can fail or take a long time to run. An overly generous allocation will prevent other jobs running in parallel on the same node.
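The trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the node and job sizes are made-up assumptions, not figures from the farm, and it ignores CPU, scheduler overhead and anything else that limits parallelism in practice.

```python
def jobs_per_node(node_memory_gb: int, job_memory_gb: int) -> int:
    """How many jobs of a given memory allocation fit on one node.

    Hypothetical helper: considers memory only.
    """
    if job_memory_gb <= 0:
        raise ValueError("job_memory_gb must be positive")
    # Integer division: a job cannot run in a partial memory slot.
    return node_memory_gb // job_memory_gb


# Assumed example: a 256 GB node.
tight = jobs_per_node(256, 16)     # 16 jobs run in parallel
generous = jobs_per_node(256, 24)  # only 10 jobs fit; 16 GB of the node sits idle
```

Bumping every similar job from 16 GB to 24 GB in this made-up case would cut the number of jobs a node can run at once from 16 to 10, which is why it helps to raise the allocation only for the job type that needs it.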
I’ve written a tool to help manage resource allocation. It works well, but it’s a work in progress and setting initial requirements is still a bit of a dark art. I think the best bet is probably to investigate how requirements have scaled in another similar scenario and use some SQL-fu to manually create the initial setup.
Next up is a retro meeting with reps from the Quant side, Infra side and Attribution to go over what issues have cropped up over the last month and what’s coming down the pipeline. It’s an interesting discussion that covers a variety of topics, but no direct actions for me.
I notice that there are some failures in another batch of resource-hungry jobs that I’d started yesterday. Surprise, surprise – they’ve run out of memory. Now that these jobs have run and failed, however, my tool ought to be able to pick up on that and bump the allocations where appropriate – why isn’t it picking up on these failures?
Ah – apps update a DB table with their resource usage on shutting down, and these stats feed my tool. It turns out that doesn’t work so well when the app crashes out due to a lack of memory. I can see how to get around this, though – essentially by rearranging a few SQL queries and using an extra source of data.
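One way the idea could look is sketched below. It is not the actual tool: the table names (`job_runs`, `resource_usage`), the columns and the 1.5x bump factor are all hypothetical, and it assumes the extra source of data is a scheduler-side table that records each job's final status even when the app itself crashes. A failed job with no usage row is treated as a suspected out-of-memory crash, which is a heuristic rather than proof.

```python
import sqlite3


def jobs_missing_usage(conn: sqlite3.Connection) -> list:
    """Failed jobs that never wrote a usage row (suspected crash before shutdown)."""
    query = """
        SELECT j.job_id, j.allocated_mb
        FROM job_runs AS j
        LEFT JOIN resource_usage AS u ON u.job_id = j.job_id
        WHERE j.status = 'FAILED' AND u.job_id IS NULL
        ORDER BY j.job_id
    """
    return conn.execute(query).fetchall()


def suggest_bumps(conn: sqlite3.Connection, factor: float = 1.5) -> dict:
    """Propose a larger allocation for each suspected out-of-memory job."""
    return {job_id: int(mb * factor) for job_id, mb in jobs_missing_usage(conn)}


def build_demo_db() -> sqlite3.Connection:
    """In-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE job_runs (job_id INTEGER PRIMARY KEY, status TEXT, allocated_mb INTEGER);
        CREATE TABLE resource_usage (job_id INTEGER, peak_mb INTEGER);
        -- Job 1 succeeded; job 2 failed without reporting; job 3 failed but did report.
        INSERT INTO job_runs VALUES (1, 'OK', 4096), (2, 'FAILED', 4096), (3, 'FAILED', 8192);
        INSERT INTO resource_usage VALUES (1, 3500), (3, 2000);
    """)
    return conn


conn = build_demo_db()
bumps = suggest_bumps(conn)  # only job 2 is flagged
```

The `LEFT JOIN` keeps every failed job, even when the usage table has nothing for it, and the `IS NULL` filter picks out exactly the jobs the usage-driven logic would otherwise never see.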
Having figured that out, I’m happy I now understand what’s going on – nice to realise that with some thought, this tool will become much more useful!
Happy days – it’s time to head off to lunch.
After lunch, I spend a bit of time trying to remember how to use pandas to update the tool.
The InfraOps team have now boosted the Adhoc share back to its original size. I kick off the SIM once again with the developer’s updated build.
Just after 16:00, I leave promptly to get my kids. I’ll get back to the pandas problem tomorrow.
If you like the sound of Edmund’s day-to-day, and you’re interested in joining G-Research, then check out our current job openings.