Operations engineering



My mission at Orange PFS ran from June 2022 to November 2022. I joined right after my previous project, which consisted of building a secure infrastructure.



The environment

My team and I operated a large infrastructure of Red Hat Linux servers, virtualized on VMware in an Orange datacenter. The infrastructure was already in production, so we had to make sure everything worked as intended and frequently update the applications running on it.

This infrastructure was responsible for processing requests from Orange customers who were changing the data plan or options on their phone contract. It was a highly critical environment: if it failed, customers would be heavily impacted. On the Oracle database, we could see how many SQL queries ran each hour, and there were a lot, which meant that many customers were changing their options and plans throughout the day.



My tasks

My main task was to make sure everything was working correctly at all times. This included keeping an eye on the monitoring system, database information, and system logs. I also had to watch for e-mail alerts, and we had an automated ticketing system where incidents were reported when something went wrong.

I also had to be proactive. If I spotted anything that could lead to a failure, I had to try to fix it before it had an impact on the system. This was the responsibility of the whole team as well.

From time to time, we had to update the system and applications. As everything was in production, we had to plan each update carefully. This included reading the documentation closely to avoid errors, making sure a rollback procedure was in place, and installing at night to minimize the impact on users.

For a system as critical as this one, we also had on-call duty, meaning we could be called at any time, 24/7, if there was an incident. We then had a limited time to resolve it before it was escalated.



Teamwork

Teamwork was very important here because of how critical the infrastructure was. Good communication was essential, because any action on the machines had consequences for us, and potentially for Orange customers. We had to make sure everyone was on the same page for every action and decision.

This project helped me understand what running a critical production infrastructure involves. I learned a lot from the amount of caution, precision, and anticipation it takes to keep it running properly.