Ever wonder how we keep Cloudflare’s global network running smoothly — even when hardware fails?
My team built an autonomous diagnostics and recovery system that can detect, troubleshoot, and repair hardware issues at scale, without human intervention.
From failed memory to power hiccups, our system turns reactive firefighting into proactive resilience.
Read the full blog here:
https://blog.cloudflare.com/autonomous-hardware-diagnostics-and-recovery-at-scale/
#Cloudflare #Infrastructure #Diagnostics #Automation #SRE
Whenever people ask me, “What is the first thing you consider when you need to scale?” my answer is automatic: “Invest in people.”
However, this answer often fails to convince fellow engineers in the room, who hear it as merely “throwing more people at the problem” – more hands, less work – as if software were a factory assembly line. The reality is far from that simplistic view: scaling software development demands a nuanced approach that goes well beyond adding manpower.
In 2009, while working at DS2 as a Sys Admin in Valencia, I found myself in charge of building and running a Condor cluster (now known as HTCondor) to support large-scale simulations for Power Line Communication (PLC) systems.
At the time, I didn’t think of it as “HPC” or “distributed computing.” It was just… a way to get simulations done before the deadline.
Looking back, that experience taught me more about systems architecture, parallelism, and infrastructure-driven engineering than I realized at the time — and today, as I learn AI/ML, I keep recognizing echoes of those lessons.