Measuring Performance the Right Way
A Holistic Approach to Evaluating Engineers Beyond Short-Term, Visible Impact
Today’s article is a guest post by Jose, Senior Data Science Lead. Jose is the author of one of my favorite newsletters about Data Engineering, Data Science, and Leadership! Thank you, Jose, for sharing your experiences, lessons, and insights with all of us.
"What gets measured gets managed." But, what if we are measuring the wrong things?
In most organisations, revenue, cost, and user-driven metrics (acquisition or retention) are the North Star metrics. And if you build solutions that significantly move these, then of course you should be recognised.
However, not all impact is equal. Today, I want to share with you 5 real stories where traditional impact measurement falls short. Each story will challenge a common assumption:
The £1M that wasn’t fully recognized. Why a machine learning improvement that generated £1M in revenue wasn’t considered strategic.
The £1M that was recognized (and why context matters). A parallel example where a similar financial outcome led to much greater recognition.
The problem with recognition-based impact. How a team that provided company-wide enablement received 10x the visibility but wasn’t necessarily creating 10x the impact.
The challenge of measuring slow-moving metrics. Why long-term improvements often go unrewarded, even when they fundamentally change an organisation’s capabilities.
The amplifier effect: impact that multiplies over time. How a single contributor reshaped engineering standards and de-risked projects without producing a measurable metric.
Story #1: Flights ranking: the £1M that wasn’t fully recognized
"A £1M revenue boost sounds like a success story. So why wasn’t it enough?"
Let’s start with a case that should, on paper, be an easy win.
One of our senior data scientists was working on flights ranking, a core component of how we optimize search results for travelers. The goal was to improve how we rank flight options, leading to better user engagement and, ultimately, higher conversion rates.
They successfully deployed an experiment that increased redirect rates by a small percentage, but one that translated into an additional £1M in annualized revenue.
A million pounds. That is real, measurable impact.
So, why wasn’t this a clear-cut recognition case?
The issue wasn’t the outcome. It was how the result was achieved.
Instead of developing a novel ranking approach or strategically enhancing our model, the senior data scientist relied on brute-force optimization. The approach itself required persistence, patience, and solid execution, but not deep innovation.
And to be clear: this is not a bad approach. It’s a completely valid way to optimize machine learning models. In fact, for a junior or mid-level data scientist, this would be an outstanding contribution. But for a senior? The expectation is not just to find a better model, it’s to think more strategically about how we get there.
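To make "brute-force optimization" concrete, here is a minimal sketch of what that kind of work tends to look like: sweep a grid of hyperparameters, score every combination offline, and keep the best one. The model, grid, and metric below are simplified stand-ins, not the actual flights ranking setup.

```python
# Minimal, illustrative sketch of brute-force hyperparameter search for a
# ranking model. Model choice, grid values, and scoring are assumptions.
from itertools import product

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score


def brute_force_search(X, y):
    """Exhaustively score every hyperparameter combination offline."""
    grid = {
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
    }
    best_score, best_params = float("-inf"), None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        model = GradientBoostingRegressor(**params)
        # Offline proxy for the online redirect rate: higher is assumed better.
        score = cross_val_score(model, X, y, cv=3).mean()
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

Valuable, disciplined work, but nothing in it changes how we frame the ranking problem itself.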
(PS: If you are interested in knowing more about our competency framework, check this article where I go into it in detail)
Recognition vs. long-term strategic growth
This is where the distinction between recognition and career progression becomes important.
Recognition is deserved. This person delivered real business value. It’s entirely reasonable to acknowledge their effort. A stronger bonus or public appreciation is a great way of doing so.
But was it a promotion-worthy achievement? Not necessarily. Strategic, senior-level impact goes beyond just testing variations. It involves pushing boundaries, introducing new frameworks, or fundamentally improving our approach.
If we start promoting based solely on absolute impact, without considering how the impact was achieved, we set the wrong precedent:
We risk incentivizing brute-force, low-leverage solutions instead of encouraging deep problem-solving.
We signal that optimization work alone is enough to climb to the next level.
We fail to differentiate between impact that any capable data scientist could achieve and impact that truly redefines our competitive advantage.
So, the real challenge is to figure out whether we are simply rewarding results, or building a culture that encourages the right kind of impact.
Next, let’s look at another £1M success story—but this time, one that was fully recognized.
Story #2: Hotel ranking: the other £1M, but this time, it was recognised
"Two teams. Two ranking models. Two £1M revenue boosts. So why was this one seen differently?"
On the surface, the outcomes look identical. The flights ranking experiment and the hotel ranking project both increased revenue by £1M per year. But the effort, complexity, and long-term impact of the two could not have been more different. Unlike the flights team, the hotel ranking team wasn’t tuning an existing system:
They had to design and implement a brand-new ranking system from scratch.
They worked cross-functionally with engineers, data scientists, and product teams to get it production-ready.
They integrated machine learning models, heuristics, and operational constraints to ensure accuracy.
The heuristics they initially used may not have been more complex than the brute-force approach used in flights ranking. But the difference is they weren’t just iterating on a model; they were creating the entire infrastructure that powered ranking.
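To make the contrast concrete, here is a minimal sketch of what blending a model score with heuristics and an operational constraint can look like inside a ranking function. The fields, weights, and function names are illustrative assumptions, not the production system.

```python
# Illustrative sketch: blend an ML relevance score with a heuristic signal,
# after enforcing an operational constraint. Names and weights are assumptions.
from dataclasses import dataclass


@dataclass
class HotelCandidate:
    hotel_id: str
    model_score: float    # relevance predicted by the ML model
    review_rating: float  # heuristic signal, e.g. a 0-5 star rating
    is_available: bool    # operational constraint: can we actually book it?


def rank_hotels(candidates: list[HotelCandidate]) -> list[HotelCandidate]:
    """Drop unbookable hotels, then order the rest by a blended score."""
    bookable = [c for c in candidates if c.is_available]

    def blended(c: HotelCandidate) -> float:
        # Weighted blend of the model output and a simple heuristic.
        return 0.8 * c.model_score + 0.2 * (c.review_rating / 5.0)

    return sorted(bookable, key=blended, reverse=True)
```

The interesting part isn’t the arithmetic; it’s that the team had to build and own every layer feeding into a function like this.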
Same financial impact, but a different kind of work
So why did this team receive more recognition than the flights ranking team? Because absolute impact isn’t the only factor. Context matters.
Flights Ranking was an optimisation. They tuned parameters and ran brute-force experiments to squeeze out incremental gains.
Hotel Ranking was an architectural rebuild. They created an entirely new framework that could support further innovation.
When evaluating impact, we need to ask:
Is this solving a high-leverage problem? The hotel ranking system improved not just one model but the entire infrastructure, unlocking future enhancements beyond this single £1M win.
Does this scale beyond the immediate project? The new system can support future features and ML improvements, creating a compounding effect.
Is this work foundational? Unlike a parameter tweak, this was a major system investment that elevated our ranking capabilities across the company.
The key element in this story is that we made sure to recognise and reward the work that raised the bar.
Next, let’s shift the conversation from financial impact to another flawed metric: recognition-driven measurement.
Story #3: Enablement teams: The problem with visibility-based recognition
"If a team gets 10x more public praise, does that mean they are 10x more impactful?"
So far, we have examined impact through the lens of measurable business value. But what happens when impact is measured by recognition rather than actual contribution?
This is where visibility bias comes into play.
In our data science discipline, we have a centralised enablement team responsible for maintaining and improving our A/B testing statistical engine.
As part of their role, they:
Provide support for teams across the company running experiments.
Host “green flag” Q&A sessions to help teams interpret results.
Ensure that our internal experimentation framework is used correctly.
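For a sense of what that statistical engine automates, here is a minimal sketch of a two-proportion z-test on redirect rates, a typical significance check in this kind of tooling. The function and the numbers are illustrative, not our internal framework’s API.

```python
# Illustrative two-proportion z-test on redirect rates; not the internal API.
from math import sqrt

from scipy.stats import norm


def redirect_rate_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in redirect rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))


# Illustrative numbers only: did variant B move the redirect rate significantly?
p_value = redirect_rate_test(conv_a=4_800, n_a=100_000, conv_b=5_050, n_b=100_000)
ship_it = p_value < 0.05
```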
Because of their cross-functional nature, they interact with far more teams than a typical data science squad. And since they are often solving urgent, visible problems, they receive a great amount of recognition.
We have an internal “high-5” system where employees can send thank-you notes and public appreciation to colleagues. It’s a great tool for celebrating contributions, but if we were to measure impact by sheer number of high-5s, the enablement team would outperform every other team by an order of magnitude.
Does that mean they are 10x more impactful than a team working on core ranking models?
Does that mean they should be prioritized for promotions over others who work on complex, but less visible, problems?
Of course not. But this is exactly what happens when we conflate recognition with impact.
So, how do we know whether an enablement team is actually performing well?
This doesn’t mean the enablement team isn’t valuable. Let me be clear: they are. Their role is to amplify the effectiveness of others. But measuring their impact solely by recognition would be like judging an engineer’s effectiveness by the number of Slack messages they send.
Instead, a better way of looking at contributions from enablement teams would be to:
Look at the multiplier effect: how much do they improve the quality of experimentation across the company?
Look at the tooling they ship: what did they build that helped teams move forward in new ways?
Look at how many green flag duties they have to cover: paradoxically, if the platform were perfect, the team wouldn’t get many urgent requests.
Next, let’s look at another case where traditional impact measurement falls short: slow-moving metrics that take years to show their full value.
Story #4: Slow metrics, where long-term impact is hard to measure
"If a team spends three years improving a system, but the results come in tiny increments, does that make their work any less valuable?"
Some of the most important work in a company happens slowly, bit by bit, over months or even years. If we are not careful, we risk undervaluing it simply because it doesn’t fit neatly into quarterly or annual reporting cycles.
The flight price calendar: A 3-year effort with an astonishing result
One of the most complex challenges in flight meta-search is accurately estimating flight prices over time. Imagine a traveler looking at our calendar view, where they can see estimated prices for flights across an entire month or even a full year.
Seems straightforward, right? I can tell you, it is not.
Flight prices change constantly, sometimes multiple times per day.
We don’t always have real-time data for every date, meaning we need to estimate missing prices.
Users expect accuracy. If they see a flight for £60 on the calendar but land on a page where it’s suddenly £80, they lose trust in our platform.
For years, our coverage accuracy metric hovered at 40%. Because of that, our calendar view was super sparse: we didn’t want to fill it with prices that we knew would be wrong 60% of the time.
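Here is a minimal sketch of what a coverage-accuracy style metric could look like, assuming a tolerance-based definition: the share of dates where the estimated price lands close enough to the price the traveler actually sees. The threshold and names are illustrative, not the exact metric.

```python
# Illustrative, simplified coverage-accuracy calculation; the tolerance-based
# definition and the 10% threshold are assumptions for this sketch.
def calendar_accuracy(estimates: dict[str, float],
                      actuals: dict[str, float],
                      tolerance: float = 0.10) -> float:
    """Share of dates where the estimate is within `tolerance` (relative)
    of the price the traveler actually lands on."""
    dates = estimates.keys() & actuals.keys()
    if not dates:
        return 0.0
    within = sum(
        abs(estimates[d] - actuals[d]) <= tolerance * actuals[d]
        for d in dates
    )
    return within / len(dates)


# Example: a £60 estimate against an actual price of £80 counts as a miss.
accuracy = calendar_accuracy({"2024-06-01": 60.0}, {"2024-06-01": 80.0})
```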
A dedicated team spent 3 years improving this system.
They developed new estimation techniques to increase accuracy.
They improved how we pull and refresh pricing data at scale.
They fine-tuned coverage algorithms to ensure fewer gaps in the calendar.
And after 3 years of continuous work, they raised the accuracy metric from 40% to 80%.
Why this work was at risk of being overlooked
This improvement happened bit by bit, quarter by quarter. No single release made a giant, eye-catching difference overnight. But when you zoom out, the impact is undeniable:
80% accuracy means a dramatically better user experience.
Users trust the calendar view more, increasing engagement and bookings.
The foundation is now in place for future improvements, making our pricing system far more scalable.
If we are not intentional, slow-moving improvements like this get ignored because:
We expect impact to be immediate. Quarterly results dominate decision-making, making multi-year investments seem “less urgent.”
It’s hard to tie improvements directly to revenue. Unlike an A/B test that increases conversions by 1%, this work supports many different downstream impacts.
Incremental gains are harder to “sell” internally. A team that says, “We improved accuracy from 40% to 43% this quarter” sounds less impressive than one that says, “We ran an experiment that made £500K.”
So, next time, ask yourself: are you looking at the full picture, not just quarterly snapshots? Are you recognising compounding improvements, even if they take years? And are you valuing long-term thinking as much as short-term wins?
Finally, let’s look at an even harder-to-measure form of impact: The Amplifier Effect.
Story #5: The amplifier effect
"What if the most valuable work doesn’t move a single metric—but makes everything better?"
We have explored cases where impact was measured incorrectly, undervalued, or too slow to be appreciated. But what about work that doesn’t have a direct metric at all? Work that multiplies the effectiveness of others, but is never attached to a single person’s performance review?
A Data Scientist who transformed engineering standards
One of the most impactful people on my team wasn’t a senior engineer or a principal scientist. She was a mid-level data scientist who saw a problem, took ownership, and quietly reshaped how we worked.
She noticed that our engineering standards for Git repos, data storage, and other core infrastructure were inconsistent.
Instead of waiting for someone else to fix it, she took the initiative to improve them.
She reached out to senior engineers for guidance, learning how to implement best practices properly.
She built a new, standardized approach that the team agreed was the right way forward.
The improvements weren’t tied to a specific revenue goal. They weren’t even officially on the roadmap. But they made everything smoother, faster, and more maintainable.
Then came the most important part. Instead of moving on to another project, she focused on spreading what she had learned.
She trained 3 other teammates on the new framework.
She ensured no single person was a bottleneck for this knowledge.
She created a standard approach that future projects could build on.
The problem with traditional impact measurement
If you judged her performance by standard metrics, her work didn’t show up anywhere.
She didn’t build a new model.
She didn’t deliver a direct revenue increase.
She didn’t improve an existing KPI.
But what she did do was de-risk future projects, improve team efficiency, and level up the entire organization. As part of our competency framework, she nailed the “expertise” and “build it right” sections for her level, and she was rewarded accordingly with a great bonus and new challenges that set her on a path to promotion.
Rethinking how we measure impact
"Not all impact is measurable. And not all measurable impact is equal."
I hope you have enjoyed these 5 stories, which personally challenged me to change the way I was defining impact. To summarise them:
Impact isn’t just about revenue. Two £1M projects can have vastly different strategic value.
Visibility doesn’t equal value. The loudest contributions aren’t always the most important.
Slow-moving progress matters. Some of the most critical work takes years to show results.
Amplifiers shape the future. The best contributors don’t just create impact—they enable it at scale.
If we only reward what is easiest to measure, we risk deprioritizing foundational work, discouraging long-term thinking, and failing to recognize those who make the entire system better.
Enjoyed the article? Hit the ❤️ button and share it with others who might find it helpful. Subscribe to support my work and stay updated on future issues!