Your ML Pipeline Meets the EU AI Act

On 7 June 2026, I gave my talk Your ML Pipeline Meets the EU AI Act in Grand Hall 1 at PyData London 2026. The talk resources are online: interactive slides, the companion notebook, an Evidently drift report, and the evidence-readiness checklist as PDF and Markdown.

I did not want to turn this into a standard conference recap. The more useful question is practical: What does the EU AI Act mean for an ordinary ML pipeline that still has to be trained, deployed, and monitored next week?

Speaker card: Gabriel Lipnik – Your ML Pipeline Meets the EU AI Act, PyData London 2026, 7 June 10:15 BST — The speaker card from the programme. From here on, the post stays with the pipeline.

The Question That Sticks

I opened the talk with a small show-of-hands exercise: Who has a model in production that makes a decision about a person? A loan, a job application, an insurance quote, a fraud flag, something in that family. Then the follow-up: Who could still explain today why one specific decision six months ago came out the way it did?

This is where the room usually gets quieter. Not because the teams are doing bad work. More often, because our systems are excellent at serving requests and much worse at reconstructing decisions later.

The old operational question was: Do we have logs? Did the service respond, did it error, what was the latency? For high-risk systems, that is not enough. The harder question is: Can we reconstruct why? Article 12 of the EU AI Act is easy to read as a logging topic. In practice, it feels more like evidence architecture: not just recording what happened, but preserving enough context for someone else to understand it later.

Why Credit Scoring?

The talk follows a synthetic credit-risk-scoring example. It is not an exotic setup: collect data, build features, train a model, evaluate it, deploy it, monitor it. That is why I like it for this walkthrough. Many ML pipelines have the same shape, even when they are not deciding loans.

Credit scoring is also a place where the AI Act quickly becomes concrete. As of June 2026, the timeline is moving: After the political agreement on the AI Omnibus, rules for certain high-risk areas are set to apply from 2 December 2027, and product-integrated high-risk systems from 2 August 2028. The Council of the EU gives the same dates in its press release on the provisional agreement. I would not anchor too much on the exact deadline, though. The deadline may move; the gaps in the pipeline remain.

Evidence, Not Just Logs

I prefer starting with four plain questions before getting lost in articles and obligations:

Lineage: What data shaped this decision?
Tracking: Which model, parameters, metrics, and environment belong to this release?
Explainability: Why did this specific result happen for this specific person?
Monitoring: Is the system still behaving the way it was signed off?

Most teams have partial answers somewhere. A bit is in Git, a bit in MLflow, a bit in a ticket, a bit in a notebook, and a slightly worrying amount in people’s heads. That is usually the work: not buying a new tool first, but connecting the traces you already have so a future reviewer is not guessing.

Lineage Does Not End at the Dataset

A familiar uncomfortable moment: The model drifts in production, someone wants to compare it with the training state, and suddenly nobody is sure which data was actually used. Not roughly. Not “that table from back then”. Exactly.

DVC, LakeFS, Pachyderm, or MLflow Datasets are useful here because they make data states versionable. But that only gets you partway. dvc checkout reconstructs data; it does not reconstruct a decision. To reconstruct an actual decision, four things belong together:

the data hash,
the Git SHA of the code,
the locked environment, such as uv.lock or requirements.txt,
the training and decision parameters.

The last part is easy to underestimate. Same data, same code, but a different scikit-learn version, and the decision boundary can move. The environment is not a footnote. It is part of the evidence.

Explanations Belong to the Run

Another less dramatic but equally dangerous case: Somewhere in the code there is a threshold of 0.4. Nobody on the current team remembers exactly why. The engineer who set it left two years ago. Still, that number helps decide every day who enters human review and who does not.

This is where MLflow is a handy anchor for me. A training run can be more than a place for metrics. It can keep the model, parameters, SHAP artifacts, model-card information, and lineage hints in one place. The explanation is no longer a loose screenshot in a notebook; it travels with the model.

SHAP waterfall plot from the PyData London 2026 credit scoring demo — The SHAP waterfall from the demo. Helpful, but not the whole explanation.

A SHAP plot is not compliance, of course. It is one piece of evidence. And even SHAP deserves care: Path-dependent and interventional SHAP answer different questions. For a regulated, appealable credit decision, I would defend interventional SHAP with a documented, versioned background dataset. More important than the exact choice is that the choice is deliberate and recorded.

The same applies to the threshold. “0.4 because it is in the code” should turn into something you can explain: calibration, cost matrix, fairness review, sign-off. Then the number is no longer magic. It is a decision with a paper trail.

Monitoring Is More Than Uptime

The third point is almost boring until it matters. The dashboard is green: Latency is fine, error rate is fine, uptime is fine. The model is “up”. But that says very little about whether it is still the model you signed off.

Drift report slide from the PyData London 2026 demo showing four changed features and an average score drop — The drift slide from the demo: 4 of 10 features moved, and the average score dropped from 684 to 644.

Evidently, NannyML, WhyLabs, or a custom Prometheus setup can make feature drift visible. The choice of tool is less important than the habit: Something has to watch decisions, not just infrastructure.

Feature drift is only the visible part. Outcome drift is where it gets practically and legally interesting. Are approval rates diverging between groups? Is the human override rate falling so low that human oversight has become rubber-stamping? A human who never disagrees is not oversight. They are part of the latency.

The Small Monday List

The evidence-readiness checklist from the talk is intentionally boring. It does not promise that a team can be “compliant” by Monday morning. It is there to help find the three most obvious gaps.

I would start bluntly: Is the risk classification documented? Can I reconstruct a training run with data, code, environment, and parameters? Are explanations stored as model artifacts? Do we have drift monitoring, fairness metrics, and a clear incident path?

That sounds less like grand regulation and more like good MLOps hygiene. Exactly. The AI Act does not invent many of these practices for high-risk systems; it makes them harder to treat as optional.

What Remains

The point I keep after the talk is simple: The EU AI Act does not force ML teams to reinvent their entire toolchain. It forces them to make the things they already claim provable.

Which data? Which code? Which model? Which explanation? Which release decision? Which production behaviour? Once those questions are answerable, regulation starts to feel less like something bolted onto engineering and more like a pretty good description of the system you would want anyway.

The old question was: Do we have logs? The better question remains: Can we reconstruct why?

All talk resources are here: slides, notebook, drift report, checklist, and tool links.

Your ML Pipeline Meets the EU AI Act

The Question That Sticks

Why Credit Scoring?

Evidence, Not Just Logs

Lineage Does Not End at the Dataset

Explanations Belong to the Run

Monitoring Is More Than Uptime

The Small Monday List

What Remains

Recent Posts

AI Agents in Production: Who Controls the Runtime?

Reduce LLM Calls with Vector Similarity Search

RAISE 2025: Between AI Sovereignty, Super Agents and Sweet Macarons in Paris