Third at the EuroTech Hong Kong Hackathon: A Wafer Defect Detector in 24 Hours

The EuroTech Hong Kong Hackathon happened in Munich, which is the first slightly confusing thing about it. The EuroTech Federation pulls students from a handful of European technical universities together, hands them a broad industrial brief, and gives them about a day to do something with it. The specific problem was ours to pick, and we landed on semiconductor inspection. None of us had really touched wafer imaging before that weekend.

Four of us teamed up, all from TUM, and by the end of roughly 24 hours we had a working defect-detection workbench and a third-place finish in the AI & Robotics track. This is the honest version of what we built, including the parts that were proper engineering and the parts that were duct tape.

I am going to go deeper than the usual hackathon recap, because the interesting part is not the trophy, it is the constraint we were boxed into and the slightly sideways way we got around it. If you have ever wondered how you do machine learning when nobody will hand you a labeled training set, this is a decent worked example.

The constraint that makes this hard

If you have never thought about semiconductor inspection, the obvious move is to train a classifier: feed it a pile of defective wafers and a pile of clean ones, let it learn the difference. The problem is getting that pile of labeled defects in the first place. For silicon carbide especially, the good annotated datasets are commercial and almost never released publicly, so you cannot just download them. We felt that the hard way: the one solid set of annotated SiC defect wafers we found was locked behind a university network so slow it kept timing out, and several other datasets we went looking for had simply vanished, no longer hosted anywhere. Even where data exists, somebody still has to find each defect, mark exactly where it is, and confirm its type, and that labeling is slow, expensive, and needs an expert. So you end up with mountains of imagery and almost no labels to learn from.

It gets worse than just scarcity. The defects you do have are wildly uneven. Some failure modes are common enough that you have seen a handful; others you might have one example of, or none, because they are new to that process line. A classifier trained only on the defects you happened to collect is quietly blind to a new failure mode the moment it shows up. That is a bad property for the thing standing between a flaw and a shipped chip.

So the framing has to change. With too few labeled examples to train a normal supervised model, you have to lean on what you do have, which is a decent sense of what a good wafer looks like. Clean wafers are abundant; that is the whole point of a working line. The interesting question stops being "what does every defect look like" and becomes "how far is this from normal". That reframing is the entire project.

Few-shot, and basically training-free

The workbench showing a wafer micrograph with a defect heatmap overlay and a dashed region-of-interest outline — The workbench overlaying a deviation heatmap on a wafer scan, with a region of interest the reviewer can drag around. The brighter the patch, the further it sits from normal.

The idea was Damià's, and the honest truth is none of us knew exactly how to pull it off when he pitched it. We got lucky: a computer-vision conference two days earlier had been full of papers on exactly this, so we had a fresh stack of references to crib from. The approach is what is called few-shot anomaly detection. Take a frozen DINOv2 encoder, one of Meta's pretrained vision models, and feed it a handful of clean reference images. DINOv2 was trained on a huge pile of general imagery, so it already knows how to turn a picture into a rich numeric representation where similar-looking things land near each other. We never retrain it. We just borrow that sense of visual structure.

From just those few "normal" examples you fit a compact model of what normal looks like in that representation, a kind of subspace the clean references all sit inside. Then you score everything else by how far it falls outside that subspace. A patch that looks like the references scores low; a scratch, a particle, a cracked region scores high because the encoder puts it somewhere the clean examples never go. Because the scoring is done patch by patch across the image, you do not just get a number, you get a heatmap showing where on the wafer the model is unhappy.

The genuinely nice property is that it sharpens as you feed it more reference images, without any training. On a public benchmark we used as a stand-in, the localization quality climbed steadily as the number of reference shots went up, from rough at a single example to sharp at four. That is a comfortable place to be operationally: you do not need a labeling project, you need a few more known-good wafers, which is the one thing a working line has plenty of. And there is no checkpoint to babysit, no gradient training, no overfitting to whichever defects you happened to collect. The only weights in the whole thing are the frozen encoder.

What the numbers actually said

Few-shot scoring is the clever part, but we also wanted at least one result on real target data rather than a proxy. The one solid annotated SiC set, the one stuck behind that crawling university network, finally came through after a long struggle, and on those real 4H-SiC photoluminescence scans the team fine-tuned a YOLO11 oriented-box detector. It pushed precision from 58.5% to 66.9% and mAP50-95 from 36.2% to 39.9% over the baseline.

Those are not headline numbers and I am not going to dress them up. They are modest, honest movement for a weekend, and importantly they were measured on a split grouped by acquisition session rather than a random one. That distinction matters more than it sounds: if tiles from the same scan leak across your train and test sets, your metrics flatter you and mean nothing. We took the harder, more honest split, and one defect class in particular stayed stubbornly weak.

We were also careful about what we did and did not have. Part of the live demo ran on a semiconductor SEM proxy dataset rather than validated SiC production data, and we wrote that down explicitly, in the repo, in a document about what was real versus a stand-in. Drawing that line was, I think, part of why the judges took it seriously. It is an industrial inspection track. The people evaluating you know the difference between a result and a vibe, and a team that marks its own limitations reads as one you could actually pilot something with.

From a score to a decision

A deviation score on its own is not useful to a tired engineer at the end of a shift. A raw number like 0.73 means nothing without context, so the last step is turning scores into a decision. We calibrated the scores against the distribution of clean references, so instead of an arbitrary threshold you get a sense of how unusual something is relative to known-good wafers, and we bucketed the result into three plain outcomes:

Pass. Sits comfortably inside normal. Nobody needs to look at it.
Review. Far enough from normal to be worth a human glance, with the heatmap pointing at exactly which region raised the flag.
Hold. Clearly off, pulled out for proper attention before it goes any further.

The point of the three buckets is to spend human attention where it actually pays off. Most wafers are fine, and a reviewer who has to eyeball all of them will miss the rare bad one out of sheer boredom. Sorting the obvious passes away from the genuinely suspicious is a small idea that turns a model output into something a person can act on, which is the difference between a demo and a tool.

Wrapping it in something you can actually use

A scoring function in a notebook does not win anyone over, so a big chunk of the weekend went into the workbench around it: a Next.js dashboard where you load a wafer, see the heatmap, get a score and a triage call, and crucially follow the trail back to the evidence behind every decision. Which reference images defined normal, which preprocessing ran, which model configuration produced the call. For an inspection tool that matters more than a pretty UI, because an engineer is not going to trust a number they cannot interrogate, and "the AI said so" is not an answer anyone signs off on.

My corner of it, and the people who carried it

The team sitting together in the venue, grinning at the camera — Somewhere in hour twenty, and still grinning.

I want to be straight about who did what, because it was a team and the ML depth was not mine. My share was the frontend and product side: building out the workbench, getting the demo to hang together under pressure, and keeping the repository in a shape we could actually present from. The deep modeling work, the DINOv2 and PCA pipeline, the validation, the real SiC detector, was Damià and Jakob together on the ML side. Sparsh shaped the pitch and the business framing that tied a pile of metrics into a story a judge could follow in two minutes.

Working with people who are genuinely better than you at their part of the problem is the best way to spend a weekend. You move faster, you cut your worst ideas earlier, and you absorb a domain by osmosis that would have taken weeks alone. A good chunk of what I now know about how chips are inspected I learned in twenty-four hours sitting next to someone explaining it while we both stared at a wafer scan that would not cooperate.

Twenty-four hours, and then a stage

The build itself was the usual hackathon arc, compressed. A confident plan in the first few hours, where everything seems obvious and the architecture is clean in your head. Then the long middle, the quiet panic where the data does not load the way you assumed, the model scores garbage, and the demo you sketched on a napkin is suddenly three integration bugs away from existing at all.

And then the stretch near the end where it comes together faster than it fell apart, and you remember why you keep doing this to yourself. We got the heatmaps rendering, the triage calls landing, the evidence trail clicking through, and a pitch that walked from the problem to the few-shot idea to the real numbers without overselling any of it. Two minutes on stage, a round of questions we had actually prepared for, and then the wait.

Third place in AI & Robotics. For a domain we had met that same weekend, against teams who clearly knew their corner of it cold, I will take that happily. Standing on a stage with a screen full of fireworks behind a grinning team is a very specific kind of good.

The team on stage in front of a screen reading 3rd Place at the EuroTech Hong Kong Hackathon — Third place in the AI & Robotics track. The fireworks on the screen were not ours, but we will borrow them.

What I took away

A few things stuck. The first is how completely foundation models have changed what a small team can do with almost no labeled data. A frozen encoder and a handful of clean images got us a usable defect signal in a domain we did not know, which would have been a research project, not a weekend, only a few years ago. The leverage these models hand you is genuinely strange once you feel it directly.

The second is that honesty is not the opposite of a good pitch. I went in half expecting that admitting our proxy data and our weak class would cost us, and the opposite happened. Being precise about what was real versus stand-in made the whole thing more convincing, because it signalled that the results we did claim were ones we actually trusted. Rigor reads as confidence, not weakness, to anyone who knows the field.

And the last one is just that a good team beats a clever idea. We had both this time, but if I had to pick, I would take the four people over the architecture every weekend. Thanks to the EuroTech Federation for putting it on, and to Damià, Jakob, and Sparsh for making twenty-four hours genuinely fun.