Show HN: Run TRELLIS.2 Image-to-3D generation natively on Apple Silicon
Posted by shivampkumar 1 day ago
I ported Microsoft's TRELLIS.2 (4B parameter image-to-3D model) to run on Apple Silicon via PyTorch MPS. The original requires CUDA with flash_attn, nvdiffrast, and custom sparse convolution kernels, none of which work on a Mac.
I replaced the CUDA-specific ops with pure-PyTorch alternatives: a gather-scatter sparse 3D convolution, SDPA attention for sparse transformers, and a Python-based mesh extraction replacing CUDA hashmap operations. Total changes are a few hundred lines across 9 files.
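The gather-scatter convolution works roughly like the sketch below (illustrative only, not the code in the repo; the function name and shapes are made up, and a real version vectorizes the coordinate lookup instead of looping in Python):

    import torch

    def sparse_conv3d(coords, feats, weight):
        # Illustrative sketch of a submanifold-style sparse 3D convolution.
        # coords: (N, 3) integer coordinates of active voxels
        # feats:  (N, C_in) features per active voxel
        # weight: (3, 3, 3, C_in, C_out) kernel
        n, c_out = feats.shape[0], weight.shape[-1]
        out = feats.new_zeros(n, c_out)

        # Plain Python dict stands in for the CUDA hashmap: coordinate -> row index.
        index = {tuple(c): i for i, c in enumerate(coords.tolist())}

        coord_list = coords.tolist()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    src, dst = [], []
                    for i, (x, y, z) in enumerate(coord_list):
                        j = index.get((x + dx, y + dy, z + dz))
                        if j is not None:
                            src.append(j)   # gather features from this neighbour...
                            dst.append(i)   # ...and accumulate into voxel i
                    if not src:
                        continue
                    w = weight[dx + 1, dy + 1, dz + 1]  # (C_in, C_out) slice for this offset
                    gathered = feats[torch.tensor(src, device=feats.device)] @ w
                    out.index_add_(0, torch.tensor(dst, device=feats.device), gathered)
        return out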
Generates ~400K-vertex meshes from single photos in about 3.5 minutes on an M4 Pro (24 GB). Not as fast as an H100 (where it takes seconds), but it works offline with no cloud dependency.
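The attention swap is the simplest part: flash_attn isn't available on MPS, but PyTorch's built-in scaled_dot_product_attention covers the same use. A minimal sketch (assuming flash_attn's (batch, seq, heads, head_dim) layout; the actual shim in the repo may differ):

    import torch.nn.functional as F

    def sdpa_attention(q, k, v):
        # flash_attn_func takes (batch, seq, heads, head_dim); SDPA expects
        # (batch, heads, seq, head_dim), so transpose in and back out.
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2)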
Comments
Comment by shivampkumar 1 day ago
From what I've seen personally, and from community benchmarks, it fares well on geometry and visual fidelity among open-source options, but I agree it's not perfect for every use case.
Meshy is solid, I used it to print my girlfriend a mini 3D model of her for her birthday last year!
Though it's worth noting Meshy is a paid service and its free tier has usage limits, while TRELLIS.2 is MIT-licensed with unlimited local generation. Different tradeoffs for different workflows. Hopefully the open-source side keeps improving.
Comment by pcoyne 20 hours ago
https://github.com/apple/ml-sharp
No matter what, it is cool seeing so many of them work on different devices.
Comment by petargyurov 1 day ago
Out of curiosity, how did you go about replacing the CUDA-specific ops? Any resources you relied on, or just experience? Would love to learn more.
Comment by shivampkumar 1 day ago
It IS significantly slower, about 3.5 minutes on my MacBook vs seconds on an H100. That's partly the pure-PyTorch backend overhead and partly just the hardware difference.
For my use case the tradeoff works -- iterate locally without paying for cloud GPUs or waiting in queues.
Comment by shivampkumar 1 day ago
I'm still working on this to try to replicate nvdiffrast better. Found an open-source port, might look at it tonight.
Comment by shivampkumar 1 day ago
If you're not working with 3D on Apple Silicon this isn't relevant to you. For the subset of people who are, running this 4B parameter 3D generation model locally on a Mac was previously blocked by hard CUDA dependencies with no workaround.