Kirsten Odendaal

When 2D Meets 3D: Rethinking Medical Segmentation Through MNet


A concise technical walkthrough of our MNet replication and architectural improvements


Modern medical imaging—MRI and CT in particular—often suffers from anisotropy: slices may be spaced far apart along the z-axis while in-plane resolution remains high. This mismatch creates a persistent challenge for 3D segmentation models. Pure 3D CNNs tend to overfit sparse depth information; pure 2D CNNs ignore valuable volumetric context.

MNet was proposed as a hybrid 2D/3D architecture that blends both perspectives at every stage of the network. In this project, we reproduced MNet from scratch, validated its core claims, and designed two lightweight extensions—Fusion Gating and VMamba—to push the architecture further.

Below is a brief look at what we reproduced, how we improved it, and what we learned.


Why MNet?

MNet introduces a mesh-like U-Net architecture with parallel 2D and 3D encoder-decoder branches. Instead of choosing between 2D and 3D convolutions, MNet blends both inside each block using manually defined fusion rules (add / subtract / concatenate).

According to the authors, this yields strong robustness to anisotropy across MRI and CT datasets.
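
To make the idea concrete, here is a minimal PyTorch sketch of one such hybrid block, assuming an in-plane (1×3×3) branch and a volumetric (3×3×3) branch merged by simple addition; the channel counts, normalization, and per-stage fusion rule are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Hybrid2D3DBlock(nn.Module):
    """Hypothetical sketch of an MNet-style block: a "2D" path (1x3x3 kernels,
    in-plane only) runs in parallel with a "3D" path (3x3x3 kernels), and the
    two streams are merged with a fixed rule (here: addition)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # In-plane ("2D") branch: no mixing along the z-axis.
        self.branch_2d = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )
        # Volumetric ("3D") branch: full 3x3x3 receptive field.
        self.branch_3d = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fixed fusion rule: element-wise addition of the two streams.
        return self.branch_2d(x) + self.branch_3d(x)
```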

Our goals were:

  1. Reproduce MNet and verify the original performance trends.
  2. Test its robustness under controlled z-spacing perturbations.
  3. Extend MNet by introducing:

    • Fusion Gating → learnable 2D–3D blending
    • VMamba → efficient long-range depth modeling via state-space modules

Experimental Setup

We used nnU-Net v1 pipelines for preprocessing, augmentation, and patch sampling—matching the original implementation as closely as possible. Datasets: PROMISE12 (prostate MRI) and LiTS (liver CT).

Training was performed on 16-GB GPUs (Lightning.ai), so long schedules (500 epochs) were not feasible. Instead, we trained for 150 epochs and verified that training on PROMISE12 converges by epoch ~100–120.


Reproducing MNet

Our reproduction closely matched the original paper’s reported numbers:

PROMISE12

LiTS

Conclusion: We successfully reproduced MNet within acceptable reproducibility margins (±2–3 Dice), confirming the architecture’s validity.
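
For context, the Dice figures quoted here and below are the standard overlap metric between predicted and ground-truth masks; a minimal NumPy version looks like this:

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks (arrays of 0s and 1s)."""
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)
    intersection = (pred * target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))
```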


Extension 1: Fusion Gating

The original MNet uses fixed operations to merge 2D and 3D feature streams. But fixed rules cannot adapt to variations in local anatomy, noise, or slice spacing.

We replaced them with learnable gates that estimate how much to “trust” the 2D stream versus the 3D stream at each location.

Gating performs soft interpolation:

\[ y = g \odot x_{2D} + (1 - g) \odot x_{3D} \]
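
A minimal sketch of such a gate, assuming g is predicted by a 1×1×1 convolution over the concatenated 2D and 3D feature maps (the exact gate design in our experiments may differ):

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Learnable 2D/3D fusion: g = sigmoid(f([x2d, x3d])), y = g*x2d + (1-g)*x3d.

    Sketch only; the 1x1x1 convolution used to predict the gate is an assumption.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.gate_conv = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, x2d: torch.Tensor, x3d: torch.Tensor) -> torch.Tensor:
        # Per-voxel, per-channel gate in [0, 1].
        g = torch.sigmoid(self.gate_conv(torch.cat([x2d, x3d], dim=1)))
        # Soft interpolation between the two feature streams.
        return g * x2d + (1.0 - g) * x3d
```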

Result:


Extension 2: VMamba for Depth Modeling

3D convolutions struggle with long-range dependencies when z-resolution is coarse. Attention-based models help, but are expensive in 3D.

We integrated VMamba, a state-space model that performs efficient scanning along the z-axis.
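
The integration pattern can be sketched as follows: flatten each in-plane location into its own depth sequence, run a sequence model along z, and reshape back. The nn.GRU below is only a runnable stand-in for the actual state-space (Mamba-style) module, used here to avoid depending on a specific SSM package.

```python
import torch
import torch.nn as nn

class DepthSequenceMixer(nn.Module):
    """Treat the z-axis of a (B, C, D, H, W) feature map as a sequence.

    A state-space block would replace the nn.GRU stand-in used here;
    the GRU is only a runnable placeholder for illustration.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.seq_model = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = x.shape
        # (B, C, D, H, W) -> (B*H*W, D, C): one depth sequence per spatial location.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, d, c)
        seq, _ = self.seq_model(seq)
        # Reshape back to the original volumetric layout.
        return seq.reshape(b, h, w, d, c).permute(0, 4, 3, 1, 2)
```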

Results:


Results Summary

✔ Reproduction Success

Our scores fell within ±2–3 Dice of the paper’s reported numbers on both PROMISE12 and LiTS.

✔ Extensions Help

Both Fusion Gating and VMamba improved on the baseline reproduction while keeping the model lightweight.

✔ Robust to Anisotropy

Across PROMISE12 (1–4 mm slice spacing), Dice dropped only ~1.5 points for the extended variants, compared with the baseline’s ~2–3 point drop.
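
As an illustration of how such a perturbation can be produced, a volume can be resampled to a coarser slice spacing before evaluation; this sketch uses SciPy's zoom and assumes a (D, H, W) NumPy volume with known z-spacing (the actual perturbation pipeline may differ):

```python
import numpy as np
from scipy.ndimage import zoom

def resample_z(volume: np.ndarray, spacing_z: float, target_z: float) -> np.ndarray:
    """Resample a (D, H, W) volume from spacing_z to target_z along depth only.

    Hypothetical helper for simulating coarser slice spacing (e.g. 1 mm -> 4 mm);
    order=1 gives linear interpolation, a common choice for image intensities.
    """
    factor = spacing_z / target_z  # < 1 when making the volume coarser
    return zoom(volume, zoom=(factor, 1.0, 1.0), order=1)

# Example: degrade a 1 mm volume to 4 mm slice spacing before evaluation.
vol = np.random.rand(64, 256, 256).astype(np.float32)
coarse = resample_z(vol, spacing_z=1.0, target_z=4.0)
print(coarse.shape)  # roughly (16, 256, 256)
```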


Key Takeaways

  • Hybrid 2D/3D blending is an effective answer to anisotropic medical volumes.
  • Learnable fusion gates adapt better than fixed add / subtract / concatenate rules.
  • State-space depth modeling provides long-range z-axis context far more cheaply than 3D attention.


Limitations & Future Directions

Due to compute limitations, LiTS results were based on a 50-case subset and 150-epoch schedules. With full dataset access and extended training, we expect larger gains—especially for VMamba.

Promising next steps:

  • Training on the full LiTS dataset with the original 500-epoch schedule.
  • Further evaluating the VMamba depth module, where we expect the largest additional gains.


Closing Thoughts

This project shows that MNet remains a strong and elegant baseline for anisotropic medical segmentation—but also that small, thoughtful architectural tweaks can further improve robustness and generalization.

By adding adaptive fusion and efficient global depth modeling, we demonstrate meaningful gains without sacrificing MNet’s lightweight nature.

