Final Project: Multimodal Classification Challenge

You receive multimodal features and raw text from an undisclosed source domain and must predict 18 anonymized labels per example. Previous coursework assignments prescribed the methods; here you decide the methodology yourselves. The leaderboard ranks teams; how you handle test-time modality dropout and the few-shot regime is the design question being graded.
CS3317: Artificial Intelligence · 40% of the grade · Multi-label · Multimodal · F1-macro · Teams of 1–3 · 4 weeks
  • Group registration: May 29, 23:59 (Friday · register on the leaderboard · gates Canvas team creation)
  • Leaderboard closes: Jun 12, 20:00 (Friday · Beijing time · no submissions after)
  • Code + report + slides: Jun 14, 23:59 (Sunday · single zip on Canvas)
  • Worth: 40% of the course grade · 100 points + 10% presentation
  • Metric · team size: F1-macro · Teams of 1–3 · 5 valid subs/day · 50 total
Required action · by May 29

Form your team and register on the leaderboard by May 29, 23:59

Teams are 1–3 students. Once you register on the leaderboard, the instructor creates your Canvas team accordingly — only then can you upload the final zip on Canvas. Don't wait until June — late registrations may miss the Canvas-team window.

Required reading

Presentation requirements — read before the last class

All team members speak. Slides are in English; oral delivery is in Chinese or English. The talk is 8 minutes plus open Q&A and counts as 10% of the course grade (separate from the project's 100 pts).


The methodological twist: read this first

At test time, ~30% of examples will have one randomly-selected modality replaced with zeros, indicated by the mask. Training data is clean — consider how to bridge this train/test distribution gap.

A baseline that just concatenates features and runs an MLP will work on intact examples but degrade sharply on dropout examples, because it never saw zero rows during training. What you do about this is the project. Some sensible responses (none required, all in scope): train-time modality dropout simulation, per-modality classifiers with mask-aware late fusion, learned modality gates, or imputation models for the missing modality.
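As one illustration of the first option, train-time modality dropout simulation, here is a minimal NumPy sketch. It is not the reference solution. The function name, the 50/50 choice between dropping image and text, and the default p_drop=0.3 are assumptions picked to mirror the test-time description; the returned masks use the same has_image / has_text meaning as the test mask.

```python
import numpy as np

def simulate_modality_dropout(image, text_clip, text_bert, p_drop=0.3, rng=None):
    """Zero one randomly chosen modality for roughly a p_drop fraction of rows.

    Dropping "text" zeroes BOTH text feature blocks, as at test time.
    Returns copies of the three feature arrays plus has_image / has_text masks.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = image.shape[0]
    image, text_clip, text_bert = image.copy(), text_clip.copy(), text_bert.copy()

    dropped = rng.random(n) < p_drop               # rows that lose a modality
    # ASSUMPTION: image and text are equally likely to be the dropped modality.
    drop_image = dropped & (rng.random(n) < 0.5)
    drop_text = dropped & ~drop_image

    image[drop_image] = 0.0
    text_clip[drop_text] = 0.0
    text_bert[drop_text] = 0.0

    has_image = (~drop_image).astype(np.float32)
    has_text = (~drop_text).astype(np.float32)
    return image, text_clip, text_bert, has_image, has_text
```

Calling this once per epoch (rather than once up front) exposes the model to different dropout patterns over training, and the two masks can be passed to the model as extra inputs. Whether that helps on this data is something to verify in your own ablations.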

1. Downloads & submission

  • Coursework handout (PDF) — 6 pages with figures.
  • Data + starter bundle (.tar.gz, ~80 MB):
    • data/ — all CSV and NPZ files live here:
      • data/train.csv, data/val.csv, data/unlabeled.csv, data/test.csv — example_id, text_raw, labels or mask
      • data/{split}_image.npz, data/{split}_text_CLIP.npz, data/{split}_text_BERT.npz — pooled 768-d features
    • starter/load_data.py, evaluate.py, baseline_concat_mlp.py, submission_template.csv
    • tests/sanity_checks.py — verifies the bundle and a submission CSV
    • docs/report_template.tex & docs/report_template.docx — NeurIPS-style report templates
  • Leaderboard: http://taohuang.info:3317 — register your team, upload submissions, see your rank. Register by May 29, 23:59 so the instructor can create your Canvas team in time.
  • Canvas: a single zip with code, report.pdf, and slides.pdf — submitted after the leaderboard closes, by Jun 14 23:59. The leaderboard handles the test submission; Canvas handles the write-up. You can only upload the Canvas zip once the instructor has created your Canvas team from your leaderboard registration.

Keep exploring — after the deadline

The official leaderboard is closed, but if you'd like to keep working on the problem you can still evaluate new ideas — offline or against the hidden test.

  • Self-evaluation kit: selfeval_kit.tar.gz — a small numpy-only kit that now includes the hidden-test answer key, so you can reproduce your exact leaderboard score offline (python score.py --truth test_labels.csv --pred your_submission.csv), with the per-mask-subset breakdown. Also scores against val and can build a hidden-test-like local proxy. See the kit's README.md. (The labels are anonymized 18-class vectors — the genre mapping and movie identities stay private.)
  • Practice sandbox leaderboard: http://taohuang.info:13317 — scores against the same hidden test as the official board, so you can see your real F1. It is a practice sandbox, separate from the official (closed) board; results do not affect any grade. Register with your enrolled student ID and upload a submission CSV.

2. Learning objectives

  • Reason about multimodal fusion and choose between early, late, and gated fusion architectures based on the task structure.
  • Design for a known train/test distribution gap; quantify your method's robustness via stratified ablations.
  • Use a small labeled set, a large unlabeled pool, and (optionally) LLM APIs as engineering components — while disclosing all of them.
  • Communicate methodology in a NeurIPS-style report and an 8-minute talk + open Q&A.

3. The task & data

You receive four splits of the same underlying dataset. Every example has the same per-modality content: an image-feature vector, two text-feature vectors (CLIP- and BERT-encoded), and the raw text the features were computed from. Labels are multi-hot vectors of length 18; classes are anonymized integers label_0 through label_17.

Splits

Split | Size | Labels? | Test-time modality dropout?
train | 2,000 | yes | no (clean)
val | 300 | yes | no (clean)
unlabeled | 5,000 | no | no (clean)
test | 2,000 | no (graded online) | yes: 30% have one modality zeroed

CSV schemas

File | Columns
data/train.csv, data/val.csv | example_id, text_raw, label_0, …, label_17
data/unlabeled.csv | example_id, text_raw
data/test.csv | example_id, text_raw, has_image, has_text

Row i of every .npz file for a split corresponds to the i-th data row of {split}.csv. The CSV's example_id column is the canonical mapping.
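The row-alignment rule can be checked at load time. The sketch below is an illustration only; starter/load_data.py remains the supported loader. The function name load_split is hypothetical, and because this page does not state the array key inside each .npz file, the code assumes each file holds a single array and takes its first key.

```python
import csv
import numpy as np

def load_split(csv_path, npz_paths):
    """Load one split: CSV rows plus feature matrices aligned by row index.

    npz_paths maps a name of your choice (e.g. "image") to an .npz file path.
    ASSUMPTION: each .npz stores one array; its key name is not documented
    here, so the first key is used. Check starter/load_data.py for the real one.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))   # row i of the CSV, header excluded

    features = {}
    for name, path in npz_paths.items():
        with np.load(path) as npz:
            arr = npz[npz.files[0]]
        # Row i of the features must describe CSV data row i.
        if arr.shape[0] != len(rows):
            raise ValueError(
                f"{name}: {arr.shape[0]} feature rows vs {len(rows)} CSV rows"
            )
        features[name] = arr
    return rows, features
```

A call such as load_split("data/train.csv", {"image": "data/train_image.npz"}) then gives rows[i]["example_id"] and features["image"][i] for the same example.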

Test-time dropout details

When has_image=0, the row in data/test_image.npz is all zeros. When has_text=0, BOTH text feature files have zero rows there and text_raw is the empty string. The mask is provided so you can identify these examples; the zeroing is already applied.
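Because the mask is given, a late-fusion model can simply ignore the branch whose input was zeroed. The following is a minimal sketch of mask-aware late fusion under stated assumptions: p_image and p_text are the outputs of two hypothetical per-modality classifiers that you would train yourself, and w_image is an illustrative blend weight to tune on val. It is one possible design, not the prescribed one.

```python
import numpy as np

def mask_aware_late_fusion(p_image, p_text, has_image, has_text, w_image=0.5):
    """Blend per-modality probabilities, skipping a modality the mask marks missing.

    p_image, p_text: (n, 18) probabilities from hypothetical per-modality models.
    has_image, has_text: (n,) 0/1 masks with the meaning used in data/test.csv.
    w_image: image-branch weight when both modalities are present (assumption).
    """
    if not 0.0 < w_image < 1.0:
        raise ValueError("w_image must lie strictly between 0 and 1")
    wi = w_image * np.asarray(has_image, dtype=float)
    wt = (1.0 - w_image) * np.asarray(has_text, dtype=float)
    total = wi + wt
    if np.any(total == 0):
        # Not expected per the handout: only one modality is dropped per example.
        raise ValueError("a row has no modality available")
    wi, wt = wi / total, wt / total   # renormalize over the modalities present
    return wi[:, None] * np.asarray(p_image) + wt[:, None] * np.asarray(p_text)
```

On an intact row the output is the weighted average of both branches; on a dropout row it falls back entirely to the surviving branch, so the zeroed features never influence the prediction.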

4. Setup & first run

tar -xzf student_bundle.tar.gz
cd student_bundle/
pip install numpy torch tqdm
python -m starter.baseline_concat_mlp     # ~1 minute, writes submission.csv
python -m tests.sanity_checks --submission submission.csv

The baseline is the floor you must beat — it is the simplest concat-MLP and does not know about test-time modality dropout. The remaining four weeks are about beating it.

5. Compute: SJTU HPC (交我算)

A teaching HPC account on SJTU 交我算 has been requested for every student in the course. Using it is optional, not required — the project is small enough to run on a laptop, but if you'd like a GPU, this is your option.


Your account

The instructor has sent each student a personal username and password via Canvas direct message (私信). If you did not receive the message, email the instructor at t.huang@sjtu.edu.cn.

Teaching-account quotas & rules (from the HPC operations team)

To avoid wasting shared compute resources, the HPC operations team applies these limits to every teaching account:

  • At most 1 concurrent job, 1 CPU node, 1 GPU card, and a 12-hour wall-clock limit per job.
  • Prefer the π2.0 cluster. The login node is pilogin; do NOT run jobs on the login node, or your account will be banned.
  • Use the cpu queue for CPU jobs and the dgx2 queue for GPU jobs (32 GB of memory per card).
  • GPU resources are currently tight and jobs may have to queue, so plan your job submission times accordingly.

6. Submission format & leaderboard

Submit a CSV with exactly this header and 2,000 data rows (one per test example, in any order):

example_id,label_0,label_1,...,label_17
ex_00123,0.81,0.04,...,0.03
...
  • Values may be probabilities in [0,1] or binaries in {0,1}; the leaderboard thresholds probabilities at 0.5 by default.
  • F1-macro is the leaderboard metric. F1-micro is reported but does not affect rank.
  • Each team is anonymized on the public leaderboard — only your group nickname is shown.
  • 5 submissions per day per team, 50 total over the four weeks. Format-error submissions that fail to score don't count against your quota.
  • Scoring is immediate. Rank by best F1-macro; ties broken by whoever reached the score first.
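Since every scored upload counts against the quota, it can help to write the file through a small checked helper. The sketch below is an assumption-laden illustration: write_submission is a hypothetical name, and the checks cover only what this section states (header, 18 label columns, values in [0,1], unique IDs). tests/sanity_checks.py remains the authoritative validator, including the 2,000-row requirement.

```python
import csv
import numpy as np

NUM_LABELS = 18

def write_submission(path, example_ids, probs):
    """Write example_id,label_0,...,label_17 rows after basic format checks."""
    probs = np.asarray(probs, dtype=float)
    if probs.shape != (len(example_ids), NUM_LABELS):
        raise ValueError(f"expected shape ({len(example_ids)}, {NUM_LABELS}), got {probs.shape}")
    if len(set(example_ids)) != len(example_ids):
        raise ValueError("duplicate example_id values")
    # NaN fails both comparisons, so it is rejected here as well.
    if not np.all((probs >= 0.0) & (probs <= 1.0)):
        raise ValueError("values must be probabilities in [0, 1] or binaries in {0, 1}")

    header = ["example_id"] + [f"label_{j}" for j in range(NUM_LABELS)]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for ex_id, row in zip(example_ids, probs):
            # repr() keeps full precision, so values near 0.5 are not rounded across it.
            writer.writerow([ex_id] + [repr(float(v)) for v in row])
```

If you tune your own per-class thresholds on val, one option is to apply them yourself and submit the resulting 0/1 values, which the 0.5 default then leaves unchanged.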

7. Timeline

Week 1
Bundle released. Register your team on the leaderboard. Run the baseline to get a first leaderboard score so you have something to beat.
Week 2
Public leaderboard fully open. Iterate on your method — try ablations, missing-modality handling, unlabeled-pool tricks.
May 29
23:59 — group registration deadline. Form your team and register on the leaderboard by this date so the instructor can create your Canvas team in time for the final-zip submission.
Week 3
Open Q&A on Canvas. No formal checkpoint — keep iterating.
Jun 12
20:00 — leaderboard closes. Your best F1-macro by then is your final leaderboard rank.
Jun 14
23:59 — single zip due on Canvas containing code, report.pdf, and slides.pdf.
Last class
8-minute presentation + open Q&A. All team members speak, slides in English, oral delivery in Chinese or English. See the presentation requirements page for the full rules. Graded separately as the 10% course-level presentation component.

8. Hard constraints

  • Total model parameters ≤ 100M at inference time, including any pretrained weights loaded into memory. This rules out fine-tuning a large pretrained backbone; it does not rule out using one via API (see below).
  • No external datasets beyond what's provided in the bundle.
  • No attempts to deanonymize the source domain. Detected attempts are an automatic fail. The features are pre-computed; you do not need to know the source.
  • LLM API calls are allowed (paid or free), but must be disclosed in your report: which provider/model, where in your pipeline, approximate cost (CNY). Pure prompting alone will be marked low on Method Design.
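To get a feel for the 100M-parameter budget, the arithmetic for a fully connected network is easy to check by hand. The sketch below counts weights and biases for a hypothetical concat-MLP; the 512-unit hidden layer is an assumption for illustration, not the starter baseline's actual size.

```python
PARAM_BUDGET = 100_000_000

def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected MLP with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical model: three 768-d feature blocks -> 512 hidden units -> 18 labels.
n_params = mlp_param_count([3 * 768, 512, 18])
print(n_params, n_params <= PARAM_BUDGET)   # 1189394 True
```

For a PyTorch model, the equivalent count is sum(p.numel() for p in model.parameters()). Remember that the limit covers everything loaded at inference time, so any pretrained weights you load must be added to this total.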

9. What to put in the report

Max 8 pages, NeurIPS-style. Use docs/report_template.tex or docs/report_template.docx. The grading rubric weighs each section explicitly — the items below are required.

  1. Method Choice Rationale (one paragraph): why this method, what was rejected and why.
  2. Approach: pipeline figure; how you handle missing modalities; how you used the unlabeled pool (if at all); LLM/API usage disclosure.
  3. Empirical Analysis: per-modality ablation (image only / text only / both) and per-subset F1 on test stratified by the mask (intact / image-dropped / text-dropped).
  4. Resources Used: compute hours, hardware, API providers/models, approximate cost in CNY.
  5. AI usage: itemize every AI tool and exactly what each accomplished (see Academic Integrity below).
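The per-subset breakdown in item 3 can be sketched as below. Test labels are not in the bundle, so during the competition this would typically be run on val with dropout that you simulate yourself. The function names are hypothetical, and two conventions are assumptions rather than documented leaderboard behaviour: probabilities are binarized with >= 0.5, and a class with no positives and no predictions in a subset scores 0. The provided evaluate.py is the authority on the official metric.

```python
import numpy as np

def f1_macro(y_true, y_pred):
    """Macro F1 over label columns for binary multi-label arrays of shape (n, k)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    denom = 2 * tp + fp + fn
    # ASSUMPTION: a class with no positives and no predictions gets F1 = 0.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())

def f1_by_mask_subset(y_true, probs, has_image, has_text, threshold=0.5):
    """F1-macro on the intact / image-dropped / text-dropped subsets."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(probs) >= threshold   # ASSUMPTION: >= rather than >
    has_image = np.asarray(has_image).astype(bool)
    has_text = np.asarray(has_text).astype(bool)
    subsets = {
        "intact": has_image & has_text,
        "image-dropped": ~has_image & has_text,
        "text-dropped": has_image & ~has_text,
    }
    # Empty subsets are skipped rather than reported as 0.
    return {name: f1_macro(y_true[idx], y_pred[idx])
            for name, idx in subsets.items() if idx.any()}
```

Reporting the three numbers side by side makes it visible whether a gain on the overall score comes from the intact examples or from better handling of the dropout cases.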

10. Grading rubric (100 points)

Criterion | What is assessed | Points
Method design & implementation | Architectural choices, training procedure, code quality. | 25/100
Empirical analysis | Ablations, per-modality results, missing-modality robustness. | 25/100
Report quality | Clarity, figures, scientific reasoning. | 20/100
Final leaderboard tier | Top 10% = 15 · top 30% = 12 · top 50% = 9 · any valid submission = 6 · none = 0. | 15/100
Method Choice Rationale | Defending what you chose and what you rejected, in writing. | 10/100
Reproducibility | Code runs, seeds fixed, results match the report's numbers. | 5/100
Total | | 100

The 8-minute presentation + Q&A in the last class is graded separately as the 10% course-level presentation component; it is not part of these 100 project points.

11. Submission checklist

  • Registered your team on the leaderboard (one account per team).
  • sanity_checks.py passes for both the bundle and your final submission CSV.
  • Final submission uploaded to the leaderboard before Jun 12, 20:00.
  • Single zip uploaded to Canvas before Jun 14, 23:59 containing:
    • code/ with a runnable README and pinned random seeds
    • report.pdf — max 8 pages, all required sections including AI usage
    • slides.pdf — presentation slides
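For the pinned-seeds item, a common pattern is a single helper called at the top of every entry point. This is a minimal sketch, with set_seed as a hypothetical name; it seeds Python and NumPy, and PyTorch only if it is installed.

```python
import random
import numpy as np

def set_seed(seed):
    """Seed the Python and NumPy RNGs, plus PyTorch when it is available."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return   # NumPy-only pipeline: nothing more to seed
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```

Fixed seeds make runs repeatable on the same setup, but some GPU operations can still be nondeterministic, so it is worth re-running your final pipeline once to confirm that it reproduces the numbers in the report.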

Academic integrity

  • Discussing high-level ideas with other teams is allowed. Sharing code, submissions, model checkpoints, or labels is not.
  • Your final submission must come from your own implementation.
  • Any attempt to deanonymize the source domain or recover test labels (e.g. reverse-searching the raw text on the web for the purpose of identification) is a violation and an automatic fail for the project.

AI usage disclosure (required)

AI tools — LLM APIs in your pipeline, coding assistants (Cursor, Claude Code, GitHub Copilot, …), report-writing assistants, autonomous coding agents — are permitted. They are not a free pass on accountability.

In your report, include an "AI usage" subsection that itemizes every tool you used and exactly what each accomplished. Be specific. Examples of adequate disclosure:

  • "Cursor wrote the data-loading boilerplate in load.py and the argparse plumbing in train.py; we wrote the model and training loop."
  • "Claude API: one chat session to debug a numpy shape mismatch in our training loop. ~3 turns, ~CNY 0.20."
  • "GPT-5 helped revise wording in the abstract and the discussion; we wrote the methodology and analysis sections."

Examples of inadequate disclosure: "I used ChatGPT", "some parts were AI-assisted". If a Q&A question reveals that you cannot explain a component you claimed as your own work, that is a violation.

The deliverable rule: the design decisions must be yours, and you must be able to defend every part of your submission in person.