CS499/599 | AI539 :: S26 :: Trustworthy ML

Must Read

This course uses GitHub Classroom. The invitation link is posted on Canvas (not here, to restrict access to enrolled students). Accept the assignment to get your personal repository — one repo is shared across all 4 homeworks, and you build on it throughout the term.
How to submit: Push your code to your repository — that is your submission. The autograder runs automatically on each push and tests your implementation.
- Datasets (MNIST, CIFAR-10, etc.): Do NOT commit. They are downloaded automatically at test time via download=True.
- Trained checkpoints (final.pth): You MUST commit and push these. The autograder loads them directly from your repository — if they are missing, your tests will fail immediately. After training, run git add checkpoints/*/final.pth and push. The .gitignore whitelist in your repo ensures only the required final.pth files are tracked and blocks everything else (e.g., epoch checkpoints, VGG16 weights which exceed GitHub's 100MB limit).
Write-up: Put your PDF report under the reports/ folder in your repository and push it together with your code.
Local testing: Run python -m pytest autograder/ -v to test your implementation before pushing.
Grades (autograder score + write-up review) will be posted on Canvas.

Homework Overview

The learning objective of this homework is for you to create a codebase to train and evaluate various deep neural network (DNN) models. You also need to analyze the impact of various factors (that you can control during training) on the final DNN models. You will use this codebase and the trained models to complete the homework assignments (HW 2, 3, and 4) throughout the term.

Initial Setup

To begin with, you can choose any deep learning framework that you're already familiar with (e.g., PyTorch, TensorFlow, or ObJAX). If you are not familiar with any of these frameworks, you can start with PyTorch or TensorFlow (> v2.0).

Datasets and DNN Models

We will limit our scope to three popular image classification datasets across the homework assignments: MNIST [link], FashionMNIST [link], and CIFAR-10 [link]. Most deep learning frameworks support these datasets by default. We will also use three DNNs: LeNet [link], VGG16 [link], and ResNet18 [link]. The links point to the original papers.

Task I: Train and Evaluate Your Models

The first task is simple: train 6 DNN models. You will train 3 DNNs (LeNet, VGG16, and ResNet18) on 2 datasets (MNIST and CIFAR-10). Measure each model's performance with two metrics, classification accuracy and loss, and compute both on the training and the testing data.

Please compute those metrics every 5 epochs. Draw 2 plots for each model: { epochs } vs. { training accuracy & testing accuracy } and { epochs } vs. { training loss & testing loss } [see these example plots].
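The measurement described above can be sketched in PyTorch as follows. This is a minimal sketch, assuming a hypothetical helper named evaluate (it is not part of the provided repository) and cross-entropy as the loss; you would call it on both the training and the testing loader every 5 epochs and store the results for plotting.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Return (mean loss, accuracy) of `model` over all samples in `loader`."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        # Sum (not average) per batch so a smaller last batch is weighted correctly.
        total_loss += F.cross_entropy(logits, y, reduction="sum").item()
        correct += (logits.argmax(dim=1) == y).sum().item()
        count += y.size(0)
    return total_loss / count, correct / count


# Toy usage: random data stands in for a real dataset (assumption for illustration).
toy_model = torch.nn.Linear(8, 3)
toy_data = TensorDataset(torch.randn(20, 8), torch.randint(0, 3, (20,)))
toy_loss, toy_acc = evaluate(toy_model, DataLoader(toy_data, batch_size=6))
```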

Task II: Analyze the Impact of Your Training Techniques on Models

Now, let's turn our attention to how you train those 6 DNN models. You probably made various choices to train those models; for example, you may use cross-entropy to compute the loss of a model. Depending on how you train your models, they end up with slightly different properties. In this task, we will analyze how the choices you make when training a model affect its performance. Since this task may require training multiple DNN models, which takes some time, let's reduce our scope to two cases: (i) training LeNet on MNIST and (ii) training ResNet18 on CIFAR-10.

You can control the following things:

Data augmentations: transform the inputs of a neural network, e.g., cropping, resizing, flipping, shifting, ...
Model architectures: add additional layers to a neural network, e.g., adding Dropout before the classification head, ...
Optimization algorithm (or loss functions): choosing a different optimizer, e.g., SGD or Adam, or a different loss function.
Training hyper-parameters: batch-size, learning rate, total number of training iterations (epochs), ...

Let's compare models trained in the following 5 scenarios:

Data augmentation: Rotation: train your models with and w/o rotations and compare the plots.
Data augmentation: Horizontal flip: train your models with and w/o random horizontal flips and compare the plots.
Optimization: SGD/Adam: train your models with SGD or Adam and compare the plots.
Hyper-parameters: batch-size: train your models with two different batch-sizes and compare the plots.
Hyper-parameters: learning rate: train your models with two different learning rates and compare the plots.

You may (or may not) find a significant difference between the two models in each scenario. Explain your intuitions on why you observe (or do not observe) a difference.
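One way to keep the five comparisons clean is to change exactly one factor per scenario and hold everything else at a baseline. The sketch below illustrates this; the names BASELINE, SCENARIOS, and build_runs are hypothetical, and all numeric values are placeholders rather than values prescribed by this assignment.

```python
# Baseline values are placeholders (assumptions), not values prescribed by the assignment.
BASELINE = {"rotation": False, "hflip": False, "optimizer": "sgd", "batch_size": 128, "lr": 0.01}

# One override per scenario; every other setting stays at its baseline value.
SCENARIOS = {
    "rotation": {"rotation": True},
    "hflip": {"hflip": True},
    "optimizer": {"optimizer": "adam"},
    "batch_size": {"batch_size": 32},
    "lr": {"lr": 0.1},
}


def build_runs(baseline, scenarios):
    """Return {scenario name: (baseline config, variant config)}."""
    return {
        name: (dict(baseline), {**baseline, **override})
        for name, override in scenarios.items()
    }


runs = build_runs(BASELINE, SCENARIOS)
```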

Submission Instructions

Push your code to your GitHub Classroom repository to submit. Place your write-up as a PDF under the reports/ folder and push it with your code. Your PDF write-up should contain the following things:

Task I

Your experimental setup: specify your training configurations such as your hyper-parameter choices.
Your 12 plots: 2 plots for each model, and you have 6 models.
Your analysis: write down a summary (the accuracy and loss of the models); provide 2-3 sentences explaining why you see the results.

Task II

Your 20 plots: 2 plots for each model, and you have 2 models for each of the five scenarios.
Your analysis: Provide 2-3 sentences for each scenario explaining why you observe the result.

Homework Overview

The learning objective of this homework is for you to attack your models built in Homework 1 with white-box adversarial examples. You will also use adversarial training to build your robust models. We then analyze the impact of several factors—that you can control as an attacker or a defender—on the success rate of attack (or defense). You can start this homework from the codebase you wrote for Homework 1.

Initial Setup

Datasets and DNN Models

We will use two datasets: MNIST [link] and CIFAR-10 [link]. We focus on only two DNN models: LeNet [link] and ResNet18 [link].

Recommended Code Structure

You will implement the PGD attack in attacks/PGD.py and write two driver scripts, adv_attack.py and adv_train.py. The rest is the same as Homework 1.


    Root
    - [New] attacks/PGD.py : implement the PGD attack function here.
    - [New] adv_attack.py  : a Python script to run adversarial attacks on a pre-trained model.
    - [New] adv_train.py   : a Python script for adversarial-training a model.
    ...

Note

You may find off-the-shelf libraries, e.g., adversarial-robustness-toolbox [link], where you can plug-n-play attacks on your models. I do NOT recommend using any of those libraries for this homework. However, you may refer to the community implementations of attacks and defenses and rewrite them yourself. Remember: the important learning objective is to understand the attack internals and implement them.

Task I: Attack Your Models

Let's start with attacking your DNN models trained in Homework 1. We will attack your 2 DNNs: LeNet on MNIST and ResNet18 on CIFAR10. You need to use PGD [Madry et al.] as the adversarial example-crafting algorithm. Your job is to craft PGD adversarial examples for all the test-time samples (i.e., the 10k test-set samples of each of MNIST and CIFAR10). To measure the effectiveness of your attacks, we will compute the classification accuracy on these adversarial examples. Make sure you evaluate the adversarial examples on the same DNN that you used for crafting them.

Here, you need to implement the following function in attacks/PGD.py.


    def PGD(x, y, model, loss, niter, epsilon, stepsize, randinit, ...)
    - x: a clean sample
    - y: the label of x
    - model: a pre-trained DNN you're attacking
    - loss: a loss you will use
    - [PGD params.] niter: # of iterations
    - [PGD params.] epsilon: l-inf epsilon bound
    - [PGD params.] stepsize: the step-size for PGD
    - [PGD params.] randinit: start from a random perturbation if set true
    # You can add more arguments if required

This PGD function crafts the adversarial example for a sample (x, y) [or a batch of samples]. It takes (x, y), a pre-trained DNN, and attack parameters; and returns the adversarial example(s) (x', y). Note that you can add more arguments to this function if required. Please use the following attack hyper-parameters as a default:

niter: 5
epsilon: 0.3 (MNIST) and 0.03 (CIFAR10)
stepsize: epsilon / 4 (i.e., 0.075 for MNIST, 0.0075 for CIFAR-10)
randinit: false
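A minimal PyTorch sketch of an l-inf PGD attack with this signature is shown below. It is one possible implementation, not a reference solution: it assumes inputs are images scaled to [0, 1] (no normalization applied outside the model) and that the function receives a batch. The toy model at the end is only a stand-in for your trained LeNet.

```python
import torch
import torch.nn as nn


def PGD(x, y, model, loss, niter=5, epsilon=0.3, stepsize=0.075, randinit=False):
    """Craft l-inf PGD adversarial examples for a batch (x, y); returns x'."""
    model.eval()
    x = x.detach()
    x_adv = x.clone()
    if randinit:
        # Start from a uniform random point inside the epsilon-ball.
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    for _ in range(niter):
        x_adv.requires_grad_(True)
        cost = loss(model(x_adv), y)
        grad, = torch.autograd.grad(cost, x_adv)
        with torch.no_grad():
            # Ascend the loss along the sign of the input gradient.
            x_adv = x_adv + stepsize * grad.sign()
            # Project onto the epsilon-ball around x, then onto the valid pixel range.
            x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()


# Toy usage: a linear model and random inputs stand in for LeNet and MNIST.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))
x_adv = PGD(x, y, toy_model, nn.CrossEntropyLoss())
```

The projection step is what keeps the perturbation within the epsilon bound no matter how many iterations you run, so it is worth checking max |x' - x| <= epsilon in your own tests.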

To measure the effectiveness of the adversarial examples, we will write an evaluation script under if __name__ == "__main__": in the same file. For all the 10k adversarial examples crafted, compute the classification accuracy of the DNN model you attacked. Note that you will observe much lower accuracy than on the clean test-time samples.

Task II: Analyze the Impact of Several Factors on Your Attack's Success Rate

Now, let's turn our attention to several factors that can increase/decrease the effectiveness of your white-box attacks. In particular, we will vary: (1) the attack hyper-parameters (e.g., the number of iterations) and (2) the way we trained our DNN models (see Task II of Homework 1).

Subtask II-1: Analyze the Impact of Attack Hyper-parameters

We will focus on two attack hyper-parameters: niter and epsilon. Use the 2 DNNs in Task I (LeNet on MNIST and ResNet18 on CIFAR10).

(1) Set the number of iterations to each value in {1, 2, 3, 4, 5, 10, 20, 30, 40, 80, 100}.
(2) Fix the number of iterations to 5, and set epsilon to each value in {0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0}.

Please use those different hyper-parameters and compute the classification accuracy of 2 DNN models on your adversarial examples. Draw plots: { # iterations } vs. { classification accuracy } and { epsilon } vs. { classification accuracy } and explain your intuitions on why you observe them.

Subtask II-2: Analyze the Impact of the Training Techniques You Use

One may think we can use some nice training techniques for reducing the effectiveness of white-box adversarial attacks. We plan to run some experiments to evaluate this claim. In particular, we're interested in four techniques across two categories: data augmentations (rotation, horizontal flip) and regularizations (Dropout, weight decay).

(1) Data augmentations: In Task II of Homework 1, we examined two simple augmentations: rotation and horizontal flips. We also have DNNs trained with/without those augmentations. On the 2 DNNs (LeNet on MNIST and ResNet18 on CIFAR10) trained with/without each data augmentation, craft adversarial examples on the test-set samples and measure the classification accuracy on them.
(2) Regularizations: We also examine two techniques: Dropout [link] and weight decay. Let's focus only on ResNet18 on CIFAR10.
- 1) To examine the impact of Dropout, we need to modify ResNet18's network architecture. Add a Dropout layer before its penultimate layer and set the rate to 0.5. Train this modified ResNet18 (henceforth called ResNet18-Dropout). Craft adversarial examples on this model, measure the classification accuracy, and compare the accuracy to what we have with ResNet18 (w/o Dropout).
- 2) To examine the impact of weight decay, we will train ResNet18 with the Adam optimizer [link] on CIFAR10. You will train 5 ResNet18 models with different weight decay values: {1e-5, 1e-4, 1e-3, 1e-2, 1e-1}. Please don't be surprised when you see bad accuracy with higher weight decay. Craft adversarial examples on those five DNN models and measure the accuracy on both the clean samples and adversarial examples. Compare how much accuracy you can decrease on each model.

You may (or may not) find that each technique increases/decreases the accuracy degradation caused by adversarial examples. Please write down the accuracy degradations and explain your intuitions on why you observe them in your report.
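One possible way to insert Dropout without rewriting the whole network is to wrap an existing layer, as sketched below. The helper add_dropout_before_head is hypothetical, and it assumes your model exposes its final linear layer as an attribute named fc; the sketch places Dropout directly in front of that layer, so adjust the placement if your reading of the instruction differs. TinyNet is only a stand-in for your ResNet18.

```python
import torch
import torch.nn as nn


def add_dropout_before_head(model, head_name="fc", rate=0.5):
    """Wrap the layer stored at `head_name` so that Dropout runs right before it."""
    head = getattr(model, head_name)
    setattr(model, head_name, nn.Sequential(nn.Dropout(p=rate), head))
    return model


class TinyNet(nn.Module):
    """Stand-in network with a feature extractor and a linear head named `fc`."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(32, 16), nn.ReLU())
        self.fc = nn.Linear(16, 10)

    def forward(self, x):
        return self.fc(self.features(x))


net = add_dropout_before_head(TinyNet())
out = net(torch.randn(2, 32))
```

Note that wrapping the head changes its parameter names in the state dict (fc.weight becomes fc.1.weight), so a checkpoint saved from the original architecture will not load into the wrapped one without renaming keys.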

Task III: Defend Your Models with Adversarial Training

One way to mitigate adversarial attacks is to train your models with adversarial training (AT). Here, we will examine the effectiveness of AT.

Let's implement a script for AT. Make a copy of your train.py and name it adv_train.py. We will convert the normal training process into adversarial training. In train.py, each iteration trains the model on a batch of clean training samples. In adv_train.py, you instead craft adversarial examples from the batch of clean samples and train your model on them. Note that this is slightly different from the work by Goodfellow et al.
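The change from normal to adversarial training can be sketched as a single training step. This is a minimal sketch under stated assumptions: pgd_linf is a compact stand-in for your own PGD function, adv_train_step is a hypothetical name, inputs are assumed to lie in [0, 1], and switching the model to eval mode while crafting is a design choice rather than a requirement of the assignment.

```python
import torch
import torch.nn as nn


def pgd_linf(model, loss_fn, x, y, epsilon, stepsize, niter):
    """Compact l-inf PGD, included only to keep this sketch self-contained."""
    x_adv = x.clone().detach()
    for _ in range(niter):
        x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)
        with torch.no_grad():
            x_adv = x_adv + stepsize * grad.sign()
            x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon).clamp(0.0, 1.0)
    return x_adv.detach()


def adv_train_step(model, optimizer, loss_fn, x, y, epsilon=0.3, stepsize=0.075, niter=5):
    """One adversarial-training update: attack the clean batch, then fit on the result."""
    model.eval()  # craft with fixed BatchNorm/Dropout behavior (a design choice)
    x_adv = pgd_linf(model, loss_fn, x, y, epsilon, stepsize, niter)
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)  # train on adversarial samples only
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a linear model and random data stand in for LeNet and MNIST.
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(16, 4))
toy_opt = torch.optim.SGD(toy_model.parameters(), lr=0.1)
xb, yb = torch.rand(8, 1, 4, 4), torch.randint(0, 4, (8,))
before = toy_model[1].weight.detach().clone()
step_loss = adv_train_step(toy_model, toy_opt, nn.CrossEntropyLoss(), xb, yb)
```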

Please train 2 DNN models (LeNet and ResNet18) adversarially on MNIST and CIFAR10, respectively. Once you have trained those robust models, craft adversarial examples on them and compute the accuracy. Note that we use the same attack hyperparameters as in Task I. Compare:

(1) How does your robust models' accuracy on adversarial examples compare to that of your undefended models?
(2) How does your robust models' accuracy on clean test-set examples compare to that of your undefended models?
(3) Let's increase the PGD attack iterations from 5 to 7. How does your robust models' accuracy change?

Please explain your intuitions on why you observe them in your report.

[Extra +3 pts]: Use Your Adversarial Examples to Attack Real-world DNNs

You may be curious how effective the adversarial examples that you crafted are against DNNs deployed in the real world. Here are some real-world image classification demos [a simple demo]. Please store 10 adversarial examples for each of the MNIST and CIFAR10 attacks (Task I) as .png files. Upload them to one of the image classification demos and see how the predicted labels differ from those of your DNNs.

Please show your adversarial examples, the classification of them on your DNNs, and the predicted labels on the demo you chose in your report.

Submission Instructions

Push your code to your GitHub Classroom repository to submit. Place your write-up as a PDF under the reports/ folder and push it with your code. Your PDF write-up should contain the following things:

Task I

The classification accuracy of clean test-set samples on 2 DNNs (LeNet and ResNet18).
The classification accuracy of your adversarial examples on 2 DNNs.
Your analysis: write down 2-3 sentences explaining why you see those results.

Task II

Subtask II-1

Your 4 plots: { # iterations } vs. { classification accuracy } and { epsilon } vs. { classification accuracy } on each of your 2 DNNs.
Your analysis: Provide 2-3 sentences for each case explaining why you observe the result.

Subtask II-2

Your analysis: Provide 2-3 sentences for each case explaining why you observe the result.

Task III

The classification accuracy of clean test-set samples on your robust DNNs.
The classification accuracy of your adversarial examples on your robust DNNs.
Your analysis: write down 2-3 sentences for each of the three questions above.

[Extra +3 pts]

Your adversarial examples shown as images.
Their classification results on your DNN models.
Their classification results on the real-world DNNs.

Homework Overview

The learning objective of this homework is for you to perform data poisoning attacks on machine learning models (some of the attacks will require the neural networks trained in Homework 1). You will also test the effectiveness of simple defenses against the poisoning attacks you will implement. You can start this homework from the codebase you wrote in Homework 1.

Initial Setup

Datasets

We will use two datasets: MNIST-1/7 and CIFAR-10 [link]. MNIST-1/7 is a subset of MNIST that contains only samples from classes 1 and 7; the dataset is popular for binary classification. We set the label for class 1 to 0 and the label for class 7 to 1, so MNIST-1/7 becomes a binary classification problem with labels {0, 1}. Popular deep learning frameworks, such as PyTorch, support sub-sampling. Please search on Google for examples of how to subsample classes from a dataset [Here is an example in PyTorch].
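The class selection and relabeling can be sketched in plain Python as follows. The helper select_binary_subset is a hypothetical name; it works on any sequence of integer labels and returns the indices to keep together with the new {0, 1} labels.

```python
def select_binary_subset(labels, negative_class=1, positive_class=7):
    """Return (indices, binary_labels) for samples of the two chosen classes.

    Samples of `negative_class` get label 0 and samples of `positive_class`
    get label 1; all other samples are dropped.
    """
    indices, binary_labels = [], []
    for idx, label in enumerate(labels):
        label = int(label)
        if label == negative_class:
            indices.append(idx)
            binary_labels.append(0)
        elif label == positive_class:
            indices.append(idx)
            binary_labels.append(1)
    return indices, binary_labels


# Toy usage on a short label list (a stand-in for the MNIST training labels).
kept_idx, new_labels = select_binary_subset([1, 7, 3, 7, 0, 1])
```

In PyTorch, the returned indices can be passed to torch.utils.data.Subset to build the reduced dataset; the relabeling then has to be applied when samples are read, for example through a small wrapper dataset.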

Models

Here, we consider two models: logistic regression [an example in PyTorch] for MNIST-1/7 and ResNet18 for CIFAR-10 [Link].

Recommended Code Structure

You will implement the poisoning logic in craft_poisons.py and eval_clabel.py. Training on a poisoned dataset reuses train.py with the --poison-dir flag. The rest is the same as Homework 1.


    Root
    - [New] craft_poisons.py  : a Python script to craft poisoning samples.
    - [New] eval_clabel.py    : a Python script to evaluate the clean-label poisoning attack.
    - train.py (modified)     : use --poison-dir to train on a contaminated dataset.
    - [Extra] poison_remove.py: a Python script for removing suspicious samples (extra credit).
    ...

Task I: Poisoning Attack against Logistic Regression Models

Let's start with poisoning attacks against logistic regression models on MNIST-1/7. Here, we will conduct an indiscriminate poisoning attack. Your job is to construct contaminated training sets that degrade the accuracy of a model once the model is trained on them.

We will use a simple poisoning scheme called random label-flipping. It constructs a contaminated training set by randomly flipping the labels of X% of the samples in the original training set. For example, you can select 10% of the MNIST-1/7 training samples (~1.7k) and flip their labels from 0 to 1 (or vice versa).

Your job is to construct four contaminated training sets, containing 5%, 10%, 25%, and 50% poisons, respectively. You will train five logistic regression models: one on each of the four contaminated training sets and one on the clean MNIST-1/7 dataset. Please measure how much accuracy degradation each attack causes compared to the accuracy of the model trained on the clean data.

Here, you need to implement the following function in attacks/lflip.py.


    def craft_random_lflip(dataname, ratio, data_dir='./data'):
    - dataname : dataset name string ('mnist', 'fmnist', or 'cifar10')
    - ratio    : fraction of samples to poison (e.g., 0.1 for 10%)
    - data_dir : dataset root directory
    # Returns: (poisoned_train_set, clean_test_set)

This function constructs a poisoned training set in which a fraction ratio of the samples have flipped labels. The dataname argument identifies which dataset to load, and ratio is a number between 0 and 1. Note that this is only an example of how to write a function for crafting poisoned training sets. Please feel free to use your own function if that is more convenient.
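The core of random label-flipping, independent of how the dataset is loaded, can be sketched in plain Python as below. The helper flip_labels is a hypothetical name (it could be called from inside craft_random_lflip); for binary labels it flips 0 to 1 and 1 to 0, and for more classes it picks a different class uniformly at random.

```python
import random


def flip_labels(labels, ratio, num_classes=2, seed=0):
    """Return (poisoned_labels, flipped_indices) with `ratio` of the labels changed."""
    rng = random.Random(seed)
    n_flip = int(round(ratio * len(labels)))
    # Sample without replacement so that exactly n_flip distinct samples are poisoned.
    flip_idx = rng.sample(range(len(labels)), n_flip)
    poisoned = list(labels)
    for i in flip_idx:
        # Choose any class other than the current one (the only other class if binary).
        choices = [c for c in range(num_classes) if c != poisoned[i]]
        poisoned[i] = rng.choice(choices)
    return poisoned, sorted(flip_idx)


# Toy usage: 100 binary labels, 10% of them flipped.
clean = [0, 1] * 50
poisoned, flipped = flip_labels(clean, ratio=0.1)
```

Fixing the seed makes the contaminated sets reproducible, which helps when you compare models trained at different poisoning ratios.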

To train on the poisoned dataset, use train.py with the --poison-dir flag pointing to your saved poisoned training set. For example: python train.py --dataset mnist --model logistic --poison-dir ./poisons/lflip_mnist_r0.1_train.pkl.

Task II: Poisoning Attacks on Deep Neural Networks

Now, let's turn our attention to attacking neural networks. As I explained in the lecture, deep neural networks are less susceptible to indiscriminate poisoning attacks, i.e., it's hard to degrade their accuracy unless we inject many poisons. We therefore focus on targeted poisoning attacks.

Your job here is to conduct the Poison Frogs! [Link] attack on ResNet18 trained on CIFAR-10. You can refer to the authors' code [TensorFlow] or the community implementations [PyTorch]. Be careful if you use the community code; there is a chance that it implements the attack incorrectly.

Instructions

We conduct this attack between two classes in CIFAR-10: frogs and dogs. In particular, we aim to make a frog sample be classified as a dog. We will use the ResNet18 trained in Homework 1. Please follow the instructions below to inflict this misclassification.

Choose 5 frog images (targets) from the CIFAR-10's test-set.
Choose 100 dog images (base images) from the CIFAR-10's test-set. (You will use them to craft poisons).
Use the 100 base images to craft 100 poisons for each target. Please use your ResNet18 to extract features (see the details below).
Construct 6 contaminated training sets for each target by injecting {1, 5, 10, 25, 50, 100} poisons into the original training data.
Finetune only the last layer of your ResNet18 for 10 epochs on each contaminated training set. Check if your finetuned model misclassifies each target (frog) as a dog. If the model misclassifies the target as a dog, your attack is successful. Otherwise, it's an attack failure.

In total, you will have 30 contaminated training sets (= 6 different sets x 5 targets).

Implementation

Here, you need to implement the following function in attacks/clabel.py.


    def craft_clabel_poisons(model, target, bases, niter, lr, beta, device=None):
    - model : a pre-trained ResNet18
    - target: a target sample (a frog)
    - bases : a set of base samples (dogs)
    - niter : number of optimization iterations
    - lr    : learning rate for your optimization
    - beta  : hyper-parameter (refer to the paper)
    # You can add more arguments if required

This function crafts clean-label poisons. It takes a model (ResNet18) to extract features for a single target and 100 base samples. It also takes optimization hyper-parameters such as niter, lr, beta, etc. Once the function sufficiently optimizes your poisons, it will return 100 poisons crafted from the bases. Please refer to the author's code, the community implementations, and the original study for reference.
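For orientation, the feature-collision objective of Poison Frogs! and its forward-backward update can be written as below. Here f denotes the feature extractor (the network up to its penultimate layer), t the target, b a base image, x_0 = b the starting point, and the step size \lambda plays the role of the lr argument. The notation is a paraphrase of the paper, so please check it against the original before relying on it.

```latex
\begin{align*}
p &= \operatorname*{arg\,min}_{x} \; \lVert f(x) - f(t) \rVert_2^2 + \beta \, \lVert x - b \rVert_2^2 \\
\hat{x}_i &= x_{i-1} - \lambda \, \nabla_x \lVert f(x_{i-1}) - f(t) \rVert_2^2 && \text{(forward step)} \\
x_i &= \frac{\hat{x}_i + \lambda \beta \, b}{1 + \beta \lambda} && \text{(backward step)}
\end{align*}
```

The forward step pulls the poison toward the target in feature space, while the backward step pulls it back toward the base image in input space; beta controls the balance between the two.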

To evaluate the clean-label attack, use eval_clabel.py. This script fine-tunes only the last layer of your ResNet18 on each contaminated training set and measures the attack success rate (ASR).

[Extra +3 pts]: Defeat Data Poisoning Attacks

In the lecture, we learned two simple defense mechanisms against data poisoning attacks: (1) RONI [Paper] and (2) Data sanitization [Link]. Here, we implement those defenses and use them against the two data poisoning attacks (random label-flipping and clean-label poisoning).

Subtask I: RONI against Random Label-flipping

Let's start with RONI. Use an MNIST-1/7 training set containing 20% poisons (i.e., 20% of the samples have flipped labels).

First, sub-sample any 20% of the samples from the MNIST-1/7 training set. You will use this set (D_v) to remove poisons from the training data.
Next, split the contaminated training set (D_tr) into multiple sets (D_tr_i, where i in [1, 170]), each containing 100 training samples (this process creates approximately 170 sets).
First, train a logistic regression model on D_tr_1; compute the model's accuracy on D_v and save it.
Iteratively (for i = 2, ..., 170), train your model on D_tr_1 + ... + D_tr_i. Each time, compare the i-th model's accuracy with the (i-1)-th model's. If the accuracy drops by more than X% (a hyper-parameter of your choice), remove D_tr_i from the training set and continue.
Use at least two X% values and check how many poisons you removed in each case. You also need to check the accuracy of your model after removing the suspicious samples (i.e., you will examine the effectiveness of the RONI defense).
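The filtering loop above can be sketched independently of the model as follows. This is a sketch under stated assumptions: roni_filter, train_fn, and acc_fn are hypothetical names, acc_fn is assumed to measure accuracy on the held-out set D_v, and each chunk is compared against the most recently accepted model. The toy train_fn and acc_fn at the bottom are stand-ins that only exercise the control flow; replace them with real logistic regression training and evaluation.

```python
def roni_filter(chunks, train_fn, acc_fn, max_drop):
    """Greedy RONI-style filtering over a list of training chunks.

    chunks   : list of lists of training samples (D_tr_1, D_tr_2, ...)
    train_fn : callable(samples) -> model
    acc_fn   : callable(model) -> accuracy on the held-out set
    max_drop : largest accuracy drop tolerated before a chunk is rejected
    Returns (kept_samples, removed_chunk_indices).
    """
    kept = list(chunks[0])
    prev_acc = acc_fn(train_fn(kept))
    removed = []
    for i, chunk in enumerate(chunks[1:], start=1):
        candidate = kept + list(chunk)
        acc = acc_fn(train_fn(candidate))
        if prev_acc - acc > max_drop:
            removed.append(i)  # the chunk hurts accuracy too much: drop it
        else:
            kept, prev_acc = candidate, acc
    return kept, removed


# Toy stand-ins: a "sample" is True if clean and False if poisoned, the "model"
# is the sample list itself, and "accuracy" is the fraction of clean samples.
toy_chunks = [[True] * 4, [True] * 4, [False] * 4, [True] * 4]
kept, removed = roni_filter(
    toy_chunks,
    train_fn=lambda samples: samples,
    acc_fn=lambda model: sum(model) / len(model),
    max_drop=0.05,
)
```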

Subtask II: Data Sanitization against Clean-label Poisoning

Let's move on and defeat clean-label poisoning (Poison Frogs!). Please choose any successful attack (i.e., choose a target and 100 poisons).

We will use ResNet18 fine-tuned on the contaminated training set. Let's first compute features for all the training samples with the model.
Using the features (for the 50k original training samples + the 100 poisons you add), we will detect suspicious samples and remove them from the training set. Please identify the outlier samples by running this UMAP example [link] on the collected features.
Let's sanitize the training set by removing outliers. Please try two or three removal amounts (e.g., 100, 200, or 300 samples) and compose a sanitized training set for each.
Finetune your original ResNet18 on each sanitized training set and check whether the poisoning attack is successful.

Submission Instructions

Push your code to your GitHub Classroom repository to submit. Place your write-up as a PDF under the reports/ folder and push it with your code. Do NOT commit datasets or trained model checkpoints to the repository. Your PDF write-up should contain the following things:

Task I

Your plot: { the ratio of poisons in the training set } vs. { classification accuracy } on the test-set
Your analysis: write down 2-3 sentences explaining why you see those results.

Task II

Your table: 2 rows (the upper one is for the number of poisons and the lower one is the number of successful attacks over 5 targets)
Your analysis: write down 2-3 sentences explaining why you see those results.

[Extra +3 pts]

Subtask I

Your plot: { # iterations } vs. { the accuracy of your model } on the test-set.
Your analysis: write down 2-3 sentences explaining why you see those results.

Subtask II

Your analysis: write down 2-3 sentences explaining whether you successfully mitigated clean-label poisoning or not. (If possible) analyze whether the defense becomes more successful when you remove more suspicious samples.

Homework Overview

The learning objective of this homework is for you to understand (1) a mechanism for measuring the privacy leakage of machine learning (ML) models and (2) a mechanism to bound the leakage while training ML models. The best way to understand those mechanisms is to implement them yourself. Here, we will focus on membership inference attacks, especially the one proposed by Yeom et al., and the de-facto standard defense, differential privacy (DP). You can start this final homework from the codebase you used for HW 1-3, as usual.

Initial Setup

Datasets and DNN Models

We will use two datasets: FashionMNIST [link] and CIFAR-10 [link]. Note that we switch from MNIST to FashionMNIST because Yeom et al.'s attack is less effective on MNIST. We consider two DNNs: LeNet [link] and ResNet18 [link]. You will train LeNet on FashionMNIST and ResNet18 on CIFAR-10.

Recommended Code Structure

You will write one new script, mi_attack.py. DP-SGD training is integrated into the existing train.py via the --dp flag. The rest is the same as HW 1-3.


    Root
    - [New] mi_attack.py : a Python script to run membership inference attacks (Yeom et al.'s).
    - train.py (modified): use --dp --dp-epsilon <ε> to enable DP-SGD training.
    ...

Task I: Membership Inference Attacks on Machine Learning Models

Let's start with the membership inference attack formulated by Yeom et al. [link] on your DNN models. Your job is to implement this attack and evaluate its effectiveness. Please write your attack code in mi_attack.py.

Build Models: You will first train the victim models. Train LeNet on FashionMNIST for 50 epochs and ResNet18 on CIFAR-10 for 100 epochs. During training, you have two sub-tasks: (1) save a checkpoint every 10 epochs (i.e., at {10, 20, 30, 40, 50} for FashionMNIST and {10, 20, …, 100} for CIFAR-10) and (2) draw a plot { # epochs } vs. { the train and test losses }. In the end, we have 15 models (5 in FashionMNIST and 10 in CIFAR-10) and two plots where each describes the training and testing losses over epochs.

[Note] Please make sure that the maximum accuracy of your LeNet and ResNet18 models exceeds 80% on their respective datasets during training.

Prepare Datasets for Evaluating Membership Inference Attacks: Your job is to construct a dataset to evaluate the attack's effectiveness. Typically, we choose a dataset with 10k samples: 5k chosen randomly from the training set (members) and the other 5k chosen randomly from the testing set (non-members). You will compose two such datasets, one from FashionMNIST and one from CIFAR-10.
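One way to build such an evaluation set is to sample indices and then wrap them with your framework's subset utility. The sketch below only draws the indices; the function name, the seed, and the label convention (1 = member, 0 = non-member) are assumptions made for illustration.

```python
import numpy as np


def sample_mi_eval_indices(n_train, n_test, n_per_class=5000, seed=0):
    """Draw random member (train) and non-member (test) indices without replacement."""
    rng = np.random.default_rng(seed)
    member_idx = rng.choice(n_train, size=n_per_class, replace=False)
    nonmember_idx = rng.choice(n_test, size=n_per_class, replace=False)
    # Membership labels aligned with [members, non-members]: 1 = member, 0 = non-member
    labels = np.concatenate(
        [np.ones(n_per_class, dtype=int), np.zeros(n_per_class, dtype=int)]
    )
    return member_idx, nonmember_idx, labels


# FashionMNIST ships 60,000 training and 10,000 test images
# (CIFAR-10: 50,000 and 10,000).
member_idx, nonmember_idx, labels = sample_mi_eval_indices(n_train=60000, n_test=10000)
```

Fixing the seed keeps the same 10k samples across all checkpoints, so the membership advantages of different models are measured on identical data.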

Perform Yeom et al. Attack: We will run the Yeom et al. membership inference attack on the 15 models we trained.

The first step is to implement the attack. It uses a threshold (a loss value) to decide whether a sample that the adversary queries is a member or a non-member. If the victim model's loss on a sample is lower than the threshold, the sample is classified as a member. Otherwise, it is classified as a non-member. You can compute this loss threshold as the mean loss over a small held-out subset of 100 training samples (see --n-threshold in mi_attack.py).

Once we have a threshold, we use it to classify each of the 10k evaluation samples as a member or a non-member. We then compute the membership advantage over the 10k samples. The membership advantage is defined as: Adv = |TPR − FPR|, where TPR (true positive rate) is the fraction of members correctly identified as members, and FPR (false positive rate) is the fraction of non-members incorrectly identified as members. Your job is to run this attack process on all 15 models we trained.
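The threshold rule and the advantage computation described above can be sketched in a few lines of NumPy. This assumes you have already computed per-sample losses with your model; the function names are illustrative and are not the interface of mi_attack.py, and the loss values are made up for the example.

```python
import numpy as np


def yeom_threshold(train_losses):
    """Loss threshold: mean loss over a small subset of training samples."""
    return float(np.mean(train_losses))


def membership_advantage(member_losses, nonmember_losses, threshold):
    """Predict 'member' when loss < threshold; return (Adv, TPR, FPR)."""
    member_losses = np.asarray(member_losses, dtype=float)
    nonmember_losses = np.asarray(nonmember_losses, dtype=float)
    tpr = float(np.mean(member_losses < threshold))     # members flagged as members
    fpr = float(np.mean(nonmember_losses < threshold))  # non-members flagged as members
    return abs(tpr - fpr), tpr, fpr


# Toy per-sample losses (made up for illustration, not real results)
threshold = yeom_threshold([0.1, 0.3, 0.2, 0.2])  # mean is 0.2
adv, tpr, fpr = membership_advantage(
    member_losses=[0.05, 0.10, 0.15, 0.50],     # 3 of 4 below the threshold -> TPR = 0.75
    nonmember_losses=[0.10, 0.40, 0.60, 0.90],  # 1 of 4 below the threshold -> FPR = 0.25
    threshold=threshold,
)
print(adv, tpr, fpr)  # 0.5 0.75 0.25
```

A model that treats training and test samples alike gives TPR ≈ FPR and an advantage near 0, while a model with a large gap between train and test loss pushes the advantage toward 1.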

The final step is to create two plots from the collected membership advantages, one for FashionMNIST and one for CIFAR-10. In each plot, you will show { # epochs } vs. { membership advantage }.

Task II: Differential Privacy as a Defense against Membership Inference

Let's turn our attention to defenses. We will consider the standard defense, differentially-private stochastic gradient descent (DP-SGD), to defend against the Yeom et al. attacks [link]. Your job is to train models with DP-SGD and compare the attack success with the above results (no DP-SGD).

Train Models with DP-SGD: Thanks to the community's effort, we don't need to implement this mechanism from scratch. You can use the off-the-shelf libraries, such as Opacus [link] in PyTorch or TF-Privacy [link] in TensorFlow. You can incorporate DP-SGD into your current training script.

DP-SGD is already integrated into train.py. Enable it by passing --dp --dp-epsilon <ε>. These examples [example1, example2] explain how Opacus wraps your training loop.
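For intuition about what such a library does on every update, here is a minimal NumPy sketch of a single DP-SGD step: clip each sample's gradient, sum, add Gaussian noise, then average. It is not Opacus's API and it does not compute epsilon; the function name and parameter names are assumptions. A library such as Opacus additionally obtains per-sample gradients from your model and tracks the privacy budget spent.

```python
import numpy as np


def dp_sgd_step(params, per_sample_grads, lr, max_grad_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat parameter vector (illustrative sketch only)."""
    grads = np.asarray(per_sample_grads, dtype=float)  # shape: (batch_size, n_params)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale down any per-sample gradient whose L2 norm exceeds max_grad_norm
    scale = np.minimum(1.0, max_grad_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Gaussian noise with standard deviation noise_multiplier * max_grad_norm
    noise = rng.normal(0.0, noise_multiplier * max_grad_norm, size=grads.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / grads.shape[0]
    return params - lr * noisy_mean


rng = np.random.default_rng(0)
params = np.zeros(2)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5; only the first gets clipped
# noise_multiplier=0.0 disables the noise so the clipping effect is easy to check by hand
new_params = dp_sgd_step(params, grads, lr=0.1, max_grad_norm=1.0,
                         noise_multiplier=0.0, rng=rng)
```

Clipping bounds how much any single sample can move the model, and the noise masks what remains of that influence. A smaller epsilon corresponds to more noise, which is why you should expect test accuracy to drop as the privacy guarantee gets stronger.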

We will train our models with different privacy guarantees. Note that in DP-SGD, we control a hyper-parameter (epsilon) to bound a model's privacy leakage. For each dataset, we train 5 different models with epsilons in {1, 2, 4, 8, 10}. In total, we will have 10 models, 5 for FashionMNIST and 5 for CIFAR-10. Train each model for the same number of epochs as in Task I (50 for FashionMNIST, 100 for CIFAR-10) and save the model with the best test accuracy.

[Note] Set all the training hyper-parameters to the same as those we use in Task I, except the learning rate. I recommend doubling the learning rate.

Run Yeom et al. on Models Trained with DP-SGD: We now run the attack formulated by Yeom et al. on the 10 models trained with DP-SGD. Compute the membership advantages.

The last step is to create two plots from the collected membership advantages, one for FashionMNIST and one for CIFAR-10. In each plot, you will show both { epsilon } vs. { membership advantage } and { epsilon } vs. { model's test accuracy } on the same axes (using a dual y-axis or two side-by-side subplots).

Submission Instructions

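Push your code to your GitHub Classroom repository to submit. Place your write-up as a PDF under the reports/ folder and push it with your code. Do NOT commit datasets. As stated in the Must Read section, the required final.pth checkpoints are the exception: commit and push them, because the autograder loads them directly from your repository. Do not commit any other checkpoints (e.g., per-epoch checkpoints). Your PDF write-up should contain the following things: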

Task I

Your two plots: { # epochs } vs. { train and test loss} for FashionMNIST and CIFAR-10.
Your two plots: { # epochs } vs. { membership advantage } for FashionMNIST and CIFAR-10.
Your analysis: write down 2-3 sentences explaining why you see those results.

Task II

Your two plots: { epsilon } vs. { test acc. and membership advantage } for FashionMNIST and CIFAR-10.
Your analysis: write down 2-3 sentences explaining why you see those results.

Thanks for your hard work!