Mistakes were Made, but not by Me

Aim

This is a small (<30 lines of code) text (or filename) cleaning task requiring the use of regex and/or fuzzy pattern matching.

In last year's Data Mining 2 class I asked students to submit two files as part of a particular assignment. The files were to be named 01-Clean.ipynb and 02-Model.ipynb.

Like all students (except you, of course) they ignored the instructions and did their own thing: some used different file names, and some even decided to "help" by compressing their files :-).

So I had to write code to redress this situation, and I want you to go through the same pain.

Details

Download the following file tree. I have renamed the folders and left only empty files to protect the innocent/guilty. The file tree looks like:

nasa
├── Student-00
│   ├── 01-Clean.ipynb
│   └── 02-Model.ipynb
├── Student-01
│   ├── 01-Clean.ipynb
│   └── 02-Model.ipynb
├── Student-02
│   └── NASA-Software-Defect-Assignment.zip
├── Student-07
│   ├── clean.ipynb
│   └── model.ipynb
...
├── Student-14
│   └── NASA_ASSIGNMENT.zip
...
├── Student-18
│   ├── 01-Clean.ipynb
│   └── 02-ModelVersion2.ipynb
├── Student-19
│   ├── 01-Clean.ipynb
│   └── 02-Model.ipynb
├── Student-20
│   └── 02-Model.ipynb
├── Student-21
│   ├── 01-Clean.ipynb
│   └── 02-model.ipynb
├── Student-22
│   └── NasaSoftwareDefection.zip
...
└── Student-35
    ├── 01\ -\ Cleaning.ipynb
    └── 02\ -\ Pipeline.ipynb
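
As a starting point, reading the tree from disk might be sketched as follows. This is only one possible approach, using the standard-library pathlib module; the function name list_submissions is hypothetical, and the code assumes the download unpacks to a folder (nasa in the listing above) with one sub-folder per student.

```python
from pathlib import Path


def list_submissions(root):
    """Map each student folder name to the sorted list of file names inside it.

    Assumes `root` contains one sub-folder per student; anything in `root`
    that is not a folder is ignored.
    """
    root = Path(root)
    submissions = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        submissions[folder.name] = sorted(
            f.name for f in folder.iterdir() if f.is_file()
        )
    return submissions
```

Calling list_submissions("nasa") would then give a dictionary keyed by Student-00, Student-01, and so on, which is a convenient shape for the matching step.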

Parse this tree to produce a dataframe similar to the one shown below.

Result of cleaning NASA assignment submission files.

Where:

For each student/folder try to determine which file should have been called 01-Clean.ipynb and which file should have been called 02-Model.ipynb. Use regex or fuzzy matching for this - NOT exact string matching!
Aim for matching that is as generic as possible, i.e., consider next year's class - who are going to listen to me even less - and who will come up with other/more variations.
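
To give a feel for the fuzzy option, here is a minimal sketch using difflib.SequenceMatcher from the standard library. The names TARGETS, normalise and best_target, and the 0.6 threshold, are all assumptions made for illustration and are not part of the specification; other similarity measures or libraries would do just as well.

```python
import re
from difflib import SequenceMatcher
from pathlib import Path

# Hypothetical reference stems the submitted names are compared against.
TARGETS = {"clean": "01-clean", "model": "02-model"}


def normalise(name):
    """Lower-case the file stem and collapse runs of non-alphanumerics to '-'."""
    stem = Path(name).stem.lower()
    return re.sub(r"[^a-z0-9]+", "-", stem).strip("-")


def best_target(name, threshold=0.6):
    """Return 'clean' or 'model' for the closest target, or None if too dissimilar.

    The threshold is an arbitrary assumption and would need tuning.
    """
    stem = normalise(name)
    scores = {
        key: SequenceMatcher(None, stem, target).ratio()
        for key, target in TARGETS.items()
    }
    key = max(scores, key=scores.get)
    return key if scores[key] >= threshold else None
```

With this threshold, clean.ipynb and 02-ModelVersion2.ipynb are assigned as expected, but a name such as 02 - Pipeline.ipynb falls below it, so pure string similarity will probably need to be combined with a regex or a list of synonyms.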
In the generated dataframe:
- Columns clean and model store the matched filename or, if unmatched, an empty string.
- Column message stores one of the following values (conditions tested in given order):
  - ARCHIVE if an archive (bz2, gz, rar, zip, 7z) was uploaded.
  - OK if the two uploaded file names match the given specification (including case), or the uploaded files have been identified.
  - UNKNOWN if there is a file that cannot be assigned.
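
The rules above, tested in the given order, might be sketched with regex as follows. Treat this as one possible reading rather than the expected solution: the pattern names, the classify and build_rows helpers, the extra student column, and the choice to treat "pipeline" as a synonym for "model" are all assumptions. It also assumes that clean and model stay empty when an archive is found, and that a folder with a single identifiable file (such as Student-20) counts as OK.

```python
import re

# Assumed patterns: archive extensions come from the list above; the
# keywords (including "pipeline") are guesses based on the sample tree.
ARCHIVE_RE = re.compile(r"\.(bz2|gz|rar|zip|7z)$", re.IGNORECASE)
CLEAN_RE = re.compile(r"clean", re.IGNORECASE)
MODEL_RE = re.compile(r"model|pipeline", re.IGNORECASE)


def classify(files):
    """Return (clean, model, message) for one student's list of file names."""
    # Rule 1: any archive wins, and nothing is matched.
    if any(ARCHIVE_RE.search(name) for name in files):
        return "", "", "ARCHIVE"
    clean = model = ""
    unassigned = []
    for name in files:
        if not clean and CLEAN_RE.search(name):
            clean = name
        elif not model and MODEL_RE.search(name):
            model = name
        else:
            unassigned.append(name)
    # Rules 2 and 3: OK unless some file could not be assigned.
    return clean, model, "UNKNOWN" if unassigned else "OK"


def build_rows(submissions):
    """Turn {student: [file names]} into a list of row dictionaries."""
    rows = []
    for student, files in sorted(submissions.items()):
        clean, model, message = classify(files)
        rows.append(
            {"student": student, "clean": clean, "model": model, "message": message}
        )
    return rows
```

The resulting list of dictionaries can be passed straight to pandas.DataFrame(rows) to obtain the dataframe.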