THM: Advent of Cyber 2023 - Day 14 - The Little Machine That Wanted to Learn

The fourteenth day of AoC23 consists of a Machine Learning task.
We’re going to create an “AI” :)

Learning Objectives
#

  • What is machine learning?
  • Basic machine learning structures and algorithms
  • Using neural networks to predict defective toys

Overview
#

These are my key takeaways from today:

  • There are several types of Machine Learning structures.
  • There are several layers in the structures.
  • Working with good data is a must: Shit-In, Shit-Out.

Basics of AI (ML)
#

The term “AI” is used everywhere these days, often incorrectly. A better term to use is machine learning (ML), which refers to the process used to create a system that can mimic the behaviour we see in real life.
The field is incredibly broad, but here are a couple of popular examples:

  • Genetic algorithm: ML structure that aims to mimic the process of natural selection and evolution. By using rounds of offspring and mutations based on the criteria provided, the structure aims to create the “strongest children” through “survival of the fittest”.
  • Particle swarm: ML structure that aims to mimic the process of how birds flock and group together at specific points. By creating a swarm of particles, the structure aims to move all the particles to the optimal answer’s grouping point.
  • Neural networks: ML structure that is by far the most popular and aims to mimic the process of how neurons work in the brain. These neurons receive various inputs that are then transformed before being sent to the next neuron. These neurons can then be “trained” to perform the correct transformations to provide the correct final answer.

There are many more ML structures, but we’ll stick with neural networks for this task.

Learning Styles
#

There are many different styles and subsets of styles; we’ll focus on two main styles for now:

  • Supervised learning: We guide the neural network to the answers we want it to provide. We ask the neural network to give us an answer and then provide it with feedback on how close it was to the correct answer. In this learning style, we need a dataset where we know the correct answers. This is called a labelled dataset, as we have a label for what the correct answer should be, given the input.
  • Unsupervised learning: We take a bit more of a hands-off approach and let the neural network do its own thing. The main goal is to have the neural network identify “interesting things”. Unsupervised learning is often used to let neural networks learn interesting features that humans can’t comprehend, which can then be used for classification.

In today’s task, we focus on supervised learning.

Basic Structure
#

A neural network consists of various nodes (neurons) that are connected to each other.
It has three main layers:

  • Input layer: The first layer of nodes in the neural network. Each node receives a single data input that is passed on to the hidden layer. The number of nodes in this layer always matches the network’s number of inputs (or data parameters). For example, if our network takes the toy’s length, width and height, there will be three nodes in the input layer.
  • Output layer: The last layer of nodes in the neural network. The nodes send the output from the network once it has been received from the hidden layer. The number of nodes in this layer always matches the network’s number of outputs. For example, if our network outputs whether or not the toy is defective, we will have one node in the output layer indicating defective or not defective.
  • Hidden layer: This layer of nodes is between the input and output layers. This can be several layers to create a deep neural network. This layer is where the neural network’s main action takes place. Each node within the neural network’s hidden layer receives multiple inputs from the nodes in the previous layer and will then transmit their answers to multiple nodes in the next layer.
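In scikit-learn, which we’ll use later in this task, the input and output layers are sized automatically from the data, and only the hidden layers are configured by hand. A minimal sketch:

#The input and output layer sizes are inferred from the training data;
#hidden_layer_sizes=(15, 2) creates two hidden layers of 15 and 2 nodes
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(15, 2))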

If we zoom in on one of the nodes inside the hidden layer, we see that, in essence, the node receives inputs from the nodes in the previous layer, adds them together, and sends the output on to the next layer of nodes.
There is a little bit more detail in this step that’s important to note:

  • Inputs are not directly added. They are multiplied by a weight value first. This helps the neural network decide which inputs should contribute more to the output than others.
  • The addition’s output is not directly transmitted out. The output is first entered into what is called an activation function. This decides if the neuron (node) will be active or not. It does this by ensuring that the output, no matter the input, will always be a decimal between 0 and 1 (or between -1 and 1).
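To make these two details concrete, here’s a minimal sketch (with made-up inputs and weights) of what a single node computes:

#A single node: multiply inputs by weights, add them up, apply the activation
import numpy as np

def sigmoid(x):
    #Squashes any value into a decimal between 0 and 1
    return 1 / (1 + np.exp(-x))

inputs = np.array([0.5, 0.8, 0.2])    #Outputs received from the previous layer
weights = np.array([0.9, -0.4, 0.3])  #How much each input should contribute

weighted_sum = np.dot(inputs, weights)
output = sigmoid(weighted_sum)        #This value is sent to the next layer
print(output)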

There are two steps to training the network: the feed-forward step and the back-propagation step:

Feed-Forward Loop
#

This is how we send data through the network and get an answer on the other side. Once our network is trained, this is the only step we perform. At this point, we stop training and simply want an answer. To complete one round of the feed-forward step:

  1. Normalise all of the inputs: Each node of the network tries to keep its answer between 0 and 1. If we have one input with a range of 0 to 50 and another with a range of 0 to 2, our network won’t be able to properly consume the input. We have to normalise the inputs by adjusting them so that their ranges are all the same. For example, we can take the inputs with a 0 to 50 range and divide all of them by 25 to change their range to 0 to 2.
  2. Feed the inputs to our nodes in the input layer: Once normalised, we can provide one data entry for each input node in our network.
  3. Propagate the data through the network: At each node, we add all the inputs and run them through the activation function to get the node’s output. This output becomes the input for the next layer of nodes. We repeat this until we reach our network’s output layer.
  4. Read the output from the network: The answer will be a decimal between 0 and 1, but, for decision-making, we round it to get a binary answer from each output node.
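Putting the four steps together, one feed-forward round through a tiny network with made-up (hypothetical) weights could look like this:

#One feed-forward round: normalise, feed in, propagate, read the output
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

#Step 1: normalise the inputs so all features share the same range
raw = np.array([40.0, 1.5, 0.8])              #One 0-50 feature, two 0-2 features
x = np.array([raw[0] / 25.0, raw[1], raw[2]])

#Steps 2 and 3: feed the inputs in and propagate them through the layers
W_hidden = np.array([[0.2, -0.5, 0.7],
                     [0.9, 0.1, -0.3]])       #3 inputs -> 2 hidden nodes
W_output = np.array([0.6, -0.8])              #2 hidden nodes -> 1 output
hidden = sigmoid(W_hidden @ x)
output = sigmoid(W_output @ hidden)

#Step 4: round the 0-1 output to get a binary answer
print(round(float(output)))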

Back-Propagation
#

When we’re training the network, the feed-forward loop is only half the process. When we receive the answer from our network, we need to tell it how close it was to the correct answer. We perform the following steps:

  1. Calculate the difference in received outputs vs expected outputs: The activation function will provide a decimal answer between 0 and 1. We can calculate the difference between this answer and the expected one, which tells us how close the neural network was to the correct answer.
  2. Update the weights of the nodes: Using the difference calculated in #1, we can start to update the weights of each input to the nodes in the output layer.
  3. Propagate the difference back to the other layers: Once the weights of the nodes in the output layer have been updated, we can calculate what the difference would be for the previous nodes. We continue this process until the weights for the input layer have been updated.

Once all the weights have been updated, we can run another sample of data through our network. We repeat this process with all our samples in order to train our network.
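As a heavily simplified sketch of the idea behind a single weight update (real back-propagation also multiplies in the derivative of the activation function, omitted here):

#One simplified weight update based on the prediction error
learning_rate = 0.1
expected = 1.0     #The label from our labelled dataset
predicted = 0.8    #The network's output from the feed-forward loop

#Step 1: calculate the difference between received and expected output
error = predicted - expected

#Step 2: nudge the weight proportionally to the error and the input it received
weight = 0.6
input_value = 0.4
weight -= learning_rate * error * input_value
print(weight)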

Dataset Splits
#

Let’s say your teacher constantly tells you that 1+1 = 2 and 2+2 = 4. But, in the exam, your teacher asks you to calculate 3+3:

Have you just learned what the answer is, or did you learn the fundamental principle required to get to the answer?

You can overtrain yourself by learning the answers instead of learning the required principle itself, and it’s the same with neural networks!

We are training the network with data where we know the answers, so it’s possible for the network to simply learn the answers and not how to calculate the answer. We need to validate that our neural network is learning the process and not the answers. We have to split our dataset into three datasets to combat this:

  • Training data: This is our largest dataset. We use it to train the network; it’s usually about 70-80% of the original dataset.
  • Validation data: After each training round, we send this data through our network to determine its performance. If the performance starts to decline, we know we’re starting to overtrain and should stop the process. It’s usually 10-15% of the original dataset.
  • Testing data: The network won’t see this data at all until we are done with the training process. Once training is complete, we send the testing dataset through to determine the performance of our network. It’s usually 10-15% of the original dataset.
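With scikit-learn, a 70/15/15 split can be sketched by calling train_test_split twice (data and labels are hypothetical names for the full dataset and its labels):

#Carve off 30% of the data, then split that holdout half-and-half,
#giving a 70/15/15 train/validation/test split
from sklearn.model_selection import train_test_split

train_X, holdout_X, train_y, holdout_y = train_test_split(data, labels, test_size=0.3)
validate_X, test_X, validate_y, test_y = train_test_split(holdout_X, holdout_y, test_size=0.5)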

Putting it all together
#

Time to build our own neural network! We will work with 3 files:

  • detector.py - The script to build our neural network (with some modifications of mine to the original from THM).
  • training_dataset.csv - In this dataset, the elves have not only captured the measurements of the toys but also whether the toy was defective or not. The dataset is used to train, validate and test the model.
  • testing_dataset.csv - In this dataset, the elves have only captured the measurements of the toys. Once we’ve trained our neural network, we will predict which of the entries in this file are defective toys.

Each of the files can be downloaded from my forgejo: git.eplots.io

#These are the imports that we need for our Neural Network
#Numpy is a powerful array and matrix library used to format our data
import numpy as np
#Pandas is a machine learning library that also allows for reading and formatting data structures
import pandas as pd
#This will be used to split our data
from sklearn.model_selection import train_test_split
#This is used to normalize our data
from sklearn.preprocessing import StandardScaler
#This is used to encode our text data to integers
from sklearn.preprocessing import LabelEncoder
#This is our Multi-Layer Perceptron Neural Network
from sklearn.neural_network import MLPClassifier

#These are the colour labels in the dataset (the LabelEncoder below does the actual string-to-int conversion)
colours = ["Red", "Blue", "Green", "Yellow", "Pink", "Purple", "Orange"]


#Read the training and testing data files
training_data = pd.read_csv("training_dataset.csv")
training_data.head()

testing_data = pd.read_csv("testing_dataset.csv")
testing_data.head()

#The Neural Network cannot take Strings as input, therefore we will encode the strings as integers
encoder = LabelEncoder()
encoder.fit(training_data["Colour Scheme"])
training_data["Colour Scheme"] = encoder.transform(training_data["Colour Scheme"])
testing_data["Colour Scheme"] = encoder.transform(testing_data["Colour Scheme"])

#Read the data we will train on
X = np.asanyarray(training_data[['Height','Width','Length','Colour Scheme','Maker Elf ID','Checker Elf ID']])
#Read the labels of our training data
y = np.asanyarray(training_data['Defective'].astype('int'))

#Read our testing data
test_X = np.asanyarray(testing_data[['Height','Width','Length','Colour Scheme','Maker Elf ID','Checker Elf ID']])

#This will split our training dataset into two with an 80/20 split
train_X, validate_X, train_y, validate_y = train_test_split(X, y, test_size=0.2)

print ("Sample of our data:")
print("Features:\n{}\nDefective?:\n{}".format(train_X[:3], train_y[:3]))

#Normalize our dataset
scaler = StandardScaler()
scaler.fit(train_X)

train_X = scaler.transform(train_X)
validate_X = scaler.transform(validate_X)
test_X = scaler.transform(test_X)

print ("Sampe of our data after normalization:")
print("Features:\n{}\nDefective?:\n{}".format(train_X[:3], train_y[:3]))

#Create our classifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15, 2), max_iter=10000)

print ("Starting to training our Neural Network")

#Train our classifier
clf.fit(train_X, train_y)

#Validate our Neural Network
y_predicted = clf.predict(validate_X)

#This loop tests how well your Neural Network performs on the validation dataset
count_correct = 0
count_incorrect = 0
for x in range(len(y_predicted)):

    if (y_predicted[x] == validate_y[x]):
        count_correct += 1
    else:
        count_incorrect += 1

print ("Training has been completed, validating neural network now....")
print ("Total Correct:\t\t" + str(count_correct))
print ("Total Incorrect:\t" + str(count_incorrect))

accuracy =  ((count_correct * 1.0) / (1.0 * (count_correct + count_incorrect)))

print ("Network Accuracy:\t" + str(accuracy * 100) + "%")

print ("Now we will predict the testing dataset for which we don't have the answers for...")

#Make prediction on the testing data that was not labelled by the elves
y_test_predictions = clf.predict(test_X)

#Save your predictions to a text file that can be uploaded for scoring
print ("Saving predictions to a file")

output = open("predictions.txt", 'w')

for value in y_test_predictions:
    output.write(str(value) + "\n")

print ("Predictions are saved, this file can now be uploaded to verify your Neural Network")
output.close()

We then save the predictions to a file (predictions.txt) and upload it to the scoring site. After one run, I got 91.43% accuracy and therefore got a flag.

This raises the question of why the accuracy can fluctuate. The reason is that neural networks have randomness built into them: the weights for each of the inputs of the nodes are randomised at the start, which means that two neural networks are never exactly the same.
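If you want reproducible runs, both sources of randomness in the script can be pinned with scikit-learn’s random_state parameter:

#Fixing the seeds makes the split and the initial weights deterministic
train_X, validate_X, train_y, validate_y = train_test_split(X, y, test_size=0.2, random_state=42)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15, 2), max_iter=10000, random_state=42)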

Another factor is the quality of the dataset. Shit-in, shit-out…

CyberSec applications for ML
#

  • ML structures are incredible at finding complex patterns in data and performing predictions on large datasets with incredible accuracy. This can be used for classifications that are complex, such as whether network traffic is malicious or not.
  • ML structures are incredibly good at anomaly detection. This can be used in security to detect anomalies such as unauthorised account logins.
  • ML structures have the ability to learn complex patterns and can be used for authentication applications such as biometric authentication. They can be used to predict whether a person’s fingerprint or iris matches the template that has been stored to provide access to buildings or devices.

CyberSec cautions for ML
#

  • ML is inherently imperfect. The answers given by a neural network are called “predictions” for a very good reason: they are just that, predictions. It’s impossible for 100% of the predictions to be correct. We should remember that AI isn’t the silver bullet for all problems!
  • The power that allows ML to be used for defence means that it can also be used for offence.

Questions
#

  1. What is the other term given for Artificial Intelligence or the subset of AI meant to teach computers how humans think or nature works?

The entire room is focused on this, not AI but…

  2. What ML structure aims to mimic the process of natural selection and evolution?

Answer found in the Basics of AI (ML) section.

  3. What is the name of the learning style that makes use of labelled data to train an ML structure?

Answer found in the Learning Styles section.

  4. What is the name of the layer between the Input and Output layers of a Neural Network?

Answer found in the Basic Structure section.

  5. What is the name of the process used to provide feedback to the Neural Network on how close its prediction was?

The feed-forward loop is going forward. The answer is the reverse…

  6. What is the value of the flag you received after achieving more than 90% accuracy on your submitted predictions?

Run the detector.py code with the CSV files and upload the resulting predictions.txt to the predictions website found in the room task. You get the flag when it’s finished.
