Fixes for labelling scripts#3

Open
mlanvin wants to merge 1 commit into GintsEngelen:main from mlanvin:main
Conversation

@mlanvin
@mlanvin mlanvin commented Aug 23, 2023

Fix 1:

In [1], we discovered that many packets in the CICIDS2017 dataset were not recorded in the right order. This often occurs during high-volume attacks such as DoS. The result is an inversion of the flow description: the source and destination data are swapped. There is therefore no guarantee about the direction of a flow, and at labelling time we cannot be sure that the source and destination addresses/ports are not inverted. The solution we chose is to label the network flows using both possibilities. For instance, if the labelling filter is (192.168.10.3, 88, 192.168.10.8, 49173, 6, 06/07/2017 14:48:12), given as (ip_src, port_src, ip_dst, port_dst, protocol, timestamp), we advise also labelling the flows that match the filter (192.168.10.8, 49173, 192.168.10.3, 88, 6, 06/07/2017 14:48:12), so that no malicious flow is missed because of the flow description inversion.
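
As a rough illustration of matching in both directions, here is a minimal sketch. The names `FlowKey`, `reversed_key` and `matches_either_direction` are hypothetical and are not taken from the labelling scripts, and the exact timestamp comparison is a simplification made for this example.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical 6-tuple describing a flow or a labelling filter."""
    ip_src: str
    port_src: int
    ip_dst: str
    port_dst: int
    protocol: int
    timestamp: str


def reversed_key(key: FlowKey) -> FlowKey:
    """Swap source and destination; protocol and timestamp are unchanged."""
    return key._replace(
        ip_src=key.ip_dst,
        port_src=key.port_dst,
        ip_dst=key.ip_src,
        port_dst=key.port_src,
    )


def matches_either_direction(flow: FlowKey, label_filter: FlowKey) -> bool:
    """True if the flow matches the filter as written or with endpoints inverted."""
    return flow == label_filter or flow == reversed_key(label_filter)


label_filter = FlowKey("192.168.10.3", 88, "192.168.10.8", 49173, 6, "06/07/2017 14:48:12")
inverted_flow = FlowKey("192.168.10.8", 49173, "192.168.10.3", 88, 6, "06/07/2017 14:48:12")
print(matches_either_direction(inverted_flow, label_filter))
```

With this kind of check, a flow recorded in the inverted direction is still picked up by the original filter.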


Fix 2:

In addition, the labelling scripts had an unintended behaviour. One step retypes the columns of the CSV produced by CICFlowMeter, and before this step the lines containing NaN values were dropped. This deletion should be removed: it also dropped valid flows, which were then missing at the labelling step.

We observed that almost half of the port scan-related flows were not labelled as such because of this unintended drop.

More specifically, the NaN values occur for the following network features: 'Flow Bytes/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min'.
These network features are legitimately empty when a network flow contains a single packet (a port scan, for example). In that case the Inter Arrival Time (IAT) statistics are not defined, but the flow should still be kept.

I set these features to -1 to avoid NaN values and to allow the column type redefinition step. This value cannot occur in real data, so it clearly indicates that the value was missing.
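
The idea can be sketched as follows. This is only an illustration under assumptions: rows are represented as plain dictionaries, and `fill_undefined_features` and `MISSING` are hypothetical names rather than the code in this pull request.

```python
import math

# Features that are undefined for single-packet flows (listed above).
UNDEFINED_FOR_SINGLE_PACKET = [
    "Flow Bytes/s",
    "Flow IAT Mean",
    "Flow IAT Std",
    "Flow IAT Max",
    "Flow IAT Min",
]
MISSING = -1.0  # sentinel outside the valid range of these features


def is_missing(value) -> bool:
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_undefined_features(row: dict) -> dict:
    """Return a copy of the row with NaN features replaced by the sentinel.

    The row is kept rather than dropped, so it is still available for labelling.
    """
    filled = dict(row)
    for name in UNDEFINED_FOR_SINGLE_PACKET:
        if name in filled and is_missing(filled[name]):
            filled[name] = MISSING
    return filled


rows = [
    {"Flow Bytes/s": float("nan"), "Flow IAT Mean": float("nan"), "Total Fwd Packets": 1},
    {"Flow Bytes/s": 1200.0, "Flow IAT Mean": 0.5, "Total Fwd Packets": 4},
]
cleaned = [fill_undefined_features(r) for r in rows]
print(len(cleaned) == len(rows))  # no flow is lost
```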


This pull request includes the following:

  • Fix 1 - handling of the flow description inversion (which comes from the incoherent timestamp issue described in [1]) in the labelling scripts for CICIDS2017 and CSE-CIC-IDS2018.
  • Fix 2 - removal of the step that deleted the lines containing NaN values (which affected the labels) in the labelling scripts for CICIDS2017 and CSE-CIC-IDS2018.

References:

[1] Lanvin, M., Gimenez, PF., Han, Y., Majorczyk, F., Mé, L., Totel, É. (2023). Errors in the CICIDS2017 Dataset and the Significant Differences in Detection Performances It Makes. In: Kallel, S., Jmaiel, M., Zulkernine, M., Hadj Kacem, A., Cuppens, F., Cuppens, N. (eds) Risks and Security of Internet and Systems. CRiSIS 2022. Lecture Notes in Computer Science, vol 13857. Springer, Cham. https://doi.org/10.1007/978-3-031-31108-6_2

@lisa-lthorrold
Hi @mlanvin, @GintsEngelen,

I am working through the causes of the single-packet flows that produce the NaN values, and am trying to see if we can reduce them. In some cases they can be reduced; in other cases they are genuinely single-packet flows. I have considered your suggestion and would like to counter-propose and get some additional comments. I think the following mirrors reality a bit more closely:

'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min' can be set to 0, with no adverse implications. For 'Flow Bytes/s', there are some options:

  1. Set to -1 (your suggestion). This could be problematic, since feeding negative numbers into neural networks is not optimal without further transformation of the column, and it can introduce inconsistencies because it is one more step that different researchers will need to decide how to handle.

  2. Set to 0. I think this probably aligns most closely with reality, since the concept of flow rate is not defined for a single-packet flow, and it is also consistent with 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min' if those are set to 0 for single-packet flows.

  3. Set to ‘Total Number of Bytes’ if it is a single-packet flow (i.e. assume a total duration of 1 second).
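
To make the three options concrete, here is a small sketch. It assumes a single-packet flow can be recognised by a duration of zero, and the function name and `option` parameter are hypothetical, chosen only for this illustration.

```python
def flow_bytes_per_second(total_bytes: int, duration_s: float, option: int) -> float:
    """Compute 'Flow Bytes/s', falling back to one of the three options
    when the rate is undefined (duration of zero, as assumed here for
    single-packet flows)."""
    if duration_s > 0:
        return total_bytes / duration_s
    if option == 1:
        return -1.0  # option 1: sentinel outside the valid range
    if option == 2:
        return 0.0  # option 2: treat the undefined rate as zero
    if option == 3:
        return float(total_bytes)  # option 3: assume a duration of 1 second
    raise ValueError(f"unknown option: {option}")


# A single 60-byte packet under each option.
for option in (1, 2, 3):
    print(option, flow_bytes_per_second(60, 0.0, option))
```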

If you have any thoughts on the above, that would be greatly appreciated!

@GintsEngelen
Owner

Hi @lisa-lthorrold

For me option 2 (set to 0) seems the most logical - I hadn't considered the fact that -1 might pose issues when using Neural Nets.

@GintsEngelen
Owner

Hi @mlanvin

Thanks again for your findings! Indeed we confirmed that there are some TCP connections where the SYN-ACK packet seemingly arrives before the SYN packet in the PCAP file (due to being erroneously recorded), and given that the CICFlowMeter determines the flow direction based on the direction of the first packet, this causes these particular flows to go in the direction that is opposite of the TCP connection (sender becomes receiver and vice-versa).
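
The effect of the recording order can be illustrated with a toy model. This is not CICFlowMeter code: `Packet` and `flow_direction` are hypothetical names, and the only behaviour modelled is the one described above, that the first recorded packet sets the flow direction.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    src: str
    dst: str
    flags: str


def flow_direction(packets: List[Packet]) -> Tuple[str, str]:
    """Direction (source, destination) as set by the first recorded packet."""
    first = packets[0]
    return first.src, first.dst


client, server = "192.168.10.8", "192.168.10.3"
syn = Packet(client, server, "SYN")
syn_ack = Packet(server, client, "SYN-ACK")

# Correct recording order: the flow goes from client to server.
print(flow_direction([syn, syn_ack]))
# Misrecorded order: the same connection appears to go from server to client.
print(flow_direction([syn_ack, syn]))
```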

This is very much an artefact of this dataset, and my fear is that if we simply include this additional labelling logic without addressing the core issue, we end up with these malicious reversed flows that are correctly labelled, but which still flow in reverse. For a Machine Learning system, I fear that this would cause confusion, given that for all of those reversed flows, the Fwd and Bwd bytes are in fact all swapped, and are not in line with the normal, non-reversed flows.

In the end I decided to address this issue in the CICFlowMeter in this commit, so that we don't have to deal with reversed flows during the labelling step at all. For this Pull Request, Fix 1 will thus not be required anymore.

@mlanvin
Author

mlanvin commented Sep 18, 2023

> Hi @mlanvin, @GintsEngelen,
>
> I am working through the causes of the single-packet flows that produce the NaN values, and am trying to see if we can reduce them. In some cases they can be reduced; in other cases they are genuinely single-packet flows. I have considered your suggestion and would like to counter-propose and get some additional comments. I think the following mirrors reality a bit more closely:
>
> 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min' can be set to 0, with no adverse implications. For 'Flow Bytes/s', there are some options:
>
> 1. Set to -1 (your suggestion). This could be problematic, since feeding negative numbers into neural networks is not optimal without further transformation of the column, and it can introduce inconsistencies because it is one more step that different researchers will need to decide how to handle.
>
> 2. Set to 0. I think this probably aligns most closely with reality, since the concept of flow rate is not defined for a single-packet flow, and it is also consistent with 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min' if those are set to 0 for single-packet flows.
>
> 3. Set to ‘Total Number of Bytes’ if it is a single-packet flow (i.e. assume a total duration of 1 second).
>
> If you have any thoughts on the above, that would be greatly appreciated!

Hi @lisa-lthorrold

For the choice between the three options, I think it would be better to use a value outside the normal range of the data. If we set 0, it would mean that no data has been exchanged, which might be false. If we use the ‘Total Number of Bytes’, flows with a single packet and flows with two packets could look similar in terms of 'Flow Bytes/s' when they are actually different.
In addition, I believe it would not be good practice to apply a neural network directly to the raw data without preprocessing. We can expect ML practitioners to perform data transformation or data encoding before using any model.

@mlanvin
Author

mlanvin commented Sep 18, 2023

> Hi @mlanvin
>
> Thanks again for your findings! Indeed we confirmed that there are some TCP connections where the SYN-ACK packet seemingly arrives before the SYN packet in the PCAP file (due to being erroneously recorded), and given that the CICFlowMeter determines the flow direction based on the direction of the first packet, this causes these particular flows to go in the direction that is opposite of the TCP connection (sender becomes receiver and vice-versa).
>
> This is very much an artefact of this dataset, and my fear is that if we simply include this additional labelling logic without addressing the core issue, we end up with these malicious reversed flows that are correctly labelled, but which still flow in reverse. For a Machine Learning system, I fear that this would cause confusion, given that for all of those reversed flows, the Fwd and Bwd bytes are in fact all swapped, and are not in line with the normal, non-reversed flows.
>
> In the end I decided to address this issue in the CICFlowMeter in this commit, so that we don't have to deal with reversed flows during the labelling step at all. For this Pull Request, Fix 1 will thus not be required anymore.

Hi @GintsEngelen

I agree that it would be better to handle this somewhere other than in the labelling script; however, it is very difficult to do so.
I read your commit and I think it does not fix the issue in the case of a port scan, for instance. An attacker can deliberately send a SYN/ACK as the first packet, in which case the direction of the flow is correct and should not be reversed. If the first packet of a flow is a SYN/ACK, this is not always caused by the packet recording order; it can also be the behaviour of an attacker (port scan example: https://nmap.org/book/idlescan.html). Labelling the flows in both directions therefore seems necessary to avoid these issues.
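
A toy sketch of this concern follows. The swap rule below is a deliberate simplification ("reverse the flow whenever the first packet is a SYN/ACK") and is not a reproduction of the commit; all names and addresses are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    src: str
    dst: str
    flags: str


def direction_with_synack_swap(packets: List[Packet]) -> Tuple[str, str]:
    """Simplified heuristic: if the first packet is a SYN/ACK, assume the
    recording order was wrong and reverse the flow direction."""
    first = packets[0]
    if first.flags == "SYN-ACK":
        return first.dst, first.src
    return first.src, first.dst


def matches_in_either_direction(direction: Tuple[str, str], a: str, b: str) -> bool:
    """Label filter that accepts both (a, b) and (b, a)."""
    return direction in {(a, b), (b, a)}


attacker, zombie = "172.16.0.1", "192.168.10.50"
# An attacker probing a zombie host with an unsolicited SYN/ACK:
probe = [Packet(attacker, zombie, "SYN-ACK"), Packet(zombie, attacker, "RST")]

direction = direction_with_synack_swap(probe)
print(direction == (attacker, zombie))  # False: the heuristic reversed a correct flow
print(matches_in_either_direction(direction, attacker, zombie))  # True: still labelled
```

In this sketch, a filter written only as (attacker, zombie) would miss the flow after the swap, while matching in both directions still catches it.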

3 participants