Hi @mlanvin, @GintsEngelen, I am working through the causes of the single-packet flows that produce the NaN values, to see whether we can reduce their number. In some cases they can be reduced; in other cases they are genuinely single-packet flows. I have considered your suggestion, and would like to counter-propose and get some additional comments. I think that, to mirror reality a bit more closely, 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max' and 'Flow IAT Min' can be set to 0, with no adverse implications. For 'Flow Bytes/s', there are some options:
If you have any thoughts on the above, they would be greatly appreciated! |
|
For me option 2 (set to 0) seems the most logical - I hadn't considered the fact that -1 might pose issues when using Neural Nets. |
|
Hi @mlanvin Thanks again for your findings! Indeed we confirmed that there are some TCP connections where the SYN-ACK packet seemingly arrives before the SYN packet in the PCAP file (due to being erroneously recorded), and given that the CICFlowMeter determines the flow direction based on the direction of the first packet, this causes these particular flows to go in the direction that is opposite of the TCP connection (sender becomes receiver and vice-versa). This is very much an artefact of this dataset, and my fear is that if we simply include this additional labelling logic without addressing the core issue, we end up with these malicious reversed flows that are correctly labelled, but which still flow in reverse. For a Machine Learning system, I fear that this would cause confusion, given that for all of those reversed flows, the Fwd and Bwd bytes are in fact all swapped, and are not in line with the normal, non-reversed flows. In the end I decided to address this issue in the CICFlowMeter in this commit, so that we don't have to deal with reversed flows during the labelling step at all. For this Pull Request, Fix 1 will thus not be required anymore. |
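To illustrate the reversal described above, here is a minimal Python sketch. It is not the CICFlowMeter code (which is written in Java) and it does not claim to reproduce the commit. The `Packet` type and both functions are hypothetical, the addresses are illustrative, and the SYN-ACK check is only one possible heuristic for spotting a misordered handshake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str   # source IP address
    dst: str   # destination IP address
    syn: bool  # TCP SYN flag
    ack: bool  # TCP ACK flag


def naive_forward_src(packets):
    """Take the source of the first recorded packet as the flow initiator."""
    return packets[0].src


def corrected_forward_src(packets):
    """Assumed heuristic: a SYN-ACK seen first means the real initiator is its destination."""
    first = packets[0]
    if first.syn and first.ack:
        return first.dst
    return first.src


# Handshake recorded out of order: the server's SYN-ACK appears before the client's SYN.
client, server = "192.168.10.8", "192.168.10.3"
misordered = [
    Packet(server, client, syn=True, ack=True),
    Packet(client, server, syn=True, ack=False),
]

print(naive_forward_src(misordered))      # the server is wrongly treated as the initiator
print(corrected_forward_src(misordered))  # the client is recovered as the initiator
```

With the naive rule, every Fwd/Bwd feature of such a flow is swapped relative to a correctly recorded flow, which is the inconsistency discussed above.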
For the choice between the three options, I think it would be better to use a value outside the normal range of the data. If we set 0, it would mean that no data has been exchanged, which might be false; and if we use the 'Total Number of Bytes', flows with a single packet and flows with two packets could look similar with respect to 'Flow Bytes/s', even though they are not. |
I agree with you that it would be better to handle it somewhere other than in the labelling script; however, it is very difficult to do so. |
Fix 1:
In [1], we discovered that many packets were not recorded in the right order in the CICIDS2017 dataset. This often occurs during massive attacks such as DoS. The misordering inverts the flow description, so the source and destination data are swapped and the direction of a flow is not guaranteed. When labelling, we therefore cannot be sure that the source and destination addresses/ports are not inverted. The solution we chose was to label the network flows using both possibilities. For instance, if the labelling filter is (192.168.10.3, 88, 192.168.10.8, 49173, 6, 06/07/2017 14:48:12) as (ip_src, port_src, ip_dst, port_dst, protocol, timestamp), we advise also labelling the flows matching the filter (192.168.10.8, 49173, 192.168.10.3, 88, 6, 06/07/2017 14:48:12), so that no malicious flow is missed because of the flow description inversion.
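A minimal sketch of the two-direction match, assuming flows and filters are plain tuples in the (ip_src, port_src, ip_dst, port_dst, protocol, timestamp) order given above. The function names are hypothetical and are not taken from the labelling scripts.

```python
def reverse_filter(label_filter):
    """Swap the source and destination endpoints; keep protocol and timestamp."""
    ip_src, port_src, ip_dst, port_dst, protocol, timestamp = label_filter
    return (ip_dst, port_dst, ip_src, port_src, protocol, timestamp)


def matches_either_direction(flow, label_filter):
    """True if the flow matches the filter as written or with its endpoints inverted."""
    return flow == label_filter or flow == reverse_filter(label_filter)


label_filter = ("192.168.10.3", 88, "192.168.10.8", 49173, 6, "06/07/2017 14:48:12")
inverted_flow = ("192.168.10.8", 49173, "192.168.10.3", 88, 6, "06/07/2017 14:48:12")

print(matches_either_direction(inverted_flow, label_filter))  # True
```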
Fix 2:
In addition, the labelling scripts had an unintended behaviour. One step retypes the columns of the CSV produced by CICFlowMeter, and before this step the lines containing NaN values were dropped. This deletion should be removed, because it also dropped valid flows, which were then missing from the labelling step.
We observed that almost half of the port-scan-related flows were not labelled as such because of this unintended drop.
More specifically, the NaN values occur for the following network features: 'Flow Bytes/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min'.
These network features are legitimately empty when a network flow includes a single packet (port scan, for example). Then, the Inter Arrival Time (IAT) statistics are not defined; however, the flow should be kept.
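The following sketch shows why these statistics are undefined for a single-packet flow: with one timestamp there are no inter-arrival gaps and the flow duration is zero. The function names are hypothetical, timestamps are assumed to be in seconds, and the sketch simply returns NaN for the undefined cases; it does not reproduce how CICFlowMeter computes the features internally.

```python
import math
from statistics import mean


def inter_arrival_times(timestamps):
    """Gaps between consecutive packet timestamps; empty for a single packet."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def flow_iat_mean(timestamps):
    """Mean inter-arrival time, or NaN when there are no gaps to average."""
    gaps = inter_arrival_times(timestamps)
    return mean(gaps) if gaps else math.nan


def flow_bytes_per_second(total_bytes, timestamps):
    """Bytes divided by flow duration, or NaN when the duration is zero."""
    duration = timestamps[-1] - timestamps[0]
    return total_bytes / duration if duration > 0 else math.nan


print(flow_iat_mean([0.0, 1.0, 3.0]))       # three packets: defined
print(flow_iat_mean([10.0]))                # single packet: nan
print(flow_bytes_per_second(60, [10.0]))    # single packet: nan
```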
I set these features to -1 to avoid NaN values and to allow the column retyping step to run. This value has no physical meaning for these features, so it clearly indicates that the original value was missing.
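As a rough illustration of keeping the flow instead of dropping it, the sketch below fills only the five affected features with a sentinel. It uses plain dictionaries from the standard library, which is an assumption made for self-containment; the actual labelling scripts may hold the data differently. `fill_missing` and `NAN_FEATURES` are hypothetical names.

```python
import math

# The five features listed above as producing NaN values.
NAN_FEATURES = [
    "Flow Bytes/s",
    "Flow IAT Mean",
    "Flow IAT Std",
    "Flow IAT Max",
    "Flow IAT Min",
]


def fill_missing(row, sentinel=-1.0):
    """Return a copy of the row with NaN in the affected features replaced by the sentinel."""
    filled = dict(row)
    for name in NAN_FEATURES:
        value = filled.get(name)
        if isinstance(value, float) and math.isnan(value):
            filled[name] = sentinel
    return filled


single_packet_flow = {
    "Flow Bytes/s": math.nan,
    "Flow IAT Mean": math.nan,
    "Flow IAT Std": math.nan,
    "Flow IAT Max": math.nan,
    "Flow IAT Min": math.nan,
    "Total Fwd Packets": 1.0,  # illustrative extra column, left untouched
}

print(fill_missing(single_packet_flow))
```

Passing `sentinel=0.0` would correspond to the alternative discussed in the conversation of setting the IAT features to 0.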
This pull request includes the following:
References:
[1] Lanvin, M., Gimenez, PF., Han, Y., Majorczyk, F., Mé, L., Totel, É. (2023). Errors in the CICIDS2017 Dataset and the Significant Differences in Detection Performances It Makes. In: Kallel, S., Jmaiel, M., Zulkernine, M., Hadj Kacem, A., Cuppens, F., Cuppens, N. (eds) Risks and Security of Internet and Systems. CRiSIS 2022. Lecture Notes in Computer Science, vol 13857. Springer, Cham. https://doi.org/10.1007/978-3-031-31108-6_2