
Our century has been dubbed the information age, in which data has become both the most popular product and the most powerful lever of control. Those with access to insider information about a company have the opportunity to make profitable investments. If records are stolen and end up on the darknet, it may be bad news for the leaders of such a business. Therefore, data protection and software security are among the top concerns companies must deal with.

Some statistics

According to global statistics, 7.9 billion records were stolen within the first nine months of 2019, up 33% from the same period in 2018. The main problem is that it takes most companies approximately 6 months to understand that their systems have been breached, so the statistics lag about half a year behind reality. Researchers predict that hackers will steal about 33 billion records by 2023, and that the annual cost of cybercrime will rise to $6 trillion. This cost includes loss and corruption of records, stolen money, theft of intellectual property and confidential documents, obstacles to normal business functioning, damage to reputation and so on.

The challenge is compounded by the fact that criminals are constantly one step ahead. In practical terms, cybersecurity services discover what has already been hacked and try to mitigate the negative consequences. Preventive measures are not yet as effective: while security teams are busy finding and closing potential holes based on previous experience, cyber criminals are inventing new loopholes. And this is where data science can help us.

What is data science?

What is data science, and why is it among the most promising technologies? Data science is a branch of computer science that studies the problems of analysis, processing and presentation of data in digital form. It helps to detect patterns and gain valuable insights from both structured and unstructured records. Based on the results of this analysis, a data analyst can draw conclusions and decide on next steps. The larger the input, the more patterns and correlations can be found. That’s why data science goes hand in hand with big data. To make the process efficient, data specialists use machine learning (ML) algorithms. A subset of ML, deep learning (DL), provides the ability to recognize patterns of patterns and makes the system self-learning.

Data science has quickly found many applications in the real world: natural language processing, chatbots, recommendation engines and image recognition, for example. The last, image recognition, has become especially relevant for safety during the coronavirus pandemic: computer vision copes better with the task of recognizing people wearing face masks than human beings do. However, the benefits of using data science for cybersecurity are not so obvious. Let’s consider why.

Cyber crimes vs security

To hack a system, cyber criminals use a wide set of specific tools and programming techniques, such as:   

  • Various kinds of malware (short for “malicious software”) or viruses. If such software gets into the infrastructure, the intruder will have the ability to enter the system and steal company records.  
  • Rootkit. This is special low-level software that provides criminals with administrator privileges.
  • A backdoor, or some hole in the authentication mechanism, that allows a hacker to gain access to the system by bypassing the security perimeter.
  • Bots (and botnets) that can run an abnormal number of tasks. Bots typically are used for denial of service (DoS) and distributed denial of service (DDoS) attacks.
  • Various sets of scanners and sniffers, which allow for detecting weaknesses in software and networks and further exploiting the found vulnerabilities. 
  • Login attacks and account takeover (ATO), with the help of rainbow tables or social engineering approaches.
  • Phishing (or masquerading), in which a malicious intermediary service poses as the real one.

Many companies still use a traditional set of methods to protect their infrastructure, such as anti-virus software, firewalls, monitoring tools, network traffic encryption and physical placement of records in “reliable” storage. However, as real-world experience shows, absolutely reliable repositories do not exist, and cracking encrypted data is only a matter of time.

Data science as an assistant in cybersecurity activities

Before the data science era, all solutions related to software and network security were based on the assumptions and subjective experience of specific specialists. As a result, such decisions were not always correct, since they were based on fear, uncertainty, and doubt (often shortened to FUD).

Data science, coupled with machine learning methods, made it feasible to move away from subjectivity, to use global expertise in this field and, as a result, reduce the likelihood of someone hacking into the company infrastructure. This becomes possible because a large amount of accumulated historical data (system and network logs) can be processed rather quickly by self-learning analytical software. Such analysis can also be supplemented with predictive data, which in turn can be generated with the help of ML algorithms. In addition, engineers can test the system for stability and security using various subsets of records that simulate hacking. These test records can also be generated automatically.

The main advantage of data analysis using ML algorithms is that it results in impartial conclusions. A machine-trained solution can reveal both the problems people generally knew about but forgot to blacklist and anomalies in data that no one had thought about so far, but which represent potential vulnerabilities in the future.

Where data science can benefit

According to Gartner’s PPDR model, all security activities can be categorized as prediction, prevention, detection or response. Data science may be used to detect a cyberattack earlier, or even predict it, with the help of the following approaches:

1. Network traffic analysis

Nowadays, hackers are able to use standardized communication protocols to breach not only software systems but also networked smart hardware devices. ML algorithms can analyze all events in the network and classify them as either malicious or legitimate. For instance:

  • Are transferred files “clean” or could they contain malware?
  • Are inbound requests formed properly, or do they have some unusual parts?
  • Are outbound requests allowed, or are they possibly sent to a suspicious endpoint?
  • Is this the normal (allowed) number of requests per second / minute from the same IP?
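
The last question in the list is the easiest one to illustrate. Below is a minimal sketch, not a production detector: it uses a fixed threshold rather than a trained model, and the function name `find_noisy_ips`, its parameters and the sample traffic are all assumptions made for illustration. In a real pipeline, per-IP counts like these would more likely serve as input features for an ML classifier.

```python
from collections import defaultdict, deque

def find_noisy_ips(events, window_seconds=60, max_requests=100):
    """Return the set of IPs that exceed max_requests inside a sliding window.

    events: iterable of (timestamp_seconds, ip) pairs, sorted by timestamp.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in events:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged

# Hypothetical traffic: one client sends 5 requests in 4 seconds, another sends 2.
events = [(0, "10.0.0.1"), (1, "10.0.0.1"), (1, "10.0.0.2"), (2, "10.0.0.1"),
          (3, "10.0.0.1"), (4, "10.0.0.1"), (50, "10.0.0.2")]
print(find_noisy_ips(events, window_seconds=10, max_requests=3))
```

With these made-up limits (more than 3 requests within 10 seconds), only the first client is flagged.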

The machine-trained solution is able to detect automatically:

  • Suspicious logins under the same account from different regions and different devices
  • Attempts to gain information without proper authorization
  • Spam that may represent a security risk, since such emails can contain phishing links and invitations to various surveys where users are prompted to provide their personal data

All this gathered information, along with the conclusions drawn from it, can be used to train antiviruses and to configure firewalls, iptables rules and other security tools more precisely.
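
To make the first detection case more concrete, here is a small sketch of how logins under the same account from different regions might be spotted. It is a hand-written rule, assuming login events arrive as (timestamp, account, region) tuples. The function name `find_region_hopping` and the one-hour limit are hypothetical, and a trained model would typically weigh many more signals (device, time of day, travel distance).

```python
def find_region_hopping(logins, max_gap_seconds=3600):
    """Return accounts seen in two different regions within max_gap_seconds.

    logins: iterable of (timestamp_seconds, account, region), sorted by timestamp.
    """
    last_seen = {}  # account -> (timestamp, region) of the previous login
    suspicious = set()
    for ts, account, region in logins:
        prev = last_seen.get(account)
        if prev is not None:
            prev_ts, prev_region = prev
            # A region change shortly after the previous login is suspicious.
            if region != prev_region and ts - prev_ts <= max_gap_seconds:
                suspicious.add(account)
        last_seen[account] = (ts, region)
    return suspicious

# Hypothetical events: alice changes region within 15 minutes, bob within a day.
logins = [(0, "alice", "DE"), (600, "bob", "US"),
          (900, "alice", "BR"), (90000, "bob", "CA")]
print(find_region_hopping(logins))
```

Here only alice is reported, because bob's second login comes about a day after the first.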

2. Malware and hacker activity analysis

To build a strong defense, you need to know your enemy. Antivirus developers collect a lot of information about malware functionality, purpose and potential impact on the target system. Leading security specialists study the darknet and monitor well-known hacker communities, and some big corporations even hire hackers to test their own infrastructure for vulnerabilities. The result of all these activities is, again, data. Machine learning approaches can increase the processing speed of such input by automating some operations. Moreover, deep learning algorithms are able to detect potential holes in the target systems that are not obvious even to the malware creator.

3. Anomaly detection

Even if the company has an active cybersecurity department, on average it takes about 2 weeks to discover the system was breached. However, machine learning algorithms are able to help detect the intruder earlier. Each cyberattack leaves a specific trace behind, such as unexpected events or anomalies. Continuous analysis of system logs can discover attackers at a very early stage. 
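
As a rough illustration of what "continuous analysis of system logs" can mean, the sketch below flags hours in which an event count (say, failed logins) deviates strongly from the typical level. It uses a simple robust statistic, the modified z-score based on the median, rather than a trained ML model. The function name `find_anomalies`, the threshold of 3.5 and the sample counts are assumptions chosen for the example.

```python
from statistics import median

def find_anomalies(counts, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold."""
    med = median(counts)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(c - med) for c in counts])
    if mad == 0:
        # No spread at all: treat anything different from the median as anomalous.
        return [i for i, c in enumerate(counts) if c != med]
    return [i for i, c in enumerate(counts)
            if 0.6745 * abs(c - med) / mad > threshold]

# Hypothetical failed-login counts per hour, taken from a system log.
failed_logins = [3, 5, 4, 6, 5, 4, 48, 5]
print(find_anomalies(failed_logins))
```

Only the hour with 48 failed logins (index 6) stands out. The median is used instead of the mean so that a single burst does not distort the baseline it is compared against.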

4. Behavioral analysis

When an attacker has gained access to the software in the “usual” way (for example, by somehow learning the username and password), the violation can still be detected through the account’s unusual behavior. User and Entity Behavior Analytics (UEBA) methods are applied to detect anomalies in data and user flows. This approach utilizes the same kinds of algorithms as recommendation engines, such as those that suggest to a Netflix or YouTube subscriber what to watch next. But whereas a recommendation system looks for regularly repeated user actions, in the case of a hack, attention is paid to deviations from the usual course of action. A good example of behavioral analysis is the automatic fraud detection implemented by some banks: during a call, the customer’s speech, intonation and frequently used phrases are checked for genuineness.
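
The idea of scoring deviations from a user's usual course of action can be sketched in a few lines. The example below builds a frequency profile from a user's past actions and measures how "surprising" a new session is under that profile. It is a deliberately simplified stand-in for real behavioral analytics: the function names, the action names and the add-one smoothing are all illustrative assumptions.

```python
import math
from collections import Counter

def build_profile(history):
    """Count how often the user performed each action in past sessions."""
    return Counter(history)

def surprise(profile, session, vocabulary_size):
    """Average negative log-probability of the session's actions.

    Add-one smoothing gives unseen actions a small, nonzero probability.
    Higher values mean the session looks less like the user's history.
    """
    if not session:
        return 0.0
    total = sum(profile.values()) + vocabulary_size
    return sum(-math.log((profile[a] + 1) / total) for a in session) / len(session)

# Hypothetical banking user who mostly views the balance and pays bills.
profile = build_profile(["view_balance"] * 40 + ["pay_bill"] * 10)
typical = ["view_balance", "pay_bill", "view_balance"]
unusual = ["change_email", "export_all_records", "export_all_records"]

# Four distinct actions are assumed to exist in this toy system.
print(surprise(profile, typical, vocabulary_size=4))
print(surprise(profile, unusual, vocabulary_size=4))
```

The second session scores noticeably higher than the first. In practice, the alert threshold would have to be tuned per user or per group of similar users.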

5. Data protection with Association Rule Learning

To protect the database from hacking, it might be fully encrypted, along with all requests to and from it. But this approach can significantly slow down the system. It’s wiser to secure only sensitive information. But how do you determine which records are critical and which might be kept without encryption? Association Rule Learning (ARL) is a machine learning method for discovering relationships between items in large and distributed repositories. ARL methods are able to discover critical data based on existing correlations and recommend appropriate protection measures.
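
To show what such relationships look like, here is a toy sketch that mines rules of the form "if a query reads column X, it also reads column Y" using the two standard ARL measures, support and confidence. It is restricted to pairs of items and is not a full Apriori implementation. The function name `mine_pair_rules`, the thresholds and the column names are hypothetical. One possible reading of the result: a column that almost always travels with a known-sensitive one is a candidate for the same level of protection.

```python
from collections import Counter
from itertools import combinations

def mine_pair_rules(transactions, min_support=0.3, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) rules over item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = set(t)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of transactions containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # P(y present | x present)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

# Hypothetical sets of columns read together by individual queries.
queries = [{"name", "email"}, {"name", "card_number", "cvv"},
           {"card_number", "cvv"}, {"name", "email"},
           {"card_number", "cvv", "email"}]
for rule in mine_pair_rules(queries):
    print(rule)
```

On this sample, the only rules that pass both thresholds link `card_number` and `cvv`, in both directions.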

Of course, data science itself can’t be called a magic bullet for cybersecurity. However, employing it might greatly simplify the activities of the security department and increase the safety of your infrastructure. It should be mentioned as well that hackers also use data science and ML to find new ways of breaching systems. Therefore, the main task of company leaders is to hold the lead in this cat-and-mouse game.

What’s next?

If you are interested in this subject, get in touch with us. Our team would be happy to provide you with data science consulting as well as a step-by-step plan for enhancing the cybersecurity in your environment.