BCI tools and techniques such as signal acquisition, signal processing, feature extraction, **Machine Learning algorithms**, and classification techniques contribute to the development and improvement of BCI technology.

Almost all BCI systems contain a Machine Learning algorithm as a central part. It learns from training data and derives a function that can be used **to discriminate between different patterns** of brain activity. Because it adapts the BCI system to the brain of a particular subject, it reduces the learning load imposed on that subject. For simplicity and practical reasons, Machine Learning algorithms are usually divided into two modules: feature extraction and classification.

The **feature extraction module** transforms brain signals into a representation that makes classification simpler, usually arrays of data. Extraction also aims to **remove noise** and other unnecessary information from the input signals while retaining the information needed to discriminate between different classes of signals.

Feature vectors are extracted from brain signals by signal processing methods. Neurophysiological knowledge of the subject in question helps to decide which characteristics of the brain signal to expect. This is fundamental for identifying the most discriminating information for the chosen paradigm. These features are translated into **a control signal** by Machine Learning algorithms.

Classification is challenging because of the low signal-to-noise ratio of EEG signals, their variability within a given subject, and their variability between subjects. **Pre-processing techniques** serve to improve EEG signals by removing noise and extracting relevant features.

## Data preparation

Before being able to apply any Machine Learning methodology, it is necessary to go through some steps that will allow us to transform the signals into tractable data samples.

After the pre-processing phase, the data is available as a continuous signal. This signal must now be cut into single segments and divided into two sets, for **training and testing**. The segments are obtained by extracting the data samples between 0 and 650 ms after the start of each intensification (during the pre-processing phase each significant signal is intensified). Knowing the timing of these samples helps to extract the segments as accurately as possible.
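The segment extraction described above can be sketched in a few lines. This is a minimal illustration, not the pipeline actually used: the function name `extract_epochs`, the single-channel list input, the 100 Hz sampling rate, and the onset positions are all assumptions made for the example.

```python
def extract_epochs(signal, onsets, fs, window_ms=650):
    """Cut one fixed-length segment per stimulus onset.

    signal : list of samples from one channel (hypothetical input)
    onsets : sample indices where each intensification starts
    fs     : sampling rate in Hz
    """
    n = int(round(window_ms / 1000 * fs))  # samples in the 0-650 ms window
    epochs = []
    for start in onsets:
        if start + n <= len(signal):  # skip windows that run past the recording
            epochs.append(signal[start:start + n])
    return epochs


# Toy usage: 2 s of a ramp "signal" sampled at 100 Hz, three onsets.
fs = 100
signal = [float(i) for i in range(2 * fs)]
epochs = extract_epochs(signal, onsets=[0, 50, 180], fs=fs)
```

The last onset is dropped because its window would extend past the end of the recording. The resulting list of segments is what would then be divided into a training set and a test set.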

Filtering is a crucial step. It allows **noise reduction**, as most artifacts occur at known frequencies. The filtered signals are then decimated according to the highest cutoff frequency. This yields a vector of useful segments, which is then normalized to zero mean and unit variance.
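As a rough sketch of the filter, decimate, and normalize chain, the example below uses a moving average as a stand-in low-pass filter. That choice, the decimation factor, and the per-segment normalization are assumptions for illustration only; the text does not specify the actual filter design, and a real pipeline would likely use a proper band-pass filter.

```python
from statistics import fmean, pstdev


def moving_average(x, width):
    """Crude low-pass filter: mean over a sliding window of `width` samples."""
    return [fmean(x[i:i + width]) for i in range(len(x) - width + 1)]


def decimate(x, factor):
    """Keep every `factor`-th sample of an already low-passed signal."""
    return x[::factor]


def zscore(x):
    """Rescale a segment to zero mean and unit variance."""
    mu, sigma = fmean(x), pstdev(x)
    if sigma == 0:
        raise ValueError("constant segment cannot be normalized")
    return [(v - mu) / sigma for v in x]


# Made-up segment of 12 samples.
segment = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0, 100.0, 121.0]
features = zscore(decimate(moving_average(segment, width=3), factor=3))
```

Filtering before decimation matters: dropping samples without first removing the higher frequencies would fold them back into the retained band as aliasing.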

## Machine Learning and classifiers

One of the main purposes of Machine Learning is **classification**, that is, the problem of identifying the class of a new instance on the basis of knowledge extracted from a training set. A system that classifies is called a classifier. Classifiers extract a model from the dataset, which they then use to classify new instances. If a single instance can be expressed as a vector in a numerical space \(\mathbb{R}^n\), the classification problem can be reduced to the search for the surfaces that delimit the classes.

### Linear Classifiers

The training data consists of pairs (x, y) of inputs x ∈ X and desired outputs y ∈ Y. A **learning algorithm** must choose, based on the training examples, a function f: X → Y such that new examples, not contained in the training set, are correctly mapped to the corresponding output. For practical reasons, the functions f are usually indexed by a set of parameters θ, i.e. \(y = f(x; \theta)\). Choosing the function is therefore equivalent to choosing the parameters. In the binary case Y = {1, -1}, the linear classifier is represented by a single discriminant function defined by the weight vector ω and the bias b:

$$f(x) = (\omega \cdot x) + b$$

The input vector x is assigned to the class y∈ {1, -1} as follows:

$$y = \begin{cases} +1 & \text{if } (\omega \cdot x) + b \geq 0 \\ -1 & \text{if } (\omega \cdot x) + b < 0 \end{cases}$$

Algorithms based on linear classifiers differ in how they determine the vector ω and the bias b. The parameters obtained in the training phase are used in the test phase to **predict the class** to which each test example belongs. These algorithms can then be compared on the basis of their performance.
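To make the training and test phases concrete, here is a minimal sketch of a linear classifier. The perceptron update rule is used only as one simple example of how ω and b might be determined; it is an assumption for illustration, not the method this article prescribes. The function names and the toy data are hypothetical, and a score of exactly zero is assigned to class +1.

```python
def predict(w, b, x):
    """Return +1 if (w . x) + b >= 0, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Fit w and b with the perceptron rule (one of many possible choices)."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if predict(w, b, x) != y:  # update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


# Toy, linearly separable training set (made-up numbers).
X_train = [[2.0, 1.0], [3.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y_train = [1, 1, -1, -1]
w, b = train_perceptron(X_train, y_train)
train_predictions = [predict(w, b, x) for x in X_train]
```

Once `w` and `b` are fixed, classifying a test example is a single dot product plus a sign check, which is one reason linear classifiers are attractive for BCI systems.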

### Linear Support Vector Machines (SVM)

The signals provided are high-dimensional and have a low signal-to-noise ratio. There is a further problem: the signal response varies because of EEG signal components unrelated to the brain activity of a single subject. SVM is a powerful approach for **pattern recognition**, in particular for high-dimensional problems, so it is frequently used in BCI research.

SVM is one of the most widely used tools for pattern classification because, instead of estimating the probability densities of the classes, it directly solves the problem of interest: determining the decision surfaces between the classes (classification boundaries).

Given two classes of linearly separable multidimensional patterns, SVM determines, among all possible separating hyperplanes, the **one capable of separating** the classes with the greatest possible margin. The margin is the minimum distance between the training points of the two classes and the identified hyperplane.

Margin maximization is related to generalization. If the training set patterns are classified by a wide margin, one can “hope” that even test patterns close to the border between classes are managed correctly.
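The notion of margin can be illustrated by computing it for two candidate hyperplanes on the same data. This sketch only evaluates the margin of given hyperplanes; it does not perform SVM training. The helper `margin` and the toy points are assumptions made for the example.

```python
import math


def margin(w, b, samples, labels):
    """Smallest signed distance of any sample to the hyperplane (w . x) + b = 0.

    A negative value means at least one sample is on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(samples, labels)
    )


# Toy 2-D data: class +1 on the right, class -1 on the left (made-up points).
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

m_centered = margin([1.0, 0.0], 0.0, X, y)  # hyperplane x1 = 0
m_shifted = margin([1.0, 0.0], -1.0, X, y)  # hyperplane x1 = 1
```

Both hyperplanes separate the training data without error, but the centered one leaves twice as much room to the nearest point. Of these two candidates, it is the one the maximum-margin criterion prefers.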

### Bayesian classifier

Bayes’ theorem is a fundamental tool for **experience-based pattern classification**, where the experience is the training set. With the Bayesian approach, it would be possible to construct an optimal classifier if both the a priori probabilities \(p(y_i)\) and the class-conditional densities \(p(x|y_i)\) were perfectly known. In practice this information is rarely available, and the approach adopted is to build a classifier from a set of examples.
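For reference, the two quantities named above combine through Bayes' theorem into the posterior probability of each class, and the classifier picks the class with the largest posterior. The symbols \(p(y_i|x)\) and \(\hat{y}\) are notation introduced here for illustration:

$$p(y_i \mid x) = \frac{p(x \mid y_i)\, p(y_i)}{\sum_j p(x \mid y_j)\, p(y_j)}, \qquad \hat{y} = \arg\max_{y_i} \; p(x \mid y_i)\, p(y_i)$$

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.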

To model \(p(x|y_i)\), a parametric approach is normally used and, when possible, this distribution is modeled as a Gaussian or with spline functions.

The most widely used estimation techniques are Maximum Likelihood (ML) and Bayesian estimation which, although different in logic, lead to almost identical results. The **Gaussian distribution** is normally an appropriate model for most pattern recognition problems.
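A minimal sketch of a Gaussian classifier with Maximum-Likelihood parameter estimates might look as follows. It assumes a single scalar feature per example, which is a simplification chosen for brevity; real EEG feature vectors are multidimensional. The function names and data are hypothetical.

```python
import math
from statistics import fmean, pvariance


def fit_gaussian_bayes(values, labels):
    """Estimate a prior, mean, and variance per class from 1-D features.

    The sample mean and the population variance are the
    maximum-likelihood estimates for a Gaussian.
    """
    model = {}
    for c in set(labels):
        xs = [v for v, l in zip(values, labels) if l == c]
        model[c] = (len(xs) / len(values), fmean(xs), pvariance(xs))
    return model


def log_posterior(x, prior, mu, var):
    """log p(x | y) + log p(y), up to the shared constant log p(x)."""
    log_likelihood = -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    return log_likelihood + math.log(prior)


def classify(model, x):
    """Pick the class with the largest posterior."""
    return max(model, key=lambda c: log_posterior(x, *model[c]))


# Made-up 1-D feature values for two classes.
values = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
model = fit_gaussian_bayes(values, labels)
```

Calling `classify(model, 2.2)` assigns the value to class -1 and `classify(model, 6.0)` to class +1. Working with log probabilities avoids numerical underflow when the densities become very small.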
