Multivariate analysis of MEG/EEG data

Introduction

The objective of this tutorial is to give an introduction to multivariate analysis of neuroimaging data. Multivariate methods aim to find task-related features in the data by predicting to which task single trials belong. The idea is that if the features can be used for prediction then those features must be important from the perspective of the task. Note that this is very different from classical statistical testing, where such features are identified by pooling over multiple trials and/or subjects and where the analysis typically assumes these features to be independent.

Background

Multivariate methods, such as classifiers, are important in the sense that they present an alternative approach to determine relevance of features in the data. Furthermore, classification is important for real-time analysis of MEG/EEG data, which allows different experimental designs, and more application-oriented research such as brain-computer interfacing and neurofeedback.

Procedure

In this tutorial we will use classifiers to analyze a brain-computer interfacing dataset which has been used in the following paper. The dataset consists of trials for a subject covertly attending to the left or right visual field. The goal then is to predict to which condition single trials belong. You can find the data here.

We start with preprocessed data which has been detrended and downsampled to 300 Hz. The trials start at cue onset and end 2.5 seconds later. The subject has been attending to either of both directions during this period.

Sensor level classification in the time domain

We will start by analyzing the data in the time domain for subject 1.

load covattpre1;

We now perform a timelock analysis in order to make the data suitable as input to ft_timelockstatistics. For the purpose of demonstration we will focus only on a subset of the channels at the end of the attention period.

cfg             = [];
cfg.channel     = {'MLO' 'MRO'}; % occipital channels only
cfg.latency     = [2.0 2.5]; % final bit of the attention period
cfg.parameter   = 'trial';
cfg.keeptrials  = 'yes'; % classifiers operate on individual trials
tleft   = ft_timelockanalysis(cfg,left);
tright  = ft_timelockanalysis(cfg,right);

Now we specify cross-validation as a method. This ensures that we will perform a classification of our data based on ten-fold cross-validation. This splits up the data into ten partitions or folds and attempts to build ten different classifiers using the remaining nine folds. The end result is then averaged over folds.

cfg         = [];
cfg.method  = 'crossvalidate';

We also need to specify a design matrix; this is simply a vector with labels 1 for the trials belonging to data for the first condition and labels 2 for trials belonging to data for the second condition

cfg.design  = [ones(size(tleft.trial,1),1); 2*ones(size(tright.trial,1),1)];

Finally, we call ft_timelockstatistics which uses the default classification procedure; namely a standardization of the data (subtraction of the mean and division by the standard deviation), followed by applying a linear support vector machine:

stat = ft_timelockstatistics(cfg,tleft,tright);

The stat structure now contains the classification performance (proportion of correctly classified trials)

stat.prob
>> 0.6118

and it is significant at a p=0.05

stat.significance
>> 0.0162

Note that we may be interested in other representations of classification performance such as the contingency matrix with true classes in rows and predicted classes in columns. One way to compute this is to use the option

cfg.metric = 'tabcounts';

when running ft_freqstatistics. A more direct way is to call

stat.cv.evaluate('metric','tabcounts')

ans =
  77    50
  49    79

which re-evaluates the results in the stat.cv object. This object contains all the results about the classification, including the trained classifiers. Please consult the help of the evaluate function for other available metrics.

We may also plot the parameters of the used classifier as if it were neuroimaging data. For each n-th condition we will obtain a stat.modeln field, containing the classification parameters for that condition. The interpretation of the parameters depends on the used classifier.

cfg             = [];
cfg.layout      = 'CTF275.lay';
cfg.xlim        = [2.0 2.5];
cfg.zparam      = 'model1';
cfg.comments    = '';
cfg.colorbar    = 'yes';
ft_topoplotER(cfg,stat);

Note that the plot is hard to interpret. If we look at individual features using ft_multiplotER then it will be found that all features are used due to the way classifiers operate. One way to solve this is to use dimensionality reduction. We will see an example later in this tutorial.

Exercise

* Suppose you use a dataset consisting of randomly generated data. What do you expect to happen when you test classifier performance using the same data? And what do you expect to happen if you use a second randomly generated dataset to test the classifier? Use the concepts of overfitting and generalization in your explanation.

* Suppose you try multiple different classification procedures and find at some point that you reach a classification performance that is significantly better than chance at p=0.05. Should you trust this result? Why (not)?

* Suppose we standardize the full dataset (subtract the mean and divide by the standard deviation for each feature) prior to splitting it up into training and test data. Why would this be considered to be a form of cheating when computing the final classification performance?

Sensor level classification in the frequency domain

We now try to classify the same data in the frequency domain. Therefore, we need to perform a frequency analysis. Let's focus on the alpha band at the end of the attention period.

cfg              = [];
cfg.output       = 'pow';
cfg.channel      = {'MLO' 'MRO'};
cfg.method       = 'mtmconvol';
cfg.taper        = 'hanning';
cfg.foi          = 8:2:14;
cfg.t_ftimwin    = ones(length(cfg.foi),1).*0.5;
cfg.toi          = 2.0:0.1:2.5;
cfg.keeptrials   = 'yes'; % classifiers operate on individual trials
tfrleft           = ft_freqanalysis(cfg, left);
tfrright          = ft_freqanalysis(cfg, right);

Now we call freqstatistics with crossvalidate as our method.

cfg         = [];
cfg.method  = 'crossvalidate';
cfg.design  = [ones(size(tfrleft.powspctrm,1),1); 2*ones(size(tfrright.powspctrm,1),1)];
stat = ft_freqstatistics(cfg,tfrleft,tfrright);

We now obtain an even better classification performance

stat.prob
>> 0.6905
stat.significance
>> 2.4352e-05

Again, we may plot the classifier parameters to obtain a so-called importance map

cfg             = [];
cfg.layout      = 'CTF275.lay';
cfg.zparam      = 'model1';
cfg.comment     = '';
cfg.colorbar    = 'yes';
ft_topoplotTFR(cfg,stat);

Instead of ten-fold cross-validation, we could also perform a so-called leave-one-out cross-validation. This can be realized by setting 'cfg.nfolds = inf'. However, this will take a lot of time since we need to build as many classifiers as there are trials in our data. Still, this is the most principled approach if it is computationally manageable.

Exercise

* Explain how the importance map should be interpreted. What is the information conveyed?

* Explain why it is important to perform cross-validation instead of just dividing the data into one training and one test set.

Dimensionality reduction

One important thing to consider when classifying neuroimaging data is that we wish to reduce as much as possible the number of features used to classify the data. There are two ways to achieve this: either we map the input data to another space with a smaller number of dimensions or we select a number of dimensions from the original space. Going back to our analysis of timelocked data, we could for instance use common spatial patterns to map our data to a different space

cfg         = [];
cfg.method  = 'crossvalidate';
cfg.design  = [ones(size(tleft.trial,1),1); 2*ones(size(tright.trial,1),1)];
cfg.mva = {standardizer() cspprocessor() svmmethod};
stat = ft_timelockstatistics(cfg,tleft,tright);
stat.prob
>> 0.8078
stat.significance
>> 2.0786e-12

or when analyzing the data in the frequency domain, we could perform a feature selection in the original space

cfg         = [];
cfg.method  = 'crossvalidate';
cfg.design  = [ones(size(tfrleft.powspctrm,1),1); 2*ones(size(tfrright.powspctrm,1),1)];
cfg.mva = {standardizer() logreg('L1',3)};

This multivariate analysis standardizes the data and subsequently calls L1 regularized logistic regression with regularization parameter L1 equal to 3. Since this classification procedure is pretty slow, for the purpose of demonstration, we use 80 percent of the data for training a classifier and the rest for testing classification performance.

cfg.nfolds = 0.80;

This gives the following result

stat = ft_freqstatistics(cfg,tfrleft,tfrright);
stat.prob 
>> 0.7600
stat.significance
>> 0.0164

Note that the result has a higher p-value, partly due to the lower number of test cases. However, if we inspect stat.model1 then we find that just a small number of features from the total of 912 possible features are used. Note that the classifier returns both a model1 and a model2; one for each experimental condition. This is also reflected in the topoplot. The positive values in the left hemifield probably indicate that for the left covert attention condition we have a synchronization of alpha at the ipsilateral side.

cfg             = [];
cfg.layout      = 'CTF275.lay';
cfg.zparam      = 'model1';
cfg.comment     = '';
cfg.colorbar    = 'yes';
ft_topoplotTFR(cfg,stat);

Exercise

* Use MatLab to compute the number of non-zero elements in stat.model1.

* Repeat the above analysis with a different value for L1. Explain the results.

* L1 is a free parameter in our model. How would you determine the optimal setting for this parameter?

* If we use more and more features then classification performance will first go up but eventually starts to degrade. Explain why this may happen.

* Suppose we wish to select the optimal features by testing all possible feature subsets. How many subsets do we need to test when we have n features in total?

* Explain the difference between L1 and L2 regularization

Transfer learning

Typically, different classifiers are constructed for individual datasets. However, sometimes it may be beneficial to combine multiple datasets. For example, in order to construct classifiers that in some way are related to each other. This is of importance when we wish to move from the individual to the group level in our analysis. Another benefit of this approach is that if we have little data to begin with, it can be useful to combine all data in order to obtain classifiers that show improved predictive performance. This is for instance important in the context of brain-computer interfacing, where we wish to construct good classifiers with as little data as possible. The idea of learning classifiers by combining multiple datasets is known as transfer learning.

A naive way to do transfer learning is to pool the trials of multiple datasets and apply a standard classifier. However, this is in general not a very good idea since it ignores the fact that there is real variability between these datasets. Hence, we would like to capture the commonalities as well as the differences between datasets. Here, we discuss one approach. We extend the gslr classifier of the previous section to a transfer learning setting. The resulting classifier will try to find the same features in multiple datasets but will fit parameters with respect to these features for individual datasets.

As an example we will split up the trials for our subject into two subsets to artifically create two separate datasets:


earlyleft  = tfrleft; earlyleft.powspctrm = tfrleft.powspctrm(1:63,:,:,:); 
earlyright = tfrright; earlyright.powspctrm = tfrright.powspctrm(1:63,:,:,:); 

lateleft  = tfrleft; lateleft.powspctrm = tfrleft.powspctrm(64:end,:,:,:); 
lateright = tfrright; tfrright.powscptrm = tfrright.powspctrm(64:end,:,:,:); 

In the design matrix we specify the conditions but now we also specify the dataset index in the cfg.dataset field. Furthermore, we call freqstatistics by listing the datasets in the sequence implied by the design. The classifier gslr_transfer will now perform the transfer learning for us. We obtain the following result:

cfg         = [];
cfg.method  = 'crossvalidate';
design1     = [ones(size(earlyleft.powspctrm,1),1); 2*ones(size(earlyright.powspctrm,1),1)];
design2     = [ones(size(lateleft.powspctrm,1),1); 2*ones(size(lateright.powspctrm,1),1)];
cfg.design  = [design1; design2];
cfg.dataset = [ones(length(design1),1); 2*ones(length(design2),1)];
cfg.mva = {standardizer() gslr_transfer('maxgroup',100)};
cfg.nfolds = 0.80;

stat = ft_freqstatistics(cfg,earlyleft,earlyright,lateleft,lateright);

stat.prob 
>> 0.7692 0.7949

If we plot the parameters for both datasets then we can observe the effect that the same features are selected whereas the estimated parameter values are different. Note that modelx_y stands for the x-th model of the y-th dataset.

cfg             = [];
cfg.layout      = 'CTF275.lay';
cfg.zparam      = 'model1';
cfg.comment     = '';

stat2 = stat;
stat.model1  = stat.model1_1;
stat2.model1 = stat2.model1_2;

subplot(1,2,1);
ft_topoplotTFR(cfg,stat);
subplot(1,2,2);
ft_topoplotTFR(cfg,stat2);

The figure shows the parameter plots for the two datasets. Note that the same features are found whereas the obtained parameter values are different.

Exercise

* BCI data is often collected over multiple sessions. Explain why transfer learning may help to get better results as compared to just combining the multisession data into one big dataset.

Conclusion

In this tutorial we have touched on a number of important issues in the classification of neuroimaging data. However, we have barely scratched the surface of this field since there are many more possibilities to explore. First, many different classification procedures can be devised that use different forms of preprocessing, feature selection and/or classification. In order to use some of the more advanced methods it is required to call lower level functions in FieldTrip's multivariate module directly. Second, some of the more sophisticated methods such as (dynamic) Bayesian networks have not been discussed although they are part of the multivariate module. Finally, in order to construct experimental designs that make use of multivariate analysis or to build BCI or neurofeedback applications we have developed the realtime module.

For the cognitive neuroscientist, we recommend to use the approach mentioned in this tutorial; i.e., classify in the time or frequency domain either using the standard classification procedure or the feature selection procedure. High and significant classification performance then shows that the effect of interest is detectable on the single trial level, possibly also indicating the regions for which this is the case.

—-

Direct link to the multivariate module

Exercise

* Explore the multivariate module and choose a classification pipeline which differs from the ones described above. Describe the characteristics of your pipeline and explain the obtained results.

tutorial/multivariateanalysis.txt · Last modified: 2010/04/13 09:45 by 131.174.138.164
Back to top
chimeric.de = chi`s home Creative Commons License Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0