Semi-Supervised Classification with Graph Convolutional Networks (GCNs)
The explosive growth of data in recent years has brought both opportunities and challenges to machine learning. One such challenge is classifying data points when labeled training samples are scarce. Traditional supervised learning relies heavily on labeled data, which limits its use in scenarios where acquiring labels is impractical. Semi-supervised learning addresses this by combining labeled and unlabeled data to improve classification accuracy, and Graph Convolutional Networks (GCNs) have emerged as a powerful semi-supervised tool for graph-structured data. In this article we delve into semi-supervised classification with GCNs and cover the following:
- Introduction to Graph Convolutional Networks
- Semi-Supervised vs. Supervised Learning
- Inductive vs. Transductive Semi-Supervised Learning
- Implementation of graph node classification in a semi-supervised setup using GCNs (PyTorch).
Introduction to Graph Convolutional Networks
GCNs are a special type of Convolutional Neural Network (CNN) designed to operate on graph-structured data. Common types of problems that GCNs deal with include:
- Node classification: each node needs to be classified based on its neighboring nodes.
- Graph classification: the whole graph needs to be classified.
- Link prediction: predicting the existence or type of links between nodes, e.g. bond prediction between atoms in medicinal chemistry.
To read more about GCNs, refer to our detailed article about Graph Neural Networks and their applications.
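Before moving on, it helps to see what a single graph convolution actually computes. The layer below is a minimal sketch (not the pygcn implementation we import later); it assumes a dense, already-normalized adjacency matrix adj and follows the propagation rule from the GCN paper [3]: first transform the node features with a weight matrix, then aggregate each node's neighborhood by multiplying with adj.

import math
import torch
import torch.nn as nn

class SimpleGraphConvolution(nn.Module):
    """Minimal sketch of a graph convolution layer: out = adj @ (x @ W) + b,
    where adj is assumed to be a dense, normalized adjacency matrix."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_features, out_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        stdv = 1.0 / math.sqrt(out_features)        # simple uniform initialization
        nn.init.uniform_(self.weight, -stdv, stdv)

    def forward(self, x, adj):
        support = x @ self.weight                   # transform each node's features
        return adj @ support + self.bias            # aggregate over each node's neighborhood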
Semi-Supervised vs. Supervised Learning
Semi-supervised learning is a machine learning method that falls between supervised and unsupervised learning. In supervised learning, we train a model on a labeled dataset, providing the correct output for each example in the training set. In unsupervised learning, the model is not given labeled training examples and must find patterns and relationships in the data on its own.
Semi-supervised learning, on the other hand, trains the model on a partially labeled dataset: only some of the examples in the training set have known, correct outputs. The model uses this information, along with the patterns it finds in the unlabeled data, to make predictions on new examples.
Figure 3 illustrates the difference between semi-supervised and supervised learning. For example, assume you have 8 labeled and 2 unlabeled red points for Class A, and 8 labeled and 2 unlabeled green points for Class B. In supervised learning we use only the 8 labeled red points from Class A and the 8 labeled green points from Class B, discarding the unlabeled examples because we have no labels for them. In semi-supervised learning, the unlabeled points are also used during training, as shown in the sketch below.
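To make this concrete, here is a small illustrative snippet (a toy sketch, unrelated to the Cora pipeline below) that encodes the example above with a label mask. Supervised training only ever uses the masked-in points, while a semi-supervised method also passes the unlabeled points through the model:

import torch

# Toy version of the example above: 10 points per class, 8 labeled and 2 unlabeled.
# Unlabeled examples are marked with the placeholder label -1.
labels = torch.tensor([0] * 8 + [-1] * 2 + [1] * 8 + [-1] * 2)   # Class A = 0, Class B = 1
labeled_mask = labels >= 0       # the 16 labeled points a supervised model trains on
unlabeled_mask = ~labeled_mask   # the 4 unlabeled points a semi-supervised model also uses

print(labeled_mask.sum().item())    # 16
print(unlabeled_mask.sum().item())  # 4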
Inductive vs. Transductive Semi-Supervised Learning
In the realm of semi-supervised learning, two primary approaches have gained prominence: inductive and transductive semi-supervised learning. Inductive semi-supervised learning aims to generalize the patterns and relationships found in the labeled and unlabeled data in order to make predictions on new, unseen data points. It focuses on building a model that can accurately classify unseen instances by leveraging information from both labeled and unlabeled samples during training. Transductive semi-supervised learning, on the other hand, is concerned with making predictions only on the given unlabeled instances, without attempting to generalize to unseen data: it directly infers labels for the unlabeled data by exploiting the relationships and similarities between the labeled and unlabeled samples. Both approaches have their own advantages and trade-offs, and understanding their distinctions is crucial for choosing the most suitable method for a specific task.
- In inductive semi-supervised learning, the learner has both labeled training data $\{(x_i, y_i)\}_{i=1}^{l} \sim p(x, y)$ and unlabeled training data $\{x_j\}_{j=l+1}^{l+u} \sim p(x)$, and learns a predictor $f: \mathcal{X} \mapsto \mathcal{Y}$, $f \in \mathcal{F}$, where $\mathcal{F}$ is the hypothesis space. Here $x \in \mathcal{X}$ is an input instance, $y \in \mathcal{Y}$ its target label (discrete for classification or continuous for regression), $p(x, y)$ the unknown joint distribution and $p(x)$ its marginal. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone.
- In transductive learning, which dates back to the early work of Vapnik and Chervonenkis (VC theory), the setting is the same except that one is solely interested in the predictions on the unlabeled training data $\{x_j\}_{j=l+1}^{l+u}$, without any intention to generalize to future test data [1]. The toy sketch after this list illustrates the difference in code.
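The Cora experiment later in this article follows the transductive setting. The snippet below (with a plain linear layer standing in for a GCN, purely for illustration) shows the mechanics: the forward pass covers all nodes, the loss is restricted to the labeled ones, and predictions for the known unlabeled nodes are read off the same output.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy transductive setup: 6 nodes, the first 4 labeled and the last 2 unlabeled.
features = torch.randn(6, 3)                 # 6 nodes with 3 features each
adj = torch.eye(6)                           # stand-in for a normalized adjacency matrix
labels = torch.tensor([0, 1, 0, 1, 0, 1])    # true classes (unknown for nodes 4 and 5)
idx_train = torch.arange(0, 4)               # labeled nodes
idx_unlabeled = torch.arange(4, 6)           # unlabeled nodes we want to classify

model = nn.Linear(3, 2)                      # a plain linear layer standing in for a GCN
output = F.log_softmax(model(adj @ features), dim=1)   # forward pass over ALL nodes

# Transductive: the loss uses only the labeled nodes, but the unlabeled nodes took
# part in the same forward pass, so we read their predictions directly.
loss = F.nll_loss(output[idx_train], labels[idx_train])
preds_unlabeled = output[idx_unlabeled].argmax(dim=1)

# Inductive learning would instead train on a graph without the unlabeled nodes and
# apply the trained predictor to new, previously unseen nodes or graphs at test time.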
Code implementation of Graph Convolutional Networks in a semi-supervised setup (PyTorch)
In the next code example we will cover node-level classification of the Cora dataset [2] using a semi-supervised approach.
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
We encourage you to first read our previous blog post about Graph Neural Networks to more easily follow the next section, in which we implement semi-supervised classification with GCNs [3].
First let’s define our model:
import torch.nn as nn
import torch.nn.functional as F
from pygcn.layers import GraphConvolution


class GCN(nn.Module):
    def __init__(self, nfeat, nhid, nclass, dropout):
        super(GCN, self).__init__()
        # First graph convolution: input features -> hidden representation
        self.gc1 = GraphConvolution(nfeat, nhid)
        # Optional extra layers left commented out from earlier experiments
        # self.gc2 = GraphConvolution(2 * nhid, nhid)
        # self.gc3 = GraphConvolution(nhid, nhid)
        # Output graph convolution: hidden representation -> class scores
        self.gc4 = GraphConvolution(nhid, nclass)
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj))
        # x = F.relu(self.gc2(x, adj))
        # x = F.relu(self.gc3(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gc4(x, adj)
        return F.log_softmax(x, dim=1)


model = GCN(nfeat=features.shape[1],
            nhid=args.hidden,
            nclass=labels.max().item() + 1,
            dropout=args.dropout)
print(model)
# GCN(
#   (gc1): GraphConvolution (1433 -> 16)
#   (gc4): GraphConvolution (16 -> 7)
# )
Our model consists of two graph convolution layers. The first layer transforms each node’s 1433-dimensional feature vector (indicating which dictionary words appear in a paper) into a 16-dimensional hidden representation. The second layer maps the hidden representation to log-softmax scores over the 7 classes a paper can belong to.
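Ignoring dropout, this two-layer architecture matches the propagation rule from the GCN paper [3]:

$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$

where $\hat{A}$ is the normalized adjacency matrix, $X$ the matrix of node features, and $W^{(0)}$, $W^{(1)}$ the weights of the two graph convolution layers; the code returns $\log Z$ via log_softmax.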
Now let’s explore the Cora dataset:
adj, features, labels, idx_train, idx_val, idx_test = load_data()
adj.shape
#torch.Size([2708, 2708])
features.shape
#torch.Size([2708, 1433])
idx_train.shape
#torch.Size([140])
idx_val.shape
#torch.Size([300])
idx_test.shape
#torch.Size([1000])
As we can see, the load_data() function returns the indices of the nodes used for training (140 nodes), validation (300 nodes), and testing (1,000 nodes).
Note: we loaded the dataset with some helper functions that can be found in our final notebook [4]; for now we just want to understand the big picture.
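One important piece of that preprocessing is normalizing the adjacency matrix before it is fed to the graph convolutions. The function below is a minimal sketch of the symmetric normalization $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$ from the GCN paper [3], written with dense tensors for readability (the helper in the notebook may use sparse matrices and a slightly different normalization variant):

import torch

def normalize_adjacency(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization A_hat = D^(-1/2) (A + I) D^(-1/2) on a dense adjacency."""
    adj = adj + torch.eye(adj.size(0))            # add self-loops
    deg = adj.sum(dim=1)                          # node degrees
    d_inv_sqrt = deg.pow(-0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0     # guard against isolated nodes
    d_mat = torch.diag(d_inv_sqrt)
    return d_mat @ adj @ d_mat

# Example: a tiny graph with 3 nodes and 2 undirected edges
a = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
print(normalize_adjacency(a))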
Now let’s see how we can achieve a semi-supervised setup during training:
def train(epoch):
    t = time.time()
    model.train()
    optimizer.zero_grad()
    # Forward pass over the FULL graph: labeled and unlabeled nodes alike
    output = model(features, adj)
    # Loss and accuracy are computed on the labeled training nodes only
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer.step()

    if not args.fastmode:
        # Evaluate validation set performance separately;
        # this deactivates dropout during the validation run.
        model.eval()
        output = model(features, adj)

    loss_val = F.nll_loss(output[idx_val], labels[idx_val])
    acc_val = accuracy(output[idx_val], labels[idx_val])
    print('Training-validation results:',
          'Epoch: {:04d}'.format(epoch + 1),
          'loss_train: {:.4f}'.format(loss_train.item()),
          'acc_train: {:.4f}'.format(acc_train.item()),
          'loss_val: {:.4f}'.format(loss_val.item()),
          'acc_val: {:.4f}'.format(acc_val.item()),
          'time: {:.4f}s'.format(time.time() - t))
In this function the forward pass runs over the full graph, so the validation and test nodes participate in message passing as if they were unlabeled. However, the loss and accuracy used to update the model are computed on the labeled training nodes only.
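The train function above assumes an optimizer and a few command-line arguments have already been set up. A typical configuration looks like the following (the hyperparameter values here are illustrative defaults, not necessarily those used in the final notebook [4]):

import time
import argparse
import torch.optim as optim

parser = argparse.ArgumentParser()
parser.add_argument('--hidden', type=int, default=16)            # hidden layer size
parser.add_argument('--dropout', type=float, default=0.5)        # dropout probability
parser.add_argument('--lr', type=float, default=0.01)            # learning rate
parser.add_argument('--weight_decay', type=float, default=5e-4)  # L2 regularization
parser.add_argument('--epochs', type=int, default=200)           # number of training epochs
parser.add_argument('--fastmode', action='store_true')           # skip the separate eval pass
args = parser.parse_args()

optimizer = optim.Adam(model.parameters(),
                       lr=args.lr,
                       weight_decay=args.weight_decay)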
Finally, we perform the testing on the test set nodes:
def test():
    model.eval()
    output = model(features, adj)
    loss_test = F.nll_loss(output[idx_test], labels[idx_test])
    acc_test = accuracy(output[idx_test], labels[idx_test])
    print("Test set results:",
          "loss= {:.4f}".format(loss_test.item()),
          "accuracy= {:.4f}".format(acc_test.item()))

# Train model
t_total = time.time()
for epoch in range(args.epochs):
    train(epoch)
print("Optimization Finished!")
print("Total time elapsed: {:.4f}s".format(time.time() - t_total))

# Testing
test()
Conclusion
- Semi-supervised learning is about using unlabeled examples during training to extract additional information from them.
- The features of both training and test nodes are used during training, but the loss is calculated only on the labeled training samples.
We also implemented a Graph Attention Network (GAT) for semi-supervised learning, which can be found in the following GitHub repository [5].