Graph Neural Networks (GNNs) and Their Applications
Deep learning is good at capturing hidden patterns in Euclidean data (images, text, videos). But what about applications where the data comes from non-Euclidean domains and is represented as graphs with complex relationships and interdependencies between objects? That’s where Graph Neural Networks (GNNs) come in. In this article, we explore GNNs and their applications.
Real-world objects are often defined in terms of their connections to other things. A set of objects, and the connections between them, are naturally expressed as a graph. Researchers have been developing neural networks that operate on graph data (called graph neural networks, or GNNs) for over a decade [1].
We’ll start with graph representation and basic definitions, move on to Graph Convolutional Neural Networks, and then finish with the different types of GNN models and the most common problems that graph models try to solve.
Graph Representation
In our previous article, Graphs for Graph Neural Networks [2], we illustrated the graph structure and how we represent graphs from 2D images. In this section we recap the important graph definitions, then move directly to the graph models and the problems they are trying to solve.
Any graph G is composed of two main components: a set of vertices (nodes) V and a set of links between them (edges) E (see Figure 2). An edge is a pair (u, v) representing two connected nodes u, v ∈ V. Note that an undirected graph is one in which (u, v) ∈ E ⟹ (v, u) ∈ E.
The most common way to represent the edges E is the adjacency matrix A, a binary square matrix of size |V| x |V|, where A𝑢,𝑣 = 1 if there is a connection between nodes u and v, and 0 otherwise (see Figure 1).
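As a minimal sketch of this definition, the snippet below builds the adjacency matrix of a small undirected graph from an edge list with NumPy. The 4-node edge list and the helper name `build_adjacency` are illustrative assumptions, not taken from the figures.

```python
import numpy as np


def build_adjacency(num_nodes, edges, undirected=True):
    """Return a binary |V| x |V| adjacency matrix for the given edge list."""
    adj = np.zeros((num_nodes, num_nodes), dtype=np.int64)
    for u, v in edges:
        adj[u, v] = 1
        if undirected:
            # (u, v) in E implies (v, u) in E for an undirected graph
            adj[v, u] = 1
    return adj


# Hypothetical toy graph: 4 nodes, 4 undirected edges
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
A = build_adjacency(4, edges)
print(A)
```

Because the graph is undirected, the resulting matrix is symmetric: A equals its own transpose.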

What are Graph Convolutional Neural Networks?
Graph Convolutional Neural Networks (GCNNs) are neural networks that extend the idea behind Convolutional Neural Networks (CNNs) to graph-structured data. GCNNs commonly deal with the following types of problems:
- Node Classification: each node is classified based on its own features and those of its neighboring nodes.
- Graph Classification: the whole graph is classified.
- Link Prediction: predicting whether a link exists between two nodes, or what type it is, e.g. bond prediction between atoms in medicinal chemistry.

These problems (Figure 2) can be tackled by different types of models that build on the idea of the Convolutional Neural Network. Among these models are:
- Graph Convolutional Neural Networks (GCNNs).
- Attention Graph Neural Networks.
- Message Passing Neural Networks.
In this article we cover a simple GCNN model and an attention GNN model.
Graph Convolutional Neural Network (Node Classification)
The Graph Convolutional Network was first introduced in 2016 by Kipf et al. [3]. The main idea behind GCNNs is message passing: each node updates its own features by receiving the features of its neighboring nodes and then summing or otherwise aggregating them, which is called the aggregation step (Figure 3).

The message passing in the graph convolution operation can be described by the following equation:

Where σ represents an arbitrary activation function, h represents the features of a node, a is the adjacency matrix, and N(v) is the set of neighbors of the node v whose features we are updating. We encourage readers to multiply the adjacency matrix by the feature matrix of all nodes by hand to see that only connected nodes can affect each other.
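The multiplication suggested above can be checked numerically. The sketch below uses the same 4-node features and adjacency matrix as the dummy example later in this article (written with NumPy here for brevity) and shows that each row of the product is the sum of the features of that node's neighbors, the node itself included because the diagonal is 1.

```python
import numpy as np

# Features of 4 nodes, 2 features each: rows are [0, 1], [2, 3], [4, 5], [6, 7]
H = np.arange(8, dtype=np.float32).reshape(4, 2)

# Adjacency matrix with self-connections on the diagonal
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 1, 1],
              [0, 1, 1, 1]], dtype=np.float32)

# Row v of A @ H sums the feature rows of every node u with A[v, u] = 1
aggregated = A @ H
print(aggregated)
```

Node 0 is connected only to itself and node 1, so its aggregated row is [2, 4]. Nodes 2 and 3 share the same neighborhood and therefore receive identical sums, [12, 15].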
Formally, what happens is described in Equation 2. First, we scale the adjacency matrix by the diagonal degree matrix D, which gives a weight to each node. Then the message-receiving mechanism is achieved by multiplying this normalized adjacency matrix by the nodes’ feature matrix, as follows:

Where L is the current iteration (layer), H is the feature matrix, W is the matrix of learnable weights that transforms the input feature matrix H into messages, D is the degree matrix, whose diagonal elements give the number of nodes connected to each node, and σ represents an arbitrary activation function, not necessarily the sigmoid (a ReLU-based activation function is usually used in GNNs).
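For reference, the two equations described above can be written as follows. This is a reconstruction from the symbol definitions in the text and the formulation in [3], so the exact normalization is an assumption: [3] uses the symmetric form shown here, with self-connections added to the adjacency matrix before D is computed.

```latex
% Equation 1: message passing for a single node v
h_v' = \sigma\Big( \sum_{u \in N(v)} a_{vu} \, h_u \Big)

% Equation 2: matrix form with degree normalization, as in Kipf and Welling [3]
H^{(L+1)} = \sigma\Big( \tilde{D}^{-1/2} \, \tilde{A} \, \tilde{D}^{-1/2} \, H^{(L)} W^{(L)} \Big),
\qquad \tilde{A} = A + I, \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}
```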
Python Code implementation
In this section we will implement a Graph Convolutional Neural Network model (using the PyTorch library) and try it on a dummy node classification example.
import math

import torch
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module


class GraphConv(Module):
    """
    Simple GCN layer, similar to https://arxiv.org/abs/1609.02907
    """

    def __init__(self, in_features, out_features, bias=True):
        super(GraphConv, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable weight matrix W of shape (in_features, out_features)
        self.weight = Parameter(torch.empty(in_features, out_features))
        if bias:
            self.bias = Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        # Uniform initialization scaled by the output dimension
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input, adj):
        # Transform the node features: H W
        support = torch.mm(input, self.weight)
        # Aggregate over neighbors: A (H W); adj may be sparse or dense
        if adj.is_sparse:
            output = torch.sparse.mm(adj, support)
        else:
            output = torch.mm(adj, support)
        if self.bias is not None:
            return output + self.bias
        else:
            return output
import torch.nn as nn
import torch.nn.functional as F


class GCN(nn.Module):
    def __init__(self, nfeat, hid, nclass, dropout):
        super(GCN, self).__init__()
        self.gconv1 = GraphConv(nfeat, hid)
        self.gconv2 = GraphConv(hid, hid)
        self.dropout = dropout
        self.fc = nn.Linear(hid, nclass)

    def forward(self, x, adj):
        x = F.relu(self.gconv1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gconv2(x, adj)
        x = self.fc(x)
        # Per-node log-probabilities over the classes
        return F.log_softmax(x, dim=1)
Next, we define a dummy feature matrix for 4 nodes and their adjacency matrix.
node_feats = torch.arange(8, dtype=torch.float32).view(1, 4, 2)
adj_matrix = torch.Tensor([[[1, 1, 0, 0],
                            [1, 1, 1, 1],
                            [0, 1, 1, 1],
                            [0, 1, 1, 1]]])
print("Node features:\n", node_feats)
print("\nAdjacency matrix:\n", adj_matrix)
# Fix the seed so the random weight initialization is repeatable
# (the value 0 is arbitrary)
torch.manual_seed(0)
model = GCN(nfeat=2,
            hid=16,
            nclass=2,
            dropout=0.5)
model(node_feats.squeeze(0), adj_matrix.squeeze(0))
# Example output (exact values depend on the random initialization and dropout):
# tensor([[-0.9284, -0.5029],
#         [-1.2041, -0.3566],
#         [-1.0035, -0.4566],
#         [-1.0035, -0.4566]], grad_fn=<LogSoftmaxBackward0>)
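The dummy example above passes the raw adjacency matrix to the model, whereas Equation 2 additionally scales it by the degree matrix. A minimal NumPy sketch of that step, assuming the symmetric normalization used in [3] (the function name `normalize_adjacency` is hypothetical), could look like this:

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetrically normalize an adjacency matrix: D^{-1/2} A D^{-1/2}.

    Assumes `adj` already contains self-connections, so every degree is >= 1.
    """
    degree = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Scaling rows and columns equals multiplying by diag(d_inv_sqrt) on both sides
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 1],
                [0, 1, 1, 1],
                [0, 1, 1, 1]], dtype=np.float64)
adj_norm = normalize_adjacency(adj)
print(adj_norm.round(3))
```

The result can be converted with `torch.from_numpy(adj_norm).float()` and passed to the model in place of the raw matrix.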
Attention Graph Neural Networks
The graph attention mechanism is very similar to the main idea of the Transformer [4], where an attention function can be described as mapping a query and a set of key-value pairs to an output. Similarly, in a graph, graph attention takes its query from the node itself and its keys and values from the neighboring nodes (the messages).
The graph attention mechanism is defined by the following equation:
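In the standard graph attention formulation, and using the numbering (1) to (4) that the code later in this section refers to, the mechanism can be written as below. Treat this as a reference reconstruction: W (the shared linear transformation), the attention vector a, and the intermediate z are symbols assumed here, and N(i) denotes the neighborhood of node i.

```latex
z_i = W h_i                                                                % (1)
e_{ij} = \mathrm{LeakyReLU}\big( \vec{a}^{\,T} ( z_i \,\|\, z_j ) \big)    % (2)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N(i)} \exp(e_{ik})}          % (3)
h_i' = \sigma\Big( \sum_{j \in N(i)} \alpha_{ij} \, z_j \Big)              % (4)
```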

Where || is the concatenation operation, hi and hj are the original features of nodes i and j respectively, and αij is the final attention weight from node i to node j. Note that the query here is the feature vector of the node itself, and the key is the neighboring node.
Once the attention matrix is calculated, we can update the feature vector of each node as follows:

Where σ represents a non-linear activation function. The Whj term can be considered the values term from the Transformer’s perspective. We can visualize the attention process as follows:

Now let’s implement a graph attention model using the Deep Graph Library (DGL) on the Cora dataset.
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links [5]. The whole dataset forms one large graph in which each node represents a publication.
Let’s first define the attention layer using the DGL library:
class GATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim):
        super(GATLayer, self).__init__()
        self.g = g
        # equation (1)
        self.fc = nn.Linear(in_dim, out_dim, bias=False)
        # equation (2)
        self.attn_fc = nn.Linear(2 * out_dim, 1, bias=False)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_normal_(self.fc.weight, gain=gain)
        nn.init.xavier_normal_(self.attn_fc.weight, gain=gain)

    def edge_attention(self, edges):
        # edge UDF for equation (2)
        z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)
        #print('z2', z2.shape)
        a = self.attn_fc(z2)
        #print('a', a.shape)
        return {'e': F.leaky_relu(a)}

    def message_func(self, edges):
        # message UDF for equation (3) & (4)
        # source node features || edge features
        return {'z': edges.src['z'], 'e': edges.data['e']}

    def reduce_func(self, nodes):
        # reduce UDF for equation (3) & (4)
        # equation (3)
        alpha = F.softmax(nodes.mailbox['e'], dim=1)
        #print(nodes.mailbox['e'].shape)
        # equation (4)
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1)
        #print(h.shape)
        return {'h': h}

    def forward(self, h):
        # equation (1)
        z = self.fc(h)
        #print('z',z.shape)
        self.g.ndata['z'] = z
        # equation (2)
        self.g.apply_edges(self.edge_attention)
        # equation (3) & (4)
        self.g.update_all(self.message_func, self.reduce_func)
        return self.g.ndata.pop('h')
Now let’s load the Cora dataset and run the layer on it as a test:
from dgl.data import citation_graph as citegrh


def load_cora_data():
    data = citegrh.load_cora()
    # Note: in recent DGL releases the same tensors are also stored on the
    # graph as g.ndata['feat'], g.ndata['label'] and g.ndata['train_mask']
    features = torch.FloatTensor(data.features)
    labels = torch.LongTensor(data.labels)
    mask = torch.BoolTensor(data.train_mask)
    g = data[0]
    return g, features, labels, mask


g, features, labels, mask = load_cora_data()
net1 = GATLayer(g,
                features.size()[1],
                out_dim=7)
net1(features).shape
Now let’s split the code by equation to understand each step.
The first equation is a simple linear transformation of the input node features into a specific dimension, as follows:

def forward(self, h):
    # equation (1)
    z = self.fc(h)
After that, apply_edges takes an edge-level user-defined function (UDF) as input; we encourage you to read the documentation carefully to see how it is used. The function works on the node features through the edges: it calculates the attention score (e) by taking the Leaky ReLU of the output of a linear layer applied to the features of a node concatenated with those of its neighbor. The edge function is defined as follows:
def edge_attention(self, edges):
    # edge UDF for equation (2)
    z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)
    #print('z2', z2.shape)
    a = self.attn_fc(z2)
    #print('a', a.shape)
    return {'e': F.leaky_relu(a)}
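To make the shapes concrete, here is a small NumPy sketch of the same computation outside DGL. The three edges, the 2-dimensional features, and the attention vector `a` are made-up values for illustration only.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.01):
    """Element-wise LeakyReLU (0.01 is also the default slope in PyTorch)."""
    return np.where(x > 0, x, negative_slope * x)


# Transformed features z for 3 nodes (hypothetical values)
z = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Edges as (source, destination) index pairs
src = np.array([0, 1, 2])
dst = np.array([1, 2, 0])

# Attention vector of size 2 * out_dim, playing the role of attn_fc
a = np.array([1.0, -1.0, 0.5, -2.0])

# Concatenate source and destination features per edge: shape (num_edges, 4)
z2 = np.concatenate([z[src], z[dst]], axis=1)

# One unnormalized attention score per edge: shape (num_edges, 1)
e = leaky_relu(z2 @ a)[:, None]
print(e)
```

Each edge ends up with a single scalar score; negative pre-activations are shrunk rather than zeroed, which is what distinguishes Leaky ReLU from ReLU.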
The reduce_func function implements the third and fourth equations: it multiplies the linearly transformed feature vector of each neighboring node by its attention score and sums the results. The update_all function takes reduce_func as input and uses it to update the node features.
So, in summary, reduce_func has two main tasks:
- Normalize the attention scores using softmax.
- Multiply the normalized attention scores by the feature vectors of the neighboring nodes and sum them.
def reduce_func(self, nodes):
    # reduce UDF for equation (3) & (4)
    # equation (3)
    alpha = F.softmax(nodes.mailbox['e'], dim=1)
    #print(nodes.mailbox['e'].shape)
    # equation (4)
    h = torch.sum(alpha * nodes.mailbox['z'], dim=1)
    #print(h.shape)
    return {'h': h}
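The two steps can be reproduced for a single node with plain NumPy. In this sketch the node has three incoming messages; the scores and features are invented for illustration, and `mailbox_e` and `mailbox_z` are stand-ins for the DGL mailbox tensors of one node.

```python
import numpy as np

# Unnormalized scores of 3 incoming edges for one node: shape (3, 1)
mailbox_e = np.array([[0.0], [0.0], [np.log(2.0)]])

# Transformed features of the 3 source nodes: shape (3, 2)
mailbox_z = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [2.0, 2.0]])

# Equation (3): softmax over the incoming edges (max subtracted for stability)
exp_e = np.exp(mailbox_e - mailbox_e.max())
alpha = exp_e / exp_e.sum(axis=0, keepdims=True)

# Equation (4): attention-weighted sum of the neighbor features
h = (alpha * mailbox_z).sum(axis=0)
print(alpha.ravel(), h)
```

The third message has twice the exponentiated score of the other two, so it receives weight 0.5 and the others 0.25 each, giving the updated feature vector [1.25, 1.25].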
Multi Attention Graph Convolution
Now let’s build a multi-head attention module that repeats the graph attention layer n times (num_heads). The MultiHeadGATLayer class below uses GATLayer as its main building block:
class MultiHeadGATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim, num_heads, merge='cat'):
        super(MultiHeadGATLayer, self).__init__()
        self.heads = nn.ModuleList()
        for i in range(num_heads):
            self.heads.append(GATLayer(g, in_dim, out_dim))
        self.merge = merge

    def forward(self, h):
        head_outs = [attn_head(h) for attn_head in self.heads]
        #print(head_outs[0].shape)
        if self.merge == 'cat':
            # concat on the output feature dimension (dim=1)
            return torch.cat(head_outs, dim=1)
        else:
            # merge using average over the heads (dim=0 of the stacked tensor);
            # without dim=0 the mean would collapse to a single scalar
            return torch.mean(torch.stack(head_outs), dim=0)
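The two merge modes differ in output shape. The NumPy sketch below uses two fake head outputs for 4 nodes with 3 features each (the values are arbitrary) to show that concatenation widens the feature dimension, while averaging is taken across heads so that each node keeps its own vector.

```python
import numpy as np

# Two hypothetical attention-head outputs, each of shape (num_nodes=4, out_dim=3)
head_outs = [np.ones((4, 3)), 3 * np.ones((4, 3))]

# merge='cat': concatenate along the feature dimension -> (4, 6)
merged_cat = np.concatenate(head_outs, axis=1)

# merge='mean': stack heads on a new leading axis and average over it -> (4, 3)
merged_mean = np.stack(head_outs).mean(axis=0)

print(merged_cat.shape, merged_mean.shape)
```

Averaging without specifying the head axis would reduce everything to a single scalar, which is why the axis matters for the mean merge.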
Now let’s put it all together in one class and call it GAT:
class GAT(nn.Module):
    def __init__(self, g, in_dim, hidden_dim, out_dim, num_heads):
        super(GAT, self).__init__()
        self.layer1 = MultiHeadGATLayer(g, in_dim, hidden_dim, num_heads)
        # Be aware that the input dimension is hidden_dim*num_heads since
        # multiple head outputs are concatenated together. Also, only
        # one attention head in the output layer.
        self.layer2 = MultiHeadGATLayer(g, hidden_dim * num_heads, out_dim, 1)

    def forward(self, h):
        h = self.layer1(h)
        h = F.elu(h)
        h = self.layer2(h)
        return h
Finally, let’s train the model on the Cora dataset:
import time
import numpy as np

g, features, labels, mask = load_cora_data()

# create the model, 2 heads, each head has hidden size 8
net = GAT(g,
          in_dim=features.size()[1],
          hidden_dim=8,
          out_dim=7,
          num_heads=2)

# create optimizer
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# main loop
dur = []
for epoch in range(50):
    # the first 3 epochs are treated as warm-up and are not timed
    if epoch >= 3:
        t0 = time.time()

    logits = net(features)
    logp = F.log_softmax(logits, 1)
    # the loss is computed on the training nodes only
    loss = F.nll_loss(logp[mask], labels[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch >= 3:
        dur.append(time.time() - t0)

    # dur is empty during warm-up, so report NaN instead of averaging an empty list
    avg_time = np.mean(dur) if dur else float('nan')
    print("Epoch {:05d} | Loss {:.4f} | Time(s) {:.4f}".format(
        epoch, loss.item(), avg_time))
References & Links
- [1] A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
- [2] Graphs for Graph Neural Networks
- [3] Semi-Supervised Classification with Graph Convolutional Networks
- [4] Attention Is All You Need
- [5] Cora Dataset
- [6] Tutorial 7: Graph Neural Networks
- [7] Graph Convolutional Networks in PyTorch: Github Repo.
- [8] Deep Graph Library Documentation.
- [9] A Comprehensive Introduction to Graph Neural Networks (GNNs)