Analyzing Bank Fraud Dataset using Python (Neo4j, Spark, NetworkX)
Graph data structures are widely used in many fields, including finance. In particular, social network analysis (SNA) can be used to analyze financial transaction data and detect fraudulent activity. In this blog post, we will explore Python code that analyzes a bank fraud dataset in two ways: first with Neo4j as a distributed graph database alongside Spark DataFrames for big-data processing, which makes it easier to store and analyze large-scale graph data, and second with the NetworkX library.
Using Neo4j
By using graph algorithms to analyze the network structure of the data, we can extract important nodes from the graph and perform online analytical processing through Neo4j's distributed clustering packages. Finally, applying a Support Vector Machine or a neural network in a distributed environment for link prediction on the stored graph can provide better insight into fraudulent activity.
We will be using the Cypher query language and the Neo4j graph database to run three queries that can be used to extract important information from a graph.
Query 1: Top 10 nodes with the highest degree centrality
Degree centrality is a measure of the number of edges (or connections) a node has in a network. A node with high degree centrality is well connected and has many relationships with other nodes, which can indicate that it plays an important role in the network.
The following Cypher query extracts the top 10 nodes with the highest degree centrality:
MATCH (n)-[r]-()
WITH n, count(r) AS degree
RETURN n.id AS node, degree
ORDER BY degree DESC LIMIT 10
This query matches all nodes and relationships in the graph, computes each node's degree, and returns the top 10 nodes with the highest degree centrality. The count() function counts the number of relationships connected to each node.
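To run this query from Python, a minimal sketch using the official neo4j driver might look like the following (the connection URI and credentials are placeholders for your own instance):
from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

degree_query = """
MATCH (n)-[r]-()
WITH n, count(r) AS degree
RETURN n.id AS node, degree
ORDER BY degree DESC LIMIT 10
"""

with driver.session() as session:
    # Print each account id with its degree
    for record in session.run(degree_query):
        print(record["node"], record["degree"])

driver.close()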
Query 2: Top 10 nodes with the highest PageRank centrality
PageRank centrality is a measure of the importance of a node in a network based on the number and quality of the nodes that link to it. It is commonly used in web search engines to rank web pages based on their importance. A node with a high PageRank centrality is linked to many other important nodes in the network.
The following Cypher query can be used to extract the top 10 nodes with the highest PageRank centrality:
CALL gds.graph.project('accounts', 'Account', 'TRANSFER');

CALL gds.pageRank.stream('accounts', { maxIterations: 20, dampingFactor: 0.85 })
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS node, score AS pagerank
ORDER BY pagerank DESC LIMIT 10;
This query uses the Neo4j Graph Data Science (GDS) library, which replaces the older APOC graph-algorithm procedures such as apoc.algo.pageRankWithConfig(), to calculate the PageRank centrality of each node in the graph. The Account nodes and TRANSFER relationships are first projected into the in-memory graph catalog, then gds.pageRank.stream() computes PageRank with a damping factor of 0.85 and 20 iterations. The final RETURN clause lists the top 10 nodes with the highest PageRank centrality.
Query 3: Subgraph of 1000 nodes with p=0.5
A subgraph is a smaller network extracted from a larger graph based on certain criteria. The following query extracts a subgraph of up to 1000 nodes, sampling each node with probability p = 0.5:
MATCH (a:Account)
WHERE rand() < 0.5
WITH a
LIMIT 1000
MATCH (a)-[r:TRANSFER]->(b)
RETURN a, r, b
This query matches all Account nodes in the graph, keeps each one with probability 0.5, and limits the sample to 1000 nodes. It then matches the outgoing TRANSFER relationships of the sampled nodes and returns the resulting nodes and edges as the subgraph.
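As a bridge to the NetworkX analysis below, here is a hedged sketch of how such a sampled subgraph could be pulled into Python, again assuming a local Neo4j instance with placeholder credentials:
from neo4j import GraphDatabase
import networkx as nx

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

subgraph_query = """
MATCH (a:Account)
WHERE rand() < 0.5
WITH a LIMIT 1000
MATCH (a)-[r:TRANSFER]->(b)
RETURN a.id AS src, b.id AS dst
"""

# Rebuild the sampled subgraph as a NetworkX directed graph
G = nx.DiGraph()
with driver.session() as session:
    for record in session.run(subgraph_query):
        G.add_edge(record["src"], record["dst"])
driver.close()

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")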
Using NetworkX Library
The first step in this analysis is to load the financial transaction data from a CSV file named ‘BFD.csv’. This can be done with the Pandas library and the pd.read_csv() function. The code reads the first 5000 rows of the CSV file and stores them in a Pandas DataFrame.
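A minimal sketch of this loading step, assuming the file sits in the working directory:
import pandas as pd

# Read only the first 5000 transactions from the dataset
df = pd.read_csv('BFD.csv', nrows=5000)
print(df.head())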
Next, the NetworkX library is used to create a directed graph from the transaction data. The nx.DiGraph() function is used to create a new directed graph, and the G.add_nodes_from() function is used to add nodes to the graph. The for loop iterates over the rows of the Pandas data frame and adds edges to the graph using the ‘nameOrig’ and ‘nameDest’ nodes with a weight equal to the ‘amount’ of the transaction.
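A sketch of that graph construction, assuming the ‘nameOrig’, ‘nameDest’, and ‘amount’ columns described above:
import networkx as nx

G = nx.DiGraph()
# Register every account that appears as a sender or receiver
G.add_nodes_from(df['nameOrig'])
G.add_nodes_from(df['nameDest'])
# Add one directed edge per transaction, weighted by its amount
for _, row in df.iterrows():
    G.add_edge(row['nameOrig'], row['nameDest'], weight=row['amount'])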
Once the graph is created, it can be visualized using the nx.draw() and plt.show() functions. This allows us to see the structure of the graph and the relationships between the nodes.
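For example (the node size and label settings here are arbitrary choices to keep a few thousand accounts readable):
import matplotlib.pyplot as plt

# Draw the transaction graph without labels; small nodes for large graphs
nx.draw(G, node_size=10, with_labels=False)
plt.show()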
Now that we have our graph, we can calculate different centrality measures for each node. Centrality measures are used to identify important nodes in a graph based on their position or influence. Several different centrality measures can be used; in this code, we will use three: degree centrality, betweenness centrality, and eigenvector centrality.
Degree centrality measures the number of edges that a node has in the graph. Nodes with high degree centrality are well connected and have a large number of neighbors. Betweenness centrality measures the number of shortest paths that pass through a node. Nodes with high betweenness centrality are often critical to the graph's connectivity. Eigenvector centrality measures the influence of a node based on the influence of its neighbors. Nodes with high eigenvector centrality are connected to other important nodes in the graph.
To calculate the centrality measures, the code uses the nx.degree_centrality(), nx.betweenness_centrality(), and nx.eigenvector_centrality() functions. These functions return a dictionary of nodes with their corresponding centrality values. The sorted() function is used to extract the top 10 nodes with the highest centrality values for each centrality measure.
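Putting this together, one possible sketch is shown below; note that eigenvector centrality may need a higher iteration cap to converge on sparse transaction graphs, so max_iter=1000 is an assumption here:
# Compute the three centrality measures on the transaction graph
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)  # assumed iteration cap

# Extract and print the top 10 nodes for each measure
for name, scores in [('degree', degree),
                     ('betweenness', betweenness),
                     ('eigenvector', eigenvector)]:
    top10 = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:10]
    print(f'Top 10 by {name} centrality: {top10}')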
In conclusion, this blog provides a simple and effective way to analyze financial transaction data using the NetworkX library. By creating a directed graph and calculating centrality measures, we can identify important nodes in the graph and detect potentially fraudulent activity.