How To Use
Toy Example
Here is a toy example that demonstrates how to train a KGE model:
1import numpy as np
2from KGE.data_utils import index_kg, convert_kg_to_index
3from KGE.models.translating_based.TransE import TransE
4
5# load data
6train = np.loadtxt("./data/fb15k/train/train.csv", dtype=str, delimiter=',')
7valid = np.loadtxt("./data/fb15k/valid/valid.csv", dtype=str, delimiter=',')
8test = np.loadtxt("./data/fb15k/test/test.csv", dtype=str, delimiter=',')
9
10# index the kg data
11metadata = index_kg(train)
12
13# conver kg into index
14train = convert_kg_to_index(train, metadata["ent2ind"], metadata["rel2ind"])
15valid = convert_kg_to_index(valid, metadata["ent2ind"], metadata["rel2ind"])
16test = convert_kg_to_index(test, metadata["ent2ind"], metadata["rel2ind"])
17
18# initialized TransE model object
19model = TransE(
20 embedding_params={"embedding_size": 32},
21 negative_ratio=10,
22 corrupt_side="h+t",
23)
24
25# train the model
26model.train(train_X=train, val_X=valid, metadata=metadata, epochs=10, batch_size=64,
27 log_path="./tensorboard_logs", log_projector=True)
28
29# evaluate
30eval_result_filtered = model.evaluate(eval_X=test, corrupt_side="h", positive_X=np.concatenate((train, valid, test), axis=0))
First, you need to index the KG data. You can use KGE.data_utils.index_kg()
to index all entities and relation, this function return metadata of KG that mapping
all entities and relation to index. After creating the metadata, you can use
KGE.data_utils.convert_kg_to_index() to conver the string (h,r,t) into index.
After all preparation done for KG data, you can initialized the KGE model, train the model, and evaluate model.
You can monitor the training and validation loss, distribution of model parameters on
TensorBoard using the command tensorboard --logdir=./tensorboard_logs.
After the training procedure finished, the entities embedding are projected into lower
dimension and show on the Projector Tab in tensorboard
(if log_projector=True is given when train()).
Train KG from Disk File
The toy example above demonstrates how to train KGE model from KG data stored in Numpy Array, however, sometimes your KG may be too big that can not fit into memory, in this situation, you can train the KG from the disk file without loading them into memory:
1from KGE.data_utils import index_kg, convert_kg_to_index
2from KGE.models.translating_based.TransE import TransE
3
4train = "./data/fb15k/train"
5valid = "./data/fb15k/valid"
6
7metadata = index_kg(train)
8
9convert_kg_to_index(train, metadata["ent2ind"], metadata["rel2ind"])
10convert_kg_to_index(valid, metadata["ent2ind"], metadata["rel2ind"])
11train = train + "_indexed"
12valid = valid + "_indexed"
13
14model = TransE(
15 embedding_params={"embedding_size": 32},
16 negative_ratio=10,
17 corrupt_side="h+t"
18)
19
20model.train(train_X=train, val_X=valid, metadata=metadata, epochs=10, batch_size=64,
21 log_path="./tensorboard_logs", log_projector=True)
We use the same function KGE.data_utils.index_kg() and
KGE.data_utils.convert_kg_to_index() to deal with KG data stored in disk.
If the input of KGE.data_utils.convert_kg_to_index() is a string path folder
but a numpy array, it won’t return the indexed numpy array, instead it save the indexed KG
to the disk with suffix _indexed.
Data folder can have multiple CSVs that store the different partitions of KG like that:
./data/fb15k
├── test
│ ├── test.csv
│ ├── test1.csv
│ └── test2.csv
├── train
│ ├── train.csv
│ ├── train1.csv
│ └── train2.csv
└── valid
├── valid.csv
├── valid1.csv
└── valid2.csv
Train-Test Splitting KG Data
In the example above we use the benchmark dataset FB15K which is split into the train, valid, test already, but when you bring your own KG data, you should split data by yourself. Note that when splitting the KG data, we need to guarantee that the entities in test data are also present in the train data, otherwise, the entities not in the train would not have embeddings being trained.
You can use KGE.data_utils.train_test_split_no_unseen() to split
the KG data that guarantee the entities in test data are also present in
the train data.
Warning
KGE.data_utils.train_test_split_no_unseen() only support for numpy array.
import numpy as np
from KGE.data_utils import train_test_split_no_unseen
KG = np.array(
[['DaVinci', 'painted', 'MonaLisa'],
['DaVinci', 'is_a', 'Person'],
['Lily', 'is_interested_in', 'DaVinci'],
['Lily', 'is_a', 'Person'],
['Lily', 'is_a_friend_of', 'James'],
['James', 'is_a', 'Person'],
['James', 'like', 'MonaLisa'],
['James', 'has_visited', 'Louvre'],
['James', 'has_lived_in', 'TourEiffel'],
['James', 'is_born_on', 'Jan,1,1984'],
['LaJocondeAWashinton', 'is_about', 'MonaLisa'],
['MonaLis', 'is_in', 'Louvre'],
['Louvre', 'is_located_in', 'Paris'],
['Paris', 'is_a', 'Place'],
['TourEiffel', 'is_located_in', 'Paris']]
)
train, test = train_test_split_no_unseen(KG, test_size=0.1, seed=12345)