Margin-based Semi-supervised Learning Using Apollonius circle
MONA EMADI AND JAFAR TANHA
TTCS 2020
Training data:
➢ Supervised learning: all labeled data → model
➢ Semi-supervised learning: some labeled data + lots of unlabeled data → model
➢ Unsupervised learning: all unlabeled data → model
Self-training:
Step 1: initialize the labeled training data L = L_init
Step 2: f = learn classifier(L)
Step 3: apply f to the unlabeled data U
Step 4: augment the training data: L ← L ∪ L_self, where L_self is the k examples with the most confident predictions; remove these examples from the unlabeled pool
Step 5: repeat from Step 2
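The generic self-training loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a simple nearest-centroid classifier stands in for the SVM base learner, and the names (`self_training`, `k`) are ours.

```python
# Sketch of the self-training loop (Steps 1-5).  A nearest-centroid
# classifier stands in for the SVM base learner; `k` is the number of
# confident examples moved from U to L per round.
import numpy as np

class NearestCentroid:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_with_confidence(self, X):
        # Confidence = negative distance to the closest class centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)], -d.min(axis=1)

def self_training(X_lab, y_lab, X_unlab, k=5, max_iter=20):
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = NearestCentroid()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                               # Step 2: f = learn classifier(L)
        y_pred, conf = clf.predict_with_confidence(X_u) # Step 3: apply f to U
        top = np.argsort(conf)[-k:]                     # k most confident predictions
        X_l = np.vstack([X_l, X_u[top]])                # Step 4: L <- L ∪ L_self
        y_l = np.concatenate([y_l, y_pred[top]])
        X_u = np.delete(X_u, top, axis=0)               # remove them from the pool
    return clf.fit(X_l, y_l)
```

The stopping rule here (fixed iteration budget or empty pool) is one common choice; the paper's selection criterion for L_self is the geometric construction described in the following slides.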
➢ Evaluating the use of data points close to the decision boundary to improve classification performance.
➢ Proposing a geometry-based selection metric to find informative unlabeled data points.
➢ Defining a new metric that measures the similarity between labeled and unlabeled data points based on the proposed geometrical structure.
➢ Proposing an agreement-based approach for selecting among the newly-labeled data, based on the classifier predictions and the proposed neighborhood construction algorithm.
[Figure: an Apollonius circle through a point M with fixed points A and B; the circle meets the line AB at C and D, with distances d1 = d(A, M) and d2 = d(M, B).]
Apollonius circle: the locus of points M in the Euclidean plane whose ratio of distances to two fixed points A and B is a constant k, i.e., d(A, M) / d(M, B) = k.
[Figure: Apollonius circles of the fixed points A and B for k < 1, k = 1 (a straight line), and k > 1.]
Depending on k, the region $D_{AB}$ bounded by the Apollonius circle contains A ($k < 1$) or B ($k > 1$); for $k = 1$ the circle degenerates to a straight line:

$$D_{AB} = \begin{cases} D_A & \text{if } k < 1 \\ D_B & \text{if } k > 1 \\ D_{\inf} & \text{if } k = 1 \end{cases}$$
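For $k \neq 1$ the Apollonius circle has a closed form: squaring $\|M - A\| = k\|M - B\|$ and completing the square gives center $(A - k^2 B)/(1 - k^2)$ and radius $k\|A - B\| / |1 - k^2|$. A minimal sketch (the function name is ours):

```python
import numpy as np

def apollonius_circle(A, B, k):
    """Center and radius of the locus {M : d(A, M) / d(M, B) = k}, k != 1.
    Derived by squaring |M - A|^2 = k^2 |M - B|^2 and completing the square."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    if np.isclose(k, 1.0):
        raise ValueError("k = 1 degenerates to the perpendicular bisector of AB")
    center = (A - k**2 * B) / (1 - k**2)
    radius = k * np.linalg.norm(A - B) / abs(1 - k**2)
    return center, radius
```

For example, with A = (0, 0), B = (2, 0), and k = 0.5, every point on the returned circle is half as far from A as from B.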
➢ Local density $\rho_i$ is defined as:

$$\rho_i = \exp\left(-\frac{1}{r} \sum_{M_k \in N(M_i)} d(M_i, M_k)^2\right), \qquad d(M_i, M_k) = \|M_i - M_k\|, \qquad r = p \times n$$

where $N(M_i)$ is the neighborhood of $M_i$, $n$ is the number of data points, and $p$ is a percentage parameter.

➢ $\delta_i$ is the minimum distance between $M_i$ and any other sample with density higher than $\rho_i$:

$$\delta_i = \begin{cases} \min_{k:\, \rho_i < \rho_k} d(M_i, M_k) & \text{if } \exists k \text{ s.t. } \rho_i < \rho_k \\ \max_k d(M_i, M_k) & \text{otherwise} \end{cases}$$

➢ Peaks (high-density points) are obtained using the score function $\mathrm{score}(M_i) = \rho_i \times \delta_i$.
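The quantities $\rho_i$, $\delta_i$, and the score can be sketched directly. This is an illustrative reading, assuming the neighborhood $N(M_i)$ is the $r$ nearest neighbors with $r = p \times n$:

```python
import numpy as np

def density_peaks_scores(X, p=0.02):
    """Compute rho_i (local density over the r nearest neighbors, r = p * n),
    delta_i (distance to the nearest denser point, or the farthest point if
    none is denser), and score_i = rho_i * delta_i for each data point."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise d(M_i, M_k)
    r = max(1, int(p * n))                                     # r = p * n
    nearest = np.sort(D, axis=1)[:, 1:r + 1]                   # skip self-distance 0
    rho = np.exp(-(nearest**2).sum(axis=1) / r)
    delta = np.empty(n)
    for i in range(n):
        denser = D[i, rho > rho[i]]
        delta[i] = denser.min() if denser.size else D[i].max()
    return rho, delta, rho * delta
```

On data with two well-separated clusters, the two highest-scoring points land one in each cluster, which is exactly how the peaks are used in the next step.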
Neighborhood groups with the Apollonius circle
Peak points: $P = (P_1, P_2, \dots, P_m)$, $t \in \{1, 2, \dots, m-1\}$
Non-peak data points: $M = \{M_i \mid i \in \{1, 2, \dots, n-m\}\}$, $M_i \notin P$

Farthest data points are defined as:

$$Gd_{P_t} = \max\left\{ d(P_t, M_i) \;\middle|\; M_i \in M \text{ and } d(P_t, M_i) < d(P_t, P_{t+1}) \text{ and } d(P_t, M_i) < \min_{l=1,\, l \neq t}^{m} d(P_l, M_i) \right\}$$

$$GP_{P_t} = \{ M_i \mid d(P_t, M_i) = Gd_{P_t} \}$$
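The $Gd$/$GP$ construction can be sketched as follows. The function name and the peak-ordering convention (peaks given as a sequence so that $P_{t+1}$ is well defined) are our reading of the formula, not code from the paper:

```python
import numpy as np

def farthest_points(peaks, points):
    """For each peak P_t (except the last), find Gd_{P_t}: the largest
    distance to a non-peak point that is (a) smaller than d(P_t, P_{t+1})
    and (b) attained by a point closer to P_t than to every other peak.
    GP_{P_t} is the point(s) attaining that distance."""
    peaks = np.asarray(peaks, float)
    points = np.asarray(points, float)
    result = {}
    for t in range(len(peaks) - 1):
        d_t = np.linalg.norm(points - peaks[t], axis=1)
        d_others = np.array([np.linalg.norm(points - peaks[l], axis=1)
                             for l in range(len(peaks)) if l != t])
        ok = (d_t < np.linalg.norm(peaks[t] - peaks[t + 1])) & (d_t < d_others.min(axis=0))
        if ok.any():
            Gd = d_t[ok].max()
            GP = points[ok][d_t[ok] == Gd]
            result[t] = (Gd, GP)
    return result
```

For instance, with peaks at (0, 0) and (5, 0) and non-peak points at (1, 0), (2, 0), (4, 0), (6, 0), the point (4, 0) is excluded because it lies closer to the second peak, so the farthest admissible point for the first peak is (2, 0).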
Example dataset for finding farthest points and grouping
With peaks $\{2, 5, 8\}$ and remaining points $\{1, 3, 4, 6, 7, 9, 10\}$:

$$Gd_2 = \max\{ d(2, M_i) \mid M_i \in \{1, 3, 4, 6, 7, 9, 10\} \text{ and } d(2, M_i) < d(5, M_i) \text{ and } d(2, M_i) < d(8, M_i) \text{ and } d(2, M_i) < d(2, 8) \} \Rightarrow Gd_2 = d(2, 3) \Rightarrow GP_2 = 3$$
Example for making neighborhood groups with the Apollonius circle
[Figure: three panels showing (1) Class 1, Class 2, and the unlabeled data; (2) the same data with Peak 1 and Peak 2 marked; (3) the peaks together with their farthest points.]
Name      #Examples  #Attributes (D)  #Classes
Iris      150        4                3
Wine      178        13               3
Seeds     210        7                3
Thyroid   215        5                3
Glass     214        9                6
Banknote  1372       4                2
Liver     345        6                2
Blood     748        4                2
Experimental results: accuracy comparison of the algorithms with 10% labeled data
Dataset   Supervised SVM  Self-training SVM  STC-DPC  Our algorithm
Iris      92.50           87.00              91.00    95.76
Wine      88.30           90.81              86.96    91.40
Seeds     84.16           74.40              81.19    92.35
Thyroid   88.95           87.21              89.65    91.72
Glass     47.44           51.15              51.15    51.93
Banknote  98.39           98.77              98.12    96.62
Liver     58.04           57.31              55.29    61.90
Blood     72.42           72.58              72.01    74.98
Accuracy of our algorithm using all unlabeled data versus only the unlabeled data near the decision boundary
Dataset      All unlabeled data  Selected unlabeled data
Banknote     96.58               96.62
Liver        59.85               61.90
Blood        75.45               74.98
Heart        74.78               78.25
Hypothyroid  78.78               78.25
Diabetes     62.72               63.47
Parkinson    80.62               80.62
[Figure: results on the Banknote dataset.]
[Figure: results on the Iris, Seeds, and Wine datasets.]
➢ We proposed a semi-supervised self-training method based on the Apollonius circle.
➢ First, candidate data points are selected from the unlabeled data to be labeled during the self-training process. Then the peak points are found using density peak clustering. The Apollonius circle corresponding to each peak point is formed, and the label of the peak point is assigned to the unlabeled data points inside that circle. The base classifier is SVM, which is a margin-based algorithm.
➢ A series of experiments was performed on several datasets, and the performance of the proposed algorithm was compared with that of existing methods.
➢ The impact of selecting data close to the decision boundary was investigated. Data points close to the decision boundary affect the optimal adjustment of the decision boundary more than the farthest ones, and they also improve the classification performance.
[1] Wu, D., Shang, M., Luo, X., Xu, J., Yan, H., Deng, W., Wang, G., 2018. Self-training semi-supervised classification based on density peaks of data. Neurocomputing 275, 180-191.
[2] Pourbahrami, S., Khanli, L.M., Azimpour, S., 2019. A novel and efficient data point neighborhood construction algorithm based on Apollonius circle. Expert Systems with Applications 115, 57-67.
[3] Rodriguez, A., Laio, A., 2014. Clustering by fast search and find of density peaks. Science 344(6191), 1492-1496.
[4] Tanha, J., 2019. A multiclass boosting algorithm to labeled and unlabeled data. International Journal of Machine Learning and Cybernetics.
[5] Tanha, J., van Someren, M., Afsarmanesh, H., 2017. Semi-supervised self-training for decision tree classifiers. International Journal of Machine Learning and Cybernetics 8, 355-370.
[6] Tanha, J., van Someren, M., Afsarmanesh, H., 2014. Boosting for multiclass semi-supervised learning. Pattern Recognition Letters 37, 63-77.
[7] Zhou, Y., Kantarcioglu, M., Thuraisingham, B., 2012. Self-training with selection-by-rejection. In: ICDM '12: Proceedings of the 2012 IEEE 12th International Conference on Data Mining.
emadi.mona@pnu.ac.ir