-rw-r--r-- README.md 28
-rw-r--r-- tree.py 14
2 files changed, 15 insertions, 27 deletions
diff --git a/README.md b/README.md
index 67ba629..1579e7c 100644
--- a/README.md
+++ b/README.md
@@ -1,25 +1,13 @@
-# 2020_data_mining_assignments
+# 2020 project for the data mining course
* [link to assignment](http://www.cs.uu.nl/docs/vakken/mdm/assignment1-2020.pdf)
* [link to article datasets part 2](https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/promise2007-dataset-20a.pdf)
-## Part1, tree algorithm/implementation:
-- [X] [tree_grow](https://github.com/Vinkage/2020_data_mining_assignments/blob/e650ad27d13b392f5b6535906e36176cb0777650/assignment1.py#L321-L406) function that follows the [pseudocode in the slides](./media/tree_grow_pseudo_code.png)
-- [X] Adapt tree_grow for n_feat: a few lines specifying that the columns passed to exhaustivesplitsearch are drawn at random from x
-- [X] tree_grow_b, a bootstrap version of tree_grow that constructs a list of trees by sampling rows from x with replacement
-- [X] [tree_pred function](https://github.com/Vinkage/2020_data_mining_assignments/blob/da8ca975fb9d11d3801fef66344736e675734c42/assignment1.py#L77-L103) with efficient conditional branches
-- [X] tree_pred_b, a function that uses a list of trees to make a prediction for the rows of a data array x
-- [X] Figure out how we want to compute the confusion matrix (scipy?)
-- [X] Test prediction of single tree on pima indians data with nmin 20 and minleaf 5, check with confusion matrix in [link to assignment](http://www.cs.uu.nl/docs/vakken/mdm/assignment1-2020.pdf)
-- [X] Test prediction on [educode](https://uu.educode.nl/login/?next=/submissions/646/feedback/95016)
+This project is a from-scratch implementation of a classification tree using
+the theory we learned in the course. We were awarded a 10/10 for the coding
+part and an 8.5/10 for the report. The code was said to be very readable, and
+we avoided time bottlenecks by using numpy.
-
-## Part2, data analysis:
-- [ ] Collect datasets from the literature
-- [ ] Describe the datasets; explore, plot, and format them where needed
-
-### Official steps
-![](./media/steps_data_anal.png)
-
-## The report
-![](./media/report_reqs.png)
+The concepts of impurity reduction and the Gini index were used to construct an
+algorithm that computes the "best split" at each step. We were also required to
+implement ensemble methods.
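
The "best split" idea described in the README text above can be illustrated with a minimal sketch. This is not the code from tree.py: the names `gini` and `best_split` are hypothetical, labels are assumed to be binary (0/1), the Gini index is assumed to be `p * (1 - p)`, and candidate thresholds are assumed to be the midpoints between consecutive distinct feature values.

```python
import numpy as np


def gini(y: np.ndarray) -> float:
    """Gini index of a binary 0/1 label vector, assumed here to be p * (1 - p)."""
    if len(y) == 0:
        return 0.0
    p = float(np.mean(y))  # fraction of class-1 labels
    return p * (1.0 - p)


def best_split(x: np.ndarray, y: np.ndarray):
    """Hypothetical helper: best threshold on one numeric feature.

    Returns (threshold, impurity_reduction), or (None, 0.0) when no
    candidate split reduces the impurity.
    """
    values = np.unique(x)                        # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    parent = gini(y)
    best_threshold, best_reduction = None, 0.0
    for c in candidates:
        left, right = y[x <= c], y[x > c]
        # Impurity of the children, weighted by the share of rows in each.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        reduction = parent - weighted
        if reduction > best_reduction:
            best_threshold, best_reduction = c, reduction
    return best_threshold, best_reduction


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, reduction = best_split(x, y)
print(threshold, reduction)
```

A full tree would repeat this search over every candidate feature at each node and recurse on the two resulting subsets; stopping rules such as nmin and minleaf are left out of the sketch.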
diff --git a/tree.py b/tree.py
index dcd5551..c5ecb65 100644
--- a/tree.py
+++ b/tree.py
@@ -1,10 +1,10 @@
import numpy as np
from sklearn import metrics
-#- Names and student no.:
+#- Made by:
-# Hunter Sterk 6981046
-# Lonnie Bregman 6980562
-# Mike Vink 5585791
+# Hunter Sterk
+# Lonnie Bregman
+# Mike Vink
#- Main functions:
@@ -173,7 +173,7 @@ from sklearn import metrics
# the classes vector. Note that when the number of 1 and 0 elements are
# equal, it returns 0.
-# EXAMPLE:
+# EXAMPLE:
# >>> y
# array([0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])
# >>> major_vote(y)
@@ -221,7 +221,7 @@ from sklearn import metrics
# (1.4285714285714286, array([False, False, False, False, False, False, True, True, True,
# False]), array([ True, True, True, True, True, True, False, False, False,
# True]), 36.0)
-#
+#
# """
@@ -427,7 +427,7 @@ def tree_grow_b(x=None,
def tree_pred_b(x=None, tr=None, true=None) -> np.array:
"""
The repeated application of tree.predict to construct a 2D array which is
- used to make a majority vote label prediction for the rows in x.
+ used to make a majority vote label prediction for the rows in x.
"""
y_bag = np.zeros((len(x), len(tr)))
for i, tree in enumerate(tr):