-rw-r--r--  README.md | 28
-rw-r--r--  tree.py   | 14
2 files changed, 15 insertions, 27 deletions
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1,25 +1,13 @@
-# 2020_data_mining_assignments
+# 2020 project for the data mining course
 
 * [link to assignment](http://www.cs.uu.nl/docs/vakken/mdm/assignment1-2020.pdf)
 * [link to article datasets part 2](https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/promise2007-dataset-20a.pdf)
 
-## Part1, tree algorithm/implementation:
-- [X] [tree_grow](https://github.com/Vinkage/2020_data_mining_assignments/blob/e650ad27d13b392f5b6535906e36176cb0777650/assignment1.py#L321-L406) function that follows the [pseudocode in the slides](./media/tree_grow_pseudo_code.png)
-- [X] adapt tree_grow for n_feat: a few lines that pick the columns given to the exhaustive split search at random from x
-- [X] tree_grow_b, bootstrap version of tree_grow that constructs a list of trees by drawing rows from x with replacement
-- [X] [tree_pred function](https://github.com/Vinkage/2020_data_mining_assignments/blob/da8ca975fb9d11d3801fef66344736e675734c42/assignment1.py#L77-L103) with efficient conditional branches
-- [X] tree_pred_b, a function that can use a list of trees to make a prediction for the rows in a data array x
-- [X] Figure out how we want to compute the confusion matrix (scipy?)
-- [X] Test prediction of single tree on Pima Indians data with nmin 20 and minleaf 5, check with confusion matrix in [link to assignment](http://www.cs.uu.nl/docs/vakken/mdm/assignment1-2020.pdf)
-- [X] Test prediction on [educode](https://uu.educode.nl/login/?next=/submissions/646/feedback/95016)
+This project is a from-scratch implementation of a classification tree using
+the theory we learned in the course. We were awarded 10/10 for the coding
+part and 8.5/10 for the report; the code was said to be very readable, and we
+also avoided time bottlenecks by using numpy.
 
-
-## Part2, data analysis:
-- [ ] Collect datasets from the literature
-- [ ] Describe the datasets and explore/plot/format them where needed
-
-### Official steps
-
-
-## The report
-
+The concepts of impurity reduction and the Gini index were used to construct an
+algorithm that computes the "best split" at each step. We were also required to
+implement ensemble methods.
diff --git a/tree.py b/tree.py
--- a/tree.py
+++ b/tree.py
@@ -1,10 +1,10 @@
 import numpy as np
 from sklearn import metrics
 
-#- Names and student no.:
-# Hunter Sterk 6981046
-# Lonnie Bregman 6980562
-# Mike Vink 5585791
+#- Made by:
+# Hunter Sterk
+# Lonnie Bregman
+# Mike Vink
 
 
 #- Main functions:
@@ -173,7 +173,7 @@ from sklearn import metrics
 # the classes vector. Note that when the number of 1 and 0 elements are
 # equal, it returns 0.
-# EXAMPLE: 
+# EXAMPLE:
 # >>> y
 # array([0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])
 # >>> major_vote(y)
@@ -221,7 +221,7 @@ from sklearn import metrics
 # (1.4285714285714286, array([False, False, False, False, False, False, True, True, True,
 # False]), array([ True, True, True, True, True, True, False, False, False,
 # True]), 36.0)
-# 
+#
 # """
@@ -427,7 +427,7 @@ def tree_grow_b(x=None,
 def tree_pred_b(x=None, tr=None, true=None) -> np.array:
     """ The repeated application of tree.predict to construct a 2D array which is
-    used to make a majority vote label prediction for the rows in x. 
+    used to make a majority vote label prediction for the rows in x.
     """
     y_bag = np.zeros((len(x), len(tr)))
     for i, tree in enumerate(tr):
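The new README text says the split search is driven by impurity reduction with the Gini index. The repository's own exhaustive split search is not shown in this diff, so the following is only a minimal numpy sketch of that idea for a single numeric column and binary labels; the names `gini` and `best_split` and the toy data are made up for illustration and do not reproduce the assignment's return format.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary (0/1) label vector: p * (1 - p)."""
    p = y.mean()
    return p * (1.0 - p)

def best_split(x_col, y):
    """Exhaustively search one numeric column for the threshold with the
    largest impurity reduction (Gini index) when splitting y."""
    best_gain, best_threshold = 0.0, None
    values = np.sort(np.unique(x_col))
    # candidate thresholds lie halfway between consecutive distinct values
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x_col <= threshold], y[x_col > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        gain = gini(y) - weighted
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

# toy data: one made-up numeric column and the binary label vector
# that also appears in the major_vote docstring above
x_col = np.array([22., 46., 24., 25., 29., 45., 63., 38., 30., 48.])
y = np.array([0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])
print(best_split(x_col, y))
```

Growing a tree then amounts to applying this search to every allowed column, splitting on the best one, and recursing until constraints like the nmin and minleaf values mentioned in the old checklist stop further splits.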

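The README also mentions ensemble methods, and the last hunk shows tree_pred_b filling a y_bag array with one column of predictions per tree. As a rough sketch of the bagging mechanics only, not the repository's API, the example below grows trees on bootstrap samples and combines them by majority vote; it substitutes sklearn's DecisionTreeClassifier for the project's own tree class, and grow_bagged_trees / predict_bagged are invented names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_bagged_trees(x, y, m, rng):
    """Grow m trees, each fitted on a bootstrap sample (rows drawn with replacement)."""
    trees = []
    for _ in range(m):
        rows = rng.integers(0, len(x), size=len(x))
        trees.append(DecisionTreeClassifier().fit(x[rows], y[rows]))
    return trees

def predict_bagged(x, trees):
    """Collect one column of predictions per tree, then take a per-row
    majority vote, as tree_pred_b does with its y_bag array."""
    y_bag = np.zeros((len(x), len(trees)))
    for i, tree in enumerate(trees):
        y_bag[:, i] = tree.predict(x)
    # label a row 1 when more than half of the trees predict 1 (ties go to 0)
    return (y_bag.mean(axis=1) > 0.5).astype(float)

# toy data: two numeric features and binary labels
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
trees = grow_bagged_trees(x, y, m=25, rng=rng)
print(predict_bagged(x[:5], trees))
```

Restricting each split to a random subset of n_feat columns, as in the n_feat item of the old checklist, would turn this bagging setup into a random forest.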