StatQuest: Random Forests Part 2: Missing data and clustering

65 thoughts on “StatQuest: Random Forests Part 2: Missing data and clustering”

  • Basically, random forests are very powerful in many respects, and one of the key factors is the proximity matrix, which in turn opens many doors. Great to know!!! Thank you once again, Joshua. Great explanation with great effort! Bam!!
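
A minimal sketch of how such a proximity matrix can be computed with scikit-learn (an illustrative assumption, not code from the video): each tree reports which leaf every sample lands in, and the proximity of two samples is the fraction of trees in which they end up in the same leaf.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Any labeled dataset works; this one ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, t] = index of the leaf that sample i reaches in tree t
leaves = forest.apply(X)

# proximity[i, j] = fraction of trees in which samples i and j share a leaf
n_samples, n_trees = leaves.shape
proximity = np.zeros((n_samples, n_samples))
for t in range(n_trees):
    proximity += leaves[:, t][:, None] == leaves[:, t][None, :]
proximity /= n_trees
```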

  • Awesome video, Joshua… Just a question: is there a recommended minimum number of observations for RF?

  • Thanks for the excellent lecture! How about using a decision tree or random forest for cluster analysis in unsupervised learning?

  • Can we use a random forest FIRST to predict the missing data points and then use the completed data to train the model for our main outcome of interest?

  • Thank you very much, sir.
    Both videos about random forests were extremely helpful! I loved the clear explanations and all the jokes at random moments.
    Keep up the good work!!

  • May I assume the procedure is the following: first, you fit the model while ignoring the missing data; then you apply the algorithm discussed in this video to impute the missing values; then you use the new data set with the imputed values to refit the RF model?

  • Thanks, this was super useful!! I think it would have been pretty hard to fully grasp these concepts just by reading about them. The clustering part was especially interesting.

  • Can random forests be used when the final predictions are not binary, i.e., when heart disease can be yes, no, or maybe? For such non-binary classifications, can we use trees or forests?

  • Hi Josh, could you tell me which package implements imputation with random forests via the exciting algorithm you mentioned? MICE apparently works. Are any other options available?
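
One option in Python, sketched under the assumption that a MICE-style imputer with forest estimators is acceptable (this uses scikit-learn's experimental API, not the exact proximity algorithm from the video, and the toy data is made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy numeric data with one missing entry, marked with np.nan.
X = np.array([[167.0, 60.0],
              [180.0, np.nan],
              [175.0, 72.0],
              [160.0, 55.0]])

# IterativeImputer fills each column using a model fit on the others;
# a RandomForestRegressor as the estimator makes it forest-based.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
```

In R, `randomForest::rfImpute` implements the proximity-based method described in the video, and the `missForest` package is another widely used forest-based imputer.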

  • Hi Josh! Thanks for this great video. I have two follow-up questions, though. I am a bit confused about why we need several iterations and what exactly changes from iteration to iteration. Do we actually rebuild the entire random forest after each iteration using the imputed values from the previous iteration? If so, I imagine this would change the weights based on the proximity matrix and, as a result, the imputed values. Second, can we use this approach, or something similar, beyond random forests? If not, are there other well-established approaches to imputing missing data for algorithms like linear regression, logistic regression, boosting models, etc.? Thank you so much for your awesome work.
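
On the first question the answer is yes: in the proximity-based scheme the whole forest is rebuilt each iteration, so the proximities, and therefore the imputed values, change from pass to pass. A rough sketch of that loop for numeric features (every name here, and the use of scikit-learn, is an illustrative assumption rather than the video's exact code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def impute_iteratively(X, y, missing_mask, n_iter=6):
    """Proximity-based imputation sketch for numeric features.

    X            : (n_samples, n_features) array with initial guesses
                   (e.g. column medians) filled in where values were missing
    missing_mask : boolean array, True where a value was originally missing
    """
    X = X.copy()
    for _ in range(n_iter):
        # 1. Rebuild the forest on the current, partially imputed data.
        forest = RandomForestClassifier(n_estimators=100).fit(X, y)

        # 2. Recompute proximities from shared leaves.
        leaves = forest.apply(X)
        prox = np.zeros((X.shape[0], X.shape[0]))
        for t in range(leaves.shape[1]):
            prox += leaves[:, t][:, None] == leaves[:, t][None, :]
        prox /= leaves.shape[1]
        np.fill_diagonal(prox, 0.0)  # a sample shouldn't vote for itself

        # 3. Refine each missing cell as a proximity-weighted average
        #    of the other samples' values in that column.
        for i, j in zip(*np.where(missing_mask)):
            X[i, j] = np.average(X[:, j], weights=prox[i])
    return X
```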

  • I believe your videos outshine almost all related videos!! This is a strong reference for learning the core concepts, and I am eager to see more of them. QUADRA BAMMM!

  • In the last minute of the video (the second missing-value example) we made Yes/No copies because the variable was categorical, but what if the variable at that point were numeric instead of a categorical Yes/No?

  • As usual, your videos are highly entertaining and informative. I do have a small query, though: why did you take 0.1 and 0.8 as the weights for "No"? In the proximity matrix, 0.8 is in the 3rd row and 4th column, which corresponds to the sample with the missing value.
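
For anyone puzzling over the same step, here is the arithmetic with the video's 0.1/0.1/0.8 proximities (the category assignment below is an illustrative reconstruction, not pulled verbatim from the video): the weights are the proximities from the sample with the missing value to the other samples, normalized to sum to 1.

```python
import numpy as np

# Illustrative numbers: proximities from the sample with the missing
# "Blocked Arteries" value to the three other samples, plus those
# samples' observed categories (assumed for this example).
prox = np.array([0.1, 0.1, 0.8])
categories = np.array(["Yes", "No", "No"])

weights = prox / prox.sum()  # here the proximities already sum to 1
for label in ("Yes", "No"):
    print(label, round(float(weights[categories == label].sum()), 3))
# Yes 0.1
# No  0.9  -> "No" wins, so the missing value is imputed as "No"
```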

  • Excellent video. I had been wondering about these things for a while, and now I have some answers. Can you point me to a reference where I can learn more about missing data in a test instance? If possible, can you tell me which combination method the scikit-learn random forest uses? They reference Breiman's paper, but it seems different to me. Thank you for your great work.

  • Thanks for the wonderful lecture video. But I feel there is a calculation mistake in the sample's weighted average for Weight. Kindly check and share your feedback.

  • Thanks for the great video. One question: how do you deal with missing data in the NEW sample that you want to classify?

  • Excellent tutorial… One question: do we compute similarity and add to the proximity matrix for every pair, or only for the samples that are similar to the one with the missing data? That is, do we add +1 to entries (1,2) and (2,1) if the 1st and 2nd samples end up in the same leaf of a tree, even though the missing data is in the 4th sample?

  • Josh, you should create one complete machine learning course on the Coursera or Udemy platform. I will definitely buy it, and I am sure there are many people like me.

  • Hi Josh! Thanks for these videos. Do you think that one day you will make a video on Multiple Factor Analysis? Let me know if so!

  • I really like all of your videos, but it would be very helpful if you could show the implementation in Python or R in parallel.

  • This approach is very interesting. Any idea whether there is a Python implementation of it? I cannot find much information about it. Congratulations on the whole series.

  • At 7:29, do we create a new random forest each time using the improved values from the previous iteration, or do we use the one random forest built at the start, with the initial guesses, at every step?

  • At 10:27 you say we guess iteratively to arrive at good values for the missing data, but earlier we had multiple samples with which to build the proximity matrix, whereas here there are no samples to compare against. So how do you do that???

  • 19:39: The missing value is "Blocked Arteries", which has only 2 options (YES or NO), so there are only two copies of the data.
    If the missing value were in "Weight" (a numeric variable), would there be many copies of the data, one for every possible value of "Weight"?

  • Hello, I have two questions after watching this video. (Good channel, btw.)
    I hope someone has answers.
    1. If I have 1 million samples, then my proximity matrix will be 1 million by 1 million. Is that correct? Because if it is, this method of filling in missing values would take a lot of time and memory, I suppose. (See the back-of-the-envelope estimate below.)

    2. For missing data in a new sample, what if the target value is continuous (as in decision tree regression)?
    In that case, duplicating the new sample seems a bit odd.
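
On question 1, the worry is justified. A quick back-of-the-envelope estimate for a dense float64 proximity matrix (real implementations often keep it sparse, or compute only the rows involving samples with missing values):

```python
n = 1_000_000
bytes_per_entry = 8                 # one float64
total_bytes = n * n * bytes_per_entry
print(total_bytes / 1024**4)        # ~7.3 TiB for the dense matrix
```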

  • Amazing! And I agree with @Himashu Mishra's comment: could you create a video explaining the algorithm with a Python implementation from scratch?

  • At 3:30, you say that samples 2, 3, and 4 all end up in the same leaf node. But sample 2 does have heart disease, and 3 and 4 don't! Doesn't that imply different leaf nodes?

  • 10:28: How can we use that iterative method for this particular test case? Assuming we don't have labels for the test set, we can't create a proximity matrix for the test set, right? And since this particular test case wasn't in the training set, we can't guess its missing values using the proximity matrix we computed earlier. So how exactly do we use the iterative method to make a good guess about missing values in test cases?

  • @Josh Do you know how to write Python code for this???? I am stuck on a project with missing values. Any help is much appreciated.
    Thanks in advance!

  • As always, thank you for creating the video! I still have a minor point of confusion at 10:30. After reading some of the comments and your feedback, I now know that the test samples are merged with the training samples for the iterative imputation of missing values, but I'm still a bit confused about why we need two copies, since the iterative method doesn't rely on the target variable, and for prediction we can simply use the majority vote again.
