Lesson Learnt From Data Cleansing. Part II
Ok as I have done the part I I have learnt another thing from data cleansing. I believe the best tool for data analysis is R but the learning is quite steep but I’m trying to learn it right now. Hopefully if I can get some results from this with neo4j I could move on to use R to get some more statistical results.
Now, the second challenge I got was that the CSV file is about 1GB with around 100 columns and I was trying to use Ruby to parse that. Obviously, I got Killed results from my Terminal even before I got to the first line. So, Ruby has to load everything into memory and parse that. But as I’m running everything in my small Mac. It seems to be quite a challenging. So, what I have done is to dump everything to MySQL. This time works really good. MySQL has LOAD DATA FILE which can load CSV file into a table. It only took me around
Index joined columns and changed the table schema to something small. I used VARCHAR(50) instead of VARCHAR(255) and of course I used NUMBER(3) instead of VARCHAR(5)
Still, I couldn’t get the results back in 1 hour which I think MySQL is better than that.
So, I moved on to the next one. It looked like everybody on the Internet suggested key_buffer_size or key_buffer to 4G. And changed some of the variables to pimp MySQL up a little bit.
1 2 3
Still, I couldn’t get the results back.
And suddenly I caught something in my my.cnf
1 2 3 4 5
What is said was that XAMPP is tuned for a very small computer to be able to run basic PHP and MySQL application which is why my MySQL performed really bad in this case.
I started my EC2 instance with 7GB of memory and installed MySQL (normal one) also some of the performance tuning variables. Now, I got the results back in just under 20 seconds to join two tables with around 3 millions rows.
Lesson learnt: Do not use MySQL from XAMPP for data that is big and use query that would get smaller results. For example, in my case instead of filtering out user_ids that don’t exist in Users table. I filtered user_ids that exist instead.
I also thought about of using BigQuery from Google but I guess my data is rather small and can hardly be called big data. Next time, I might try to use BigQuery just for fun.