Allen's Blog

https://www.capallen.top

https://www.capallen.top/atom.xml (RSS feed URL)

Churn or Not

Sparkify is a digital music service similar to Netease Cloud Music or QQ Music. Many users stream their favorite songs on Sparkify every day, either on the free tier, which places advertisements between songs, or on the premium subscription, where they stream music ad-free but pay a monthly flat rate. Users can upgrade, downgrade, or cancel their service at any time.

This is a customer churn prediction problem. There are many similar projects, such as the WSDM - KKBox's Churn Prediction Challenge competition on Kaggle, and a few helpful links follow:

- Customer Churn Prediction using Machine Learning (How To)
- Prediction of Customer Churn with Machine Learning
- Customer Churn Prediction and Prevention
- Hands-on: Predict Customer Churn

So our job is to mine the customers' data and implement an appropriate model to predict customer churn, in the following steps:

- Clean data: fill the NaN values, correct the data types, drop the outliers.
- EDA: explore the data to look at the features' distributions and their correlation with the key label (churn).
- Feature engineering: extract customer features and customer-behavior features; apply a standard scaler to the numerical features.
- Train and measure models: I train logistic regression, a linear SVM classifier, a decision tree, and a random forest classifier as baseline models, then tune the best of them into a better model. It is worth mentioning that the data is unbalanced because churned customers are the minority, so we choose the F1 score as the metric to measure the models' performance.

Quick Facts

A mini subset of 125 MB of the original 12 GB customer log JSON file is used to create the prediction model. The small dataset has 286,500 log entries with 18 columns.
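
A minimal sketch of how a log file like this could be inspected for its entry count and column names. It assumes one JSON object per line, which is an assumption about the file layout rather than something stated in this post; the two sample records and the helper names `load_log` and `summarize` are invented for illustration and use only four of the 18 columns.

```python
import json

# Hypothetical sample: two made-up log entries, one JSON object per line.
SAMPLE_LOG = """\
{"userId": "30", "page": "NextSong", "level": "paid", "ts": 1538352117000}
{"userId": "9", "page": "Home", "level": "free", "ts": 1538352180000}
"""


def load_log(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize(records):
    """Return (number of entries, sorted list of all column names seen)."""
    columns = sorted({key for record in records for key in record})
    return len(records), columns


records = load_log(SAMPLE_LOG)
n_entries, columns = summarize(records)
print(n_entries, columns)
```

On the real subset, the same two numbers would be expected to match the figures quoted above (286,500 entries and 18 columns).
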
The schema of the dataset is given below:

    root
     |-- artist: string (nullable = true)
     |-- auth: string (nullable = true)
     |-- firstName: string (nullable = true)
     |-- gender: string (nullable = true)
     |-- itemInSession: long (nullable = true)
     |-- lastName: string (nullable = true)
     |-- length: double (nullable = true)
     |-- level: string (nullable = true)
     |-- location: string (nullable = true)
     |-- method: string (nullable = true)
     |-- page: string (nullable = true)
     |-- registration: long (nullable = true)
     |-- sessionId: long (nullable = true)
     |-- song: string (nullable = true)
     |-- status: long (nullable = true)
     |-- ts: long (nullable = true)
     |-- userAgent: string (nullable = true)
     |-- userId: string (nullable = true)

| Column name | Description |
| --- | --- |
| artist | The artist being listened to |
| auth | Whether or not the user is logged in |
| firstName/lastName | Name of the user |
| gender | Gender of the user |
| itemInSession | Item number in the session |
| length | Length of time for the current log row |
| level | Free or paid user |
| location | Physical location of the user, including city and state |
| method | GET or PUT request |
| page | Which page the user is on in the current row |
| registration | User's registration number |
| sessionId | Session ID |
| song | Song currently being played |
| status | Web status |
| ts | Timestamp of the current row |
| userAgent | User agent of the user's browser for the POST or GET request |
| userId | User ID |

Exploratory Data Analysis

We use the Cancellation Confirmation events in the page column to define customer churn, and perform some exploratory data analysis to compare the behavior of users who stayed with that of users who churned.

churn

52 users in the dataset have churn events, a churn rate of about 23.1%. The ratio of churned to non-churned users is roughly 1:3, so this is an unbalanced dataset.

gender

Does gender have an effect on churn? The p-value we calculate is 0.20, above the 0.05 threshold, so we cannot say that it does.

page

We count each item in the page column for each group and normalize the counts.
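
A minimal sketch of the churn labeling and per-group page counting described above, in plain Python rather than the tooling actually used for this analysis. The event list is invented for illustration, the function names are hypothetical, and leaving the Cancellation Confirmation page out of the counts is a choice made in this sketch, not necessarily what the original analysis did.

```python
from collections import Counter, defaultdict

# Hypothetical mini log: (userId, page) pairs invented for illustration.
EVENTS = [
    ("1", "NextSong"), ("1", "Thumbs Up"), ("1", "NextSong"),
    ("1", "Cancellation Confirmation"),
    ("2", "NextSong"), ("2", "Home"), ("2", "NextSong"), ("2", "NextSong"),
]

CHURN_PAGE = "Cancellation Confirmation"


def churned_users(events):
    """Return the set of users with at least one Cancellation Confirmation event."""
    return {user for user, page in events if page == CHURN_PAGE}


def page_shares(events):
    """Return each page's share of events within the churned and stayed groups."""
    churned = churned_users(events)
    counts = defaultdict(Counter)
    for user, page in events:
        if page == CHURN_PAGE:
            continue  # assumption: the label-defining page is not counted as behavior
        group = "churned" if user in churned else "stayed"
        counts[group][page] += 1
    # Normalize so that the shares within each group sum to 1.
    return {
        group: {page: n / sum(c.values()) for page, n in c.items()}
        for group, c in counts.items()
    }


shares = page_shares(EVENTS)
for group, dist in shares.items():
    print(group, dist)
```

Normalizing within each group makes the two distributions comparable even though the groups differ in size.
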
NextSong clearly accounts for most of the customers' events. Thumbs Up, Thumbs Down, Home, and Add to Playlist appear to have an effect on churn too.

userAgent

We extract the customers' browser and platform from the userAgent column. Customers using Safari and iPad make up a larger proportion of the churned group.

time

We extract day-of-month, day-of-week, and hour from the ts column. Customers in the churned group have more events after the 15th of the month and fewer events on weekends.

Feature Engineering

On the basis of the EDA above, we create the following features:

Categorical Features

- gender
- level
- browser
- platform

Numerical Features

- mean, max, min, and std of length per user
- number of events for each item in page (ThumbsUp …
- number of unique songs and total songs per user
- number of unique artists per user
- percentage of operations after the 15th of the month
- percentage of operations on workdays

We apply label encoding to the categorical features and a standard scaler to the numerical features.

Modeling

We split the full dataset into train and test sets, then test baselines of four machine learning methods: Logistic Regression, Linear SVC, Decision Tree Classifier, and Random Forest Classifier. Although LinearSVC took more training time, it achieved the highest F1 score, 0.702. LogisticRegression had a medium training time and F1 score, so tuning might raise its score. I therefore chose LinearSVC and LogisticRegression for tuning in the next step; the result is as follows:

(Tuning results for Linear SVC and LogisticRegression: training time and F1 score.)

As shown above, logistic regression (0.7021) reaches nearly the same F1 score as the linear SVM classifier (0.7045), but takes 83.3% less time. Considering that this is only a quite small dataset and that our purpose is to scale up to the full 12 GB dataset, logistic regression is the best model so far in this project.
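
To illustrate why the F1 score is more informative than accuracy on an unbalanced label like churn, here is a small worked sketch. It computes the plain binary F1 for the churned class; the post does not state which F1 variant was used for the scores above, so that is an assumption, and the label vectors are invented with roughly the 1:3 churn ratio.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = churned, 0 = stayed (2 churned out of 8 users).
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
always_stay = [0, 0, 0, 0, 0, 0, 0, 0]  # never predicts churn
some_model = [1, 0, 0, 0, 0, 0, 0, 1]   # one hit, one miss, one false alarm

for name, y_pred in [("always_stay", always_stay), ("some_model", some_model)]:
    print(name, accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Both predictors reach the same accuracy of 0.75, but the one that never predicts churn gets an F1 of 0.0 while the other gets 0.5, which is why accuracy alone would be misleading here.
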

Conclusion

Reflection

In this project I set out to predict customer churn using the dataset of a music streaming service named Sparkify. This is a binary classification problem, so I chose four supervised learning algorithms to build a model. After evaluation and tuning, I found logistic regression to be the most suitable model for this project because it balances a high F1 score (0.7021) with a short training time.

By the way, I once fell into the trap of data leakage, so that all of the models achieved a performance that seemed too good to be true (an F1 score of 1). I had to go back and check my feature engineering, and found that I had put the Cancellation Confirmation event, which is the flag for churn, into the features. How awkward! The lesson is that you must be careful and patient when you work.

Improvement

There are only about 76 samples in the mini dataset above, so the model could be improved by training on a bigger dataset and tuning the hyperparameters on it. Another improvement would be to try more features or deep learning models.

Github Repo

I hope you find this interesting. Further details on this analysis, such as the code and the process followed, are available here.

