The Bank Marketing data set is a well-known data set available at:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing .
The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required to determine whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). We will work with the older version of the data set (which does not contain the economic parameters).
Clearly, this is a classification problem and we are interested in identifying the clients that are likely to be subscribers (the yes class).
In this post, we are going to focus on Data Exploration. The next post will discuss some basic classification techniques.
First, we looked at the data distribution, both overall and within the subscriber population. The response rate (% subscribers) is 11.70%, so, like most marketing data sets, this one is imbalanced.
Some other observations:
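As a minimal sketch of how the response rate can be computed with pandas: the helper name `response_rate` is ours, the target column `y` with values 'yes'/'no' follows the data set, and the toy rows are made up for illustration (they are not the real data).

```python
import pandas as pd


def response_rate(df: pd.DataFrame, target: str = "y") -> float:
    """Return the percentage of rows where the target equals 'yes'."""
    return (df[target] == "yes").mean() * 100


# Toy data for illustration only; on the real file you would load it first,
# e.g. pd.read_csv("bank-full.csv", sep=";") (file name assumed).
toy = pd.DataFrame(
    {"y": ["no", "no", "no", "yes", "no", "no", "no", "no", "yes", "no"]}
)

rate = response_rate(toy)
print(f"Response rate: {rate:.2f}%")

# Class counts make the imbalance visible at a glance.
print(toy["y"].value_counts())
```

On the full data set the same call should reproduce the 11.70% figure quoted above, provided the file matches the older version we use.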

- 52% of the clients who subscribed were married
- 99.02% of the clients who subscribed did not have any credit default
- 63% of subscribers did not have housing loans. 93% did not have a personal loan (Low risk investors)
- 82.61% of subscribers were contacted via cellular
- 71.33% of contacts made were between May and Aug. Within the subscribers, 52.68% of contacts were made between these months.
- 63.98% of subscribers were not contacted in past campaigns, so poutcome is unknown for these subscribers
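The percentages in the list above are shares within the subscriber population. A sketch of that calculation follows; the helper `share_within_subscribers` is a hypothetical name, the column name `marital` is assumed from the UCI file layout, and the toy rows are invented.

```python
import pandas as pd


def share_within_subscribers(
    df: pd.DataFrame, column: str, target: str = "y"
) -> pd.Series:
    """Percentage of each category of `column` among rows with target == 'yes'."""
    subscribers = df[df[target] == "yes"]
    return subscribers[column].value_counts(normalize=True) * 100


# Toy data for illustration only (not the real data set).
toy = pd.DataFrame(
    {
        "marital": ["married", "single", "married", "divorced", "married", "single"],
        "y": ["yes", "yes", "no", "yes", "yes", "no"],
    }
)

shares = share_within_subscribers(toy, "marital")
print(shares.round(2))
```

The same helper can be pointed at any categorical column (for example default, housing, loan, contact or poutcome) to get the other figures.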
The data is ordered chronologically from May 2008 to Nov 2010, but there is no year variable in the data set, so we added one. We can see that the conversion rate in 2010 was much higher than in previous years, although the number of calls decreased.
We also derived a day-of-the-week variable from the day, month and year variables and added it to the data set.
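One possible way to add the year is sketched below. It is an assumption on our part, not necessarily the exact procedure used here: because the rows are in chronological order, a drop in month number from one row to the next is treated as the start of a new year. The helper `add_year`, the column name `month` (lowercase three-letter abbreviations) and the toy rows are all illustrative.

```python
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_NUM = {name: i + 1 for i, name in enumerate(MONTHS)}


def add_year(df: pd.DataFrame, start_year: int = 2008) -> pd.DataFrame:
    """Add a 'year' column, assuming the rows are in chronological order."""
    month_num = df["month"].map(MONTH_NUM)
    # A decrease in month number between consecutive rows marks a new year.
    rollover = month_num.diff() < 0
    out = df.copy()
    out["year"] = start_year + rollover.cumsum()
    return out


# Toy data for illustration only (not the real data set).
toy = pd.DataFrame(
    {
        "month": ["may", "nov", "dec", "jan", "may", "dec", "feb", "nov"],
        "y": ["no", "no", "no", "no", "yes", "no", "yes", "yes"],
    }
)

with_year = add_year(toy)

# Number of calls and conversion rate (%) per year.
summary = (
    with_year.assign(subscribed=with_year["y"].eq("yes"))
    .groupby("year")["subscribed"]
    .agg(["size", "mean"])
)
summary["mean"] = summary["mean"] * 100
print(summary)
```

This only works if no row is out of order; it is worth checking the resulting year boundaries against the known May 2008 to Nov 2010 range.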
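A sketch of deriving the day of the week with pandas, assuming the columns are named `day`, `month` and `year` (the last being the variable we added) and that `month` holds lowercase three-letter abbreviations. The helper name `add_day_of_week` is hypothetical and the two rows are only examples.

```python
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_NUM = {name: i + 1 for i, name in enumerate(MONTHS)}


def add_day_of_week(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'day_of_week' column built from the day, month and year columns."""
    parts = pd.DataFrame(
        {
            "year": df["year"],
            "month": df["month"].map(MONTH_NUM),
            "day": df["day"],
        }
    )
    # pd.to_datetime assembles dates from year/month/day columns.
    dates = pd.to_datetime(parts)
    out = df.copy()
    out["day_of_week"] = dates.dt.day_name()
    return out


# Example rows for illustration only.
toy = pd.DataFrame(
    {"day": [5, 17], "month": ["may", "nov"], "year": [2008, 2010]}
)

with_dow = add_day_of_week(toy)
print(with_dow)
```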
Duration:
When duration = 0, y = no. There are 3 records where duration = 0.
The median duration of the last contact is higher for subscribers than for non-subscribers. However, keep in mind that duration is not known until the call is over.
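Both duration checks can be sketched in a few lines. The column names `duration` and `y` follow the data set; the values below are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only (not the real data set).
toy = pd.DataFrame(
    {
        "duration": [0, 60, 120, 180, 300, 600, 900],
        "y": ["no", "no", "no", "no", "yes", "yes", "yes"],
    }
)

# Records with zero duration, and whether all of them have y == 'no'.
zero_calls = toy[toy["duration"] == 0]
all_zero_are_no = bool((zero_calls["y"] == "no").all())
print(f"{len(zero_calls)} record(s) with duration = 0, all 'no': {all_zero_are_no}")

# Median duration of the last contact by outcome.
median_by_outcome = toy.groupby("y")["duration"].median()
print(median_by_outcome)
```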
Null values:
There are no null values but some categorical variables have a category called unknown. We will treat these as a valid category for now.
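A sketch of the two checks described above: counting true nulls and counting the 'unknown' label per column. The helper `unknown_counts` is a hypothetical name, the column names are assumed from the UCI file layout, and the toy rows are invented.

```python
import pandas as pd


def unknown_counts(df: pd.DataFrame, label: str = "unknown") -> pd.Series:
    """Count, per column, how many values equal the placeholder label."""
    counts = {
        col: int((df[col].astype(str) == label).sum()) for col in df.columns
    }
    return pd.Series(counts)


# Toy data for illustration only (not the real data set).
toy = pd.DataFrame(
    {
        "age": [30, 41, 52, 28],
        "job": ["admin.", "unknown", "technician", "unknown"],
        "education": ["primary", "secondary", "unknown", "tertiary"],
    }
)

null_total = int(toy.isna().sum().sum())
unknowns = unknown_counts(toy)

print(f"Null values: {null_total}")
print(unknowns)
```

Keeping 'unknown' as its own category, as we do for now, means no rows are dropped; the counts simply show how much each variable is affected.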
Reference:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing