Assignment 2 Developing a Recommender for MovieLens100K

Communicating the business value and potential utility of data insights and a recommender model. Comparing linear regression, tree, ensemble, clustering and SVD model outputs

Author
Joshua McCarthy
Published
Sun, May 24 2020
Last Updated
Sun, May 24 2020

Context

A different format and a different audience but the same communication goals remain.

Submit a management presentation on your approach. You should reduce and alter your report to align to this different audience. It is crucial to ensure that your presentation is appropriate for senior management who are largely non-technical in background.

The board are reasonably tech savvy and have listened to analyst presentations before. However, they have not undertaken any work on recommendation engines or machine learning.

Business Brief

You are working as a Data Scientist for an online streaming entertainment company. The content management executive team has hired you to look at their data and build a recommendation engine that is able to recommend movies to their users that they will enjoy. The board is very interested in getting the right content to their users as well as any insights you have about their consumers’ current watching habits. There has been very little deep analytical work done on this data so far.

As a team, we took a stepped approach to predicting a user's star rating of a movie, starting with feature engineering and linear models and progressing through tree and ensemble methods. Our best results were delivered by an SVDF (Funk SVD) model. After completing the report, each of us was required to produce an individual presentation based on our work and outcomes.

During the team activities I emphasised the importance of defining business outcomes and relating our work and discoveries back to them. We refined the initial brief into a challenge statement that sharpened our goals, then broke the challenge down into key components and more finite deliverables that addressed the company's needs. We identified the core (largest) user segment represented in the data, where effort could then be directed, and, importantly, where effort could be directed toward diversifying this audience and attracting new customers.

Business Opportunities

Refining value creation for our users

  • Identifying potential future content that appeals to user needs, creating value
  • Improving user experience and engagement by providing relevant and valuable content, leading to improvements in retention and reductions in churn
  • Increasing the value an item of content delivers to a customer
  • Increasing the range of content offered to customers without degrading any user's experience by serving them content that is not relevant to them
  • Supplying a greater range of content on the platform, encouraging engagement with underrepresented groups without sacrificing the experience of existing users
  • Utilising the recommendation engine to ensure content is served to interested users, so that existing users are not adversely affected by the increase in variety
  • Utilising the recommendation engine to identify content gaps, specifically for the core user base and targeted user groups

Redefining the method of value capture

  • Utilising useful recommendations to increase total viewing time, allowing the provision of additional advertisements
  • Increasing the perceived value of an individual piece of content to reduce the need to overprovision content for a user, reducing costs
  • Cross-promotion through the identification of content that resonates with core user base and the production of original content

Presentation

The presentation works to visualise this process and illustrate how this is achieved. The slide notes below each slide outline the details to be communicated verbally.

Slide 1, Challenge, How might we understand our customer’s engagement with our content to guide future content selection and refine our value proposition?

Challenge

The challenge was developed working with the content management team to understand the:

  • Goals of the project
  • Business objectives
  • User needs

The project looks to leverage data science tools to deepen our understanding of how customers engage with our content and how we can make that experience easier. The business outcomes are addressed by deploying these learnings and tools to improve the value proposition of the content delivery service.

Problem / Challenge

Slide 2, value proposition canvas, value creation linked to value capture

Value proposition

Starting with the customer side:

  • Jobs to do – Why is a customer turning to our platform?
  • Gains – What benefits are available to the customer?
  • Pains – What has a negative effect on the customer’s experience?

Business side:

  • Products & Services – What can we offer to enable customers to “complete” their job?
  • Gain Creators – How do we enable customers to see those benefits?
  • Pain Relievers – How do we mitigate these potential negatives?

Jobs, together with products and services, are the core business methods of value creation and value capture. The gain and pain sections are value adds that improve the user experience, increasing the value of the product or service; this is where the recommendation system is most effective.

Further slides will detail how the recommendation system addresses these opportunities

What do we know broadly about our users?

Slide 3, classic demographics, gender, age, top content and past performance

Solution / Opportunities

Slide 4, linking specific opportunities to modes of value capture

  • User Understanding – Through data collection, and enabling the use and/or sale of that data
  • Original Content – Product Placement – Using user understanding and predictions to create original content for the platform opens a channel to retail product placement spots
  • Content Diversification – Attract new users by identifying underserved demographics and user types and the content these groups are interested in, then deploying a strategy to engage them
  • Content Curation – Content diversification could have an adverse effect on the experience of existing users, as they would otherwise be exposed to increasing amounts of content outside of their interest
    • Using the recommendation system to curate enables a personalised experience for each user, not only reducing this risk but also improving a customer’s access to relevant content and reducing paralysis of choice
    • Provides an overall enhanced user experience

Slide 5, pulling out more granular user groups from core users, by occupation, then location, observing preferred genres

What can we learn about our users at a more granular level?

User Understanding – Through data collection, and enabling the use and/or sale of that data

  • Take the audience through the discovery journey
  • We have a demographic of male students from Minnesota who enjoy drama, action and comedy movies
  • What else can we learn digging through our data?
  • We must be careful how this data is used, as it can be quite specific
  • Data is a valuable asset; with customer permission it can be sold
  • A/B testing and user surveys can be conducted to improve the depth of understanding available in the data

Original Content – Product Placement – Using user understanding and predictions to create original content for the platform opens a channel to retail product placement spots

  • Now that we know what our users like, we can make content
  • What products might our Minnesota resident want to see in their content?
  • We can market this original content, together with information on the demographic likely to engage with it, to companies interested in investing
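
The drill-down described in the discovery journey above (gender, then occupation, then location, then preferred genres) can be sketched roughly as follows. This is an illustrative sketch rather than the team's analysis code: the function name `top_genres_by_segment`, the record layout and the toy data are all assumptions, and the `state` field is assumed to have been derived beforehand, since the MovieLens 100K user file records a zip code rather than a state.

```python
from collections import defaultdict


def top_genres_by_segment(users, ratings, movie_genres,
                          keys=("gender", "occupation", "state"),
                          min_rating=4, top_n=3):
    """Count well-rated genres per demographic segment.

    users:        {user_id: {"gender": ..., "occupation": ..., "state": ...}}
    ratings:      iterable of (user_id, movie_id, rating) tuples
    movie_genres: {movie_id: [genre, ...]}
    """
    counts = defaultdict(lambda: defaultdict(int))
    for user_id, movie_id, rating in ratings:
        if rating < min_rating:
            continue  # only count content the user rated highly
        segment = tuple(users[user_id][k] for k in keys)
        for genre in movie_genres[movie_id]:
            counts[segment][genre] += 1
    # Most frequent genres first; ties broken alphabetically for stable output
    return {
        segment: sorted(genre_counts, key=lambda g: (-genre_counts[g], g))[:top_n]
        for segment, genre_counts in counts.items()
    }


# Toy data, invented for illustration only
users = {
    1: {"gender": "M", "occupation": "student", "state": "MN"},
    2: {"gender": "F", "occupation": "administrator", "state": "OH"},
}
movie_genres = {10: ["Drama"], 11: ["Action", "Comedy"], 12: ["Romance"]}
ratings = [(1, 10, 5), (1, 11, 4), (1, 12, 2), (2, 12, 5), (2, 10, 3)]

for segment, genres in top_genres_by_segment(users, ratings, movie_genres).items():
    print(segment, genres)
```

Segments this specific can contain very few users, which is the same caution raised in the notes above about how specific this data can become.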

N.B. Males were already over-represented in the data, but that stat for executives is not great.

Content Diversification (attract new users) & Curation (increase retention)

Slide 6, connecting the right users to the right content based on system predictions

Content Diversification

  • An underrepresented group, female administration workers living in Ohio
  • These users enjoy romances
  • By increasing the quantity and quality of romances available we might be able to increase our market share within this user group
  • However, our Minnesota man does not enjoy romances; if we start increasing this content we may lose these users as they struggle to find content they enjoy

Content Curation

  • We can use recommendations to ensure that we connect the right people to the right content and avoid links that will have a negative effect on our users’ experience
  • Using the recommendation system to curate enables a personalised experience for each user, not only reducing this risk but also improving a customer’s access to relevant content and reducing paralysis of choice
  • Provides an overall enhanced user experience

Increase Engagement & Value Delivered

Slide 7, how understanding users can increase engagement and revenue, using recommender to prune content and reduce cost

Increased engagement (advertising)

  • Our Minnesota resident is back, and it’s pretty cold up there
  • He might want a warm jumper?
  • By understanding this about our user we can serve relevant advertisements
  • This not only improves our customers’ experience of advertisements that would otherwise likely be unwanted
  • It also enables us to offer targeted advertising as a service to advertising partners
  • Further, if we identify content likely to prolong a user’s engagement with our platform
  • We increase our opportunities to serve advertising without being intrusive
  • Further increasing revenue

Value delivered (cost reduction)

  • By identifying high value content (highly rated) and removing less valued content
  • We can reduce the data storage and bandwidth overheads without reducing the experience of our users
  • This will not only optimise our backend, but also start to identify levels within high value content
  • As the overall content rating improves, previously highly rated content that was potentially not quite as good as similarly rated content will be easier to identify as ratings spread out

Modelling Success

Slide 8, modelling outcomes, svdf model best at 0.91 star error, recommendations, likely four of five star rating, available for 92% users, model performance comparison using RMSE

  • The model ranked in the top 5 among competitor models and can be further optimised to improve accuracy
  • The Funk SVD model is a type of collaborative filtering model, a model that generates predictions based on the known ratings of similar users
  • The model is based on Singular Value Decomposition, meaning that it attempts to quantify and estimate the unknown (latent) factors that explain users’ relationships
  • This model was developed by Simon Funk for the Netflix recommendation challenge
  • A variety of models were tested and accuracy was fairly consistent. The challenge was overfitting: internal cross-validated test set error rates were as low as 0.7, but once deployed on the external test set, error rates increased to those seen here
  • Accuracy of the model: best RMSE 0.90944
  • Predictions of 4 and 5 stars
  • We have identified opportunities to recommend content to most of our users
  • Recommending content they will enjoy will improve the user experience on the platform, increasing engagement and likely resulting in improved word-of-mouth uptake
  • Distribution of recommendations: a lot of users with a few potential movies, a few users with a lot
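
For readers who want to see what sits behind the notes above, the following is a minimal sketch of a Funk-style matrix factorisation trained by stochastic gradient descent, together with the RMSE measure used to compare models. It is not the team's implementation: the function names, the hyperparameters (`n_factors`, `n_epochs`, `lr`, `reg`), the user and item bias terms (a common addition) and the toy ratings are all assumptions made for illustration.

```python
import math
import random


def train_funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit latent factors by SGD. ratings: list of (user, item, rating)."""
    rng = random.Random(seed)
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            # Gradient step on biases and factors with L2 regularisation
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q


def predict(model, user, item):
    """Predicted star rating, clipped to the 1 to 5 star range."""
    mu, bu, bi, p, q = model
    est = mu + bu.get(user, 0.0) + bi.get(item, 0.0)
    if user in p and item in q:
        est += sum(a * b for a, b in zip(p[user], q[item]))
    return min(5.0, max(1.0, est))


def rmse(model, ratings):
    """Root mean squared error, in stars, over (user, item, rating) tuples."""
    sq = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(sq) / len(sq))


# Toy ratings, invented for illustration only
toy_ratings = [
    ("ann", "m1", 5), ("ann", "m2", 4), ("bob", "m1", 4), ("bob", "m3", 2),
    ("cat", "m2", 5), ("cat", "m3", 1), ("dan", "m1", 5), ("dan", "m3", 2),
]
model = train_funk_svd(toy_ratings)
print("training RMSE:", rmse(model, toy_ratings))
print("predicted rating for ann on m3:", predict(model, "ann", "m3"))
```

The RMSE printed here is measured on the training ratings, so it will look optimistic. As the overfitting note above describes, the figure that matters is the error on ratings the model has not seen.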

Ethics, Legal, Privacy & Next Steps

Slide 9, privacy and ethics, storage, ethical usage, reidentification, next steps, individuals over demographics, views vs rating, future content, classification

Ethics, Legal, Privacy

  • Secure data storage and distribution practices are needed; distributing unencrypted, unprotected data to teams is high risk
  • The identified opportunities carry ethical risks
    • Targeted advertising can exploit users and cause unnecessary expenditure
    • Increased engagement can result in addiction and unhealthy content consumption habits
    • Data collection and user investigation can identify unintended traits; how this understanding is deployed also presents a risk
  • Anonymity is relative: if the data is widely distributed, there are a variety of methods that can be used to reidentify it, posing potential risks to our users, as exemplified by the reidentification of users from the Netflix challenge dataset

Next Steps

  • User similarity rather than demographics
    • Instead of targeting “males” we can start to move toward targeting groups of users with similar interests
    • Enables improved communication and marketing
  • High rating != most watched
    • We need to be wary of using only rating predictions: while these can predict whether a user will enjoy the content, they do not necessarily relate to how long or how often the user engages with the platform
    • Further investigation can be undertaken to build a similar recommendation system based on time engaged with content
    • Without this, we may inadvertently negatively affect overall user engagement
  • Future content selection model
    • Similarly, a recommendation system can be developed and deployed to review upcoming and recently released content and whether or not it will add value to the platform
      • This process will be more accurate if early reviewers’ ratings and their similarities to our user base can be identified
      • Otherwise data can be collected directly from review sources
      • Helps the content management team’s core function
  • Classification – What is a good movie to a user?
    • An alternative approach is to reformulate the target from a rating to whether a user will enjoy a piece of content
    • This can change how easily useful content can be identified and how tradeoffs are made with incorrect predictions; however, it does not provide the detail of a rating prediction
    • This model can be further improved by investigating what our customers identify as “good” content:
      • First, can we identify the cutoff in rating for good content amongst our users? Are they harsh or kind? We may be able to develop this specifically for each user through further research or user surveys
      • Is it purely star rating, or are other factors involved, such as length?
    • This furthers our ability to ensure a positive user experience on the platform and continued engagement
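
The classification idea above, including the question of whether a user is a harsh or kind rater, can be sketched as follows. This is one possible formulation and not something the team built: using each user's own mean rating as their cutoff, the fallback `global_cutoff` of 4 stars, the `min_ratings` threshold and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def user_cutoffs(ratings, global_cutoff=4.0, min_ratings=3):
    """Per-user "good content" cutoff from (user, item, rating) tuples.

    Assumption: a user's own mean rating is their cutoff, so harsh raters
    get a lower bar and kind raters a higher one. Users with fewer than
    min_ratings ratings fall back to the global cutoff.
    """
    by_user = {}
    for user, _item, rating in ratings:
        by_user.setdefault(user, []).append(rating)
    return {
        user: float(mean(rs)) if len(rs) >= min_ratings else global_cutoff
        for user, rs in by_user.items()
    }


def label_enjoyed(ratings, cutoffs, global_cutoff=4.0):
    """Binary target: True when a rating meets that user's cutoff."""
    return {
        (user, item): rating >= cutoffs.get(user, global_cutoff)
        for user, item, rating in ratings
    }


# Toy ratings, invented for illustration only
ratings = [
    ("harsh", "a", 1), ("harsh", "b", 2), ("harsh", "c", 3),
    ("kind", "a", 4), ("kind", "b", 5), ("kind", "c", 5),
    ("new", "a", 3),
]
cutoffs = user_cutoffs(ratings)
labels = label_enjoyed(ratings, cutoffs)
for key, enjoyed in labels.items():
    print(key, enjoyed)
```

Under this assumption a 3-star rating from the harsh rater counts as enjoyed while a 4-star rating from the kind rater does not, which is the kind of per-user difference a user survey could confirm or refute.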