How Publishers Can Take Advantage of Machine Learning (Cloud Next '18)

My name is Navid Ahmad. I am a senior director for data engineering and machine learning at Hearst Newspapers. My responsibilities are to build the data pipeline and data warehouse, and to use that data to build data products and do predictive data science. You might not have heard of Hearst, but you might have heard of names like the San Francisco Chronicle, SFGate, or the Houston Chronicle, or magazine names like Esquire, Cosmopolitan, or Men's Health. The company behind these is Hearst, one of the biggest media companies in the USA.

I'm here to talk about the work we've done at Hearst Newspapers. Hearst Newspapers employs about 4,000 people across the country. It focuses on local news, has 40-plus websites, and is continuously growing. You can find more information about Hearst at this link, and these are some of the brands I just mentioned. Here in California you might be most familiar with the San Francisco Chronicle.

This is our data strategy at Hearst Newspapers. It might vaguely look like Maslow's hierarchy of needs. Data is the essential thing: without data you can't do data science or business intelligence. So our strategy was to first build a data warehouse, and I'll take a few minutes to talk about what we did with BigQuery. On top of that sits business intelligence, which helps people in marketing, editorial, and product make decisions based on historical data. Then, using the same data you use to look at history, you can do predictive analysis and also build products.

Let me talk a little about BigQuery. Why centralize data? If you have all the data in one place, you can connect the dots. If you have datasets for newsletters, Google Analytics, and your content database together, you can easily make connections; if they were sitting in different databases, that becomes difficult. It is also efficient for users to go to one
place to get that data. There's no duplication: you're not getting reports from several different places with numbers that don't match up. It is efficient for the data warehousing team too, because you use the same project management, product, engineering, and QA and consolidate your efforts. And most importantly for this talk, it's the foundation for ML.

Why did we use BigQuery? Before this, people used to pull reports from different systems and email them out, and there were different databases and different data warehouses. A couple of years ago we decided we needed to consolidate everything into one database, and we chose BigQuery. These are some of the reasons why. It's based on Dremel technology, and, as it says on their website, it can query terabytes in seconds and petabytes in minutes. It's fully managed. It uses SQL, which lots of people know, so it is easy to get people onto BigQuery. And now, with yesterday's release of BigQuery ML, you can even do machine learning on top of BigQuery.

This is the technology stack for our BI platform: ETL using Airflow, BigQuery as the data warehouse, and Looker. I think there have been other talks about how people have used Looker with BigQuery to build user journeys and so on. We're also doing that; it could be a separate talk, so I'm just mentioning it. These are some of the data sources we have in BigQuery: Google Analytics, DoubleClick, subscriptions, newsletters. The number of datasets in our data warehouse is continuously growing.

Now, getting to machine learning. I see machine learning in two big buckets. One is prepackaged machine learning, where Google has done the data science work for you and exposed APIs: natural language processing, AutoML (which was also released at this conference), speech, video intelligence, and vision. The other half is where you have your own data scientists and you're
doing machine learning in-house using TensorFlow, Cloud ML, BigQuery ML, or Dataproc and Spark.

There was a case study published back in November on how Hearst Newspapers is using natural language processing, covering the different use cases. The two pieces of the Natural Language API that we've used are extracting entities and extracting categories from our text. There's also a part-of-speech API that we haven't used. With these two features alone we have about six use cases.

I took an article from the San Francisco Chronicle about the movie Black Panther and ran it through Google's web interface. You can see that it has correctly detected Black Panther as a work of art; it has a Wikipedia link, a knowledge ID, and a salience score. The entities are broken down into different types (work of art, organization, location, and so on), and then further into proper and common nouns. It also detected the right category for the article and gave a confidence score.

This is all good, and it looks very simple, but how do we actually use it? I'm going to go through some use cases for Google NLP, from the simple case of displaying these tags in our CMS to building a recommendation system and matching ads to the right content. I'll go through them one at a time.

This is the high-level architecture of how everything connects. When content is ingested into our CMS, we make these NLP calls, get the metadata back, and store it in our database. When the article is rendered, these tags are part of the meta tags of the HTML. From there, a third-party customer data platform (CDP) extracts the NLP tags into its system. At the same time, there's a JavaScript call to DoubleClick for Publishers (DFP) to render ads. This data is also pushed into
BigQuery: our CMS content, along with the NLP tags, is pushed into BigQuery, on top of which we can do BI reporting and also build a recommendation system.

First use case: segmentation. Segmentation means being able to identify a group of users, say which users are sports readers versus food and wine readers. Our CDP has a built-in mechanism to use these tags to create segments. These segments are users to whom you can push marketing messaging. For example, if you know there's a segment of people who like to read tennis news, you can push a newsletter to them using this tool.

The other use case is DFP ad targeting. I mentioned the call we make to DFP. In its key-value pairs we can pass a key for the NLP category and, as the value, the actual category detected for that page. DFP collects all this data, and after a month or so you can run reports on how ads were displayed across our content. Then if a customer says, "I want a campaign targeting Olympics web pages; put my ad on Olympics pages," we can create a campaign with those criteria, and that partner's ads will be displayed only on Olympics content. This is just a screenshot of how the rules are set in DFP.

This is a BI report. It uses Google Analytics, WCM content, and the NLP data to come up with some useful numbers. For each category, it shows the number of users who visited and the number of articles in that category. A simple ratio of the two is informative: a higher ratio suggests that editorial should focus on this category and write more content about it. There are numerous other things you can do if you have Google Analytics, NLP, and the content database together. For example, you can create trend graphs for a certain topic or a certain personality: are they getting more famous
or less famous over the years, content-wise? This is another analysis: we get content from third parties, so which third parties give us what type of content? Does this source give us more sports content than that one? Again, you can build different kinds of reports using this.

Now I'll get into recommendation systems. Why are recommendation systems important? You might have read that Netflix values its recommendation system at one billion dollars. In our context, if we could reuse older content that is just sitting there with nobody looking at it, that could increase engagement. People would stay on the website longer, which means more ad revenue, and eventually they might even subscribe. Right now people tend to go to the home page, and whatever links are there are the content they see. If we can use a recommendation system to surface older content, that would be very useful.

These are three different types of recommendation system that are supported by Google Cloud. We did a content-to-content recommendation system, a personalized recommendation system, and a video-to-content recommendation system.

Since we had the NLP data sitting with the content data in BigQuery, it was very low-hanging fruit to build a recommendation system. The core concept is that any two articles with highly overlapping NLP entities are related to each other. It is essentially one big SQL query with certain rules and conditions that we created, and it runs on a periodic basis, something like every twenty minutes, as new content comes in.

Let me show the diagram for this. We run that SQL in BigQuery and store the results in Cloud SQL, fronted by a Kubernetes web service layer that serves the recommendations. On the front end there's a JavaScript call to render them. In that API call you just pass in the content ID; we find all the related content that's already precomputed in Cloud SQL, a Postgres database, and render it.

Continuing on this concept: this next one was actually a hack day idea. We had a hack day and I did this prototype to see whether we could extract anything useful from video. We took our videos, extracted the sound using ffmpeg, an open-source tool, and used that sound to make a call to the Speech API, which gives you back the text for that video. Then you can run NLP on that text, combined with the metadata for the video, and you get NLP tags back. So now, just like with our text content, when we ingest video content we store the transcript and the NLP tags along with the metadata.

Let me go back. Since we have these NLP tags, this powers another recommendation system, one that recommends videos for text. Our text already has NLP tags, and we've extracted NLP tags from video, so we can build an in-house recommendation system. This saved our company some money: instead of buying a video-to-text recommender, we built our own using NLP and search technology. On this project we worked in collaboration with our TV department.

The next one uses TensorFlow, so let me talk a little about what TensorFlow is. You might have already heard that neural networks are back; the current deep learning revolution is because of deep neural networks, which are bigger and hierarchical. Many Google products, especially in AI, are based on TensorFlow, and Cloud ML is a managed version of TensorFlow.

We built an in-house personalized recommendation system. How the algorithm works and its full architecture could fill a whole talk, but in summary it uses singular value decomposition, which is basically a collaborative filtering algorithm. That algorithm is something you can fit into TensorFlow because you can solve it using
gradient descent, and the TensorFlow library helps you solve algorithms that can be expressed as a gradient descent problem. It basically looks at the user's history and also the history of people who have similar taste. There's an open-source implementation that Google released, which uses Google Analytics and CMS content to build recommendations, and I'd encourage you to take a look; it is a bit similar to what we did.

The high-level architecture is this. We read our content and Google Analytics data from BigQuery (you see, there's another advantage of all the data sitting in BigQuery), do some preprocessing, and then run the TensorFlow model. The output of TensorFlow, the trained model, gets stored, and it's fronted by TensorFlow Serving. TensorFlow Serving is an RPC layer that helps you deliver recommendations from this model, and it in turn is fronted by a RESTful web service layer.

Some of the other use cases for TensorFlow are propensity modeling, forecasting, content virality prediction, and custom content classification. I also had churn modeling on that list; that's a use case I built in BigQuery ML, and I'll be talking about it.

Google now offers you a variety of ways of doing machine learning, so you have to figure out when it makes sense to use TensorFlow versus BigQuery ML versus AutoML versus all these other APIs.

I gave a talk about BigQuery ML yesterday. So why BigQuery ML for Hearst Newspapers? As I already told you, all our data is already sitting in BigQuery, so it made a lot of sense to do this using BQML. It enables anyone familiar with SQL to get on board and start doing machine learning. The alternative would be to first learn R or Python and then a framework like scikit-learn or TensorFlow. With BQML you don't need to learn any of those. You don't need to ETL data out, do the machine
learning outside, and then ETL the results back in; everything gets done in place. Goodies like normalization and one-hot encoding are just done for you. And there is additional SQL syntax to evaluate your machine learning model right in BigQuery.

Churn prediction is a relevant use case for media because, as you might have heard, people have choices and it's hard to keep subscribers. If we could have some insight into which subscribers are going to cancel, we could try to save them. So I put up a proverb: money saved is money made, or rather, subscribers saved is money made. I thought of two more yesterday: prevention is better than cure, and one in the hand is better than two in the bush. This just proves that I passed my English test.

The model gives you insight into future cancellations by subscribers. It's two-class, so it's a binary logistic regression: people who cancelled and people who didn't. We took one year of our subscription, newsletter, demographic, and web browsing data. All of that data, again, is sitting in BigQuery.

The architecture is really simple. It's really two nodes, BigQuery and Looker, but I've put in a few steps showing what happens. First, we're ETLing in all our datasets using Airflow. Second, we do a little preprocessing of our Google Analytics data, such as making summary tables of how subscribers are browsing the site. The third step is the real machine learning step: you just do CREATE MODEL with the model name and give it a table with all the feature columns plus one label column. The label in our case is whether this person cancelled or not. When you run this query it takes about four or five minutes, so you go and grab a cup of tea or coffee. When you're back, step four is to run SELECT * FROM ML.PREDICT on a table of existing, current subscribers, and for those
subscribers it gives you a probability score of how likely they are to cancel. Then we built a dashboard with a bunch of Looks in it, which show the predictions and also our machine learning metrics, such as ROC, precision, and recall.

This is a snapshot of the dashboard. The first Look shows all the subscribers sorted by how likely they are to cancel their subscription. Somebody on the subscription retention team can take a look at it and at least have an idea of what the model is predicting, and do some forecasting. This number on the right side is the AUC score, the area under the curve; it's essentially the area under this other graph. It's a data science-y thing, but what it means is that if this AUC score were 50%, our machine learning model wouldn't have learned anything, and this line, which is a plot of false positive rate versus true positive rate, would just be a diagonal line. What this shows is that our data has predictive power and that BQML is learning something. On the right we're plotting the same graph, the true positive rate against the false positive rate.

Another thing you have to figure out is the probability threshold above which you say that a person is a churner. Do you say 50% and above, or 30% and above? The graph at the bottom right shows the true positive rate and the false positive rate by threshold, and we want a threshold that gives us a decent true positive rate and a low false positive rate. We chose about 0.3, which gives us about an 18% false positive rate and an 80% true positive rate. I feel that is fine, because even if you send emails to people you think are going to churn and they don't end up churning, it doesn't hurt; you just sent some extra emails.

You can also get the weights of your learned model to get an idea of what it's learning.
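
The threshold trade-off described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the dashboard: the function name `rates_at_threshold` is hypothetical, and the scores and labels are made-up sample values rather than Hearst data.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) when every
    subscriber with score >= threshold is flagged as a likely churner."""
    tp = fp = fn = tn = 0
    for score, churned in zip(scores, labels):
        flagged = score >= threshold
        if flagged and churned:
            tp += 1  # flagged, and did cancel
        elif flagged and not churned:
            fp += 1  # flagged, but stayed (an "extra email")
        elif not flagged and churned:
            fn += 1  # missed churner
        else:
            tn += 1  # correctly left alone
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Made-up predicted churn probabilities and actual outcomes (1 = cancelled).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

for threshold in (0.3, 0.5):
    tpr, fpr = rates_at_threshold(scores, labels, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Lowering the threshold catches more of the real churners at the cost of flagging more subscribers who would have stayed, which is the trade-off the talk accepts when the only cost of a false positive is an extra email.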
This is a plot of the weights. Features on the left are positively correlated with churn, while features on the right are negatively correlated with churn. It gives us a sense of which features to look at: there might be some clue in our historical data that lets us make business decisions, or change how we do business, to reduce our churn. There are two types of features: numerical features and categorical features. For the categorical ones you have to do an UNNEST in SQL to get at their weights. You can look at the BQML tutorials for examples of other predictive modeling use cases.

AutoML for text was released at this conference. While it was in alpha, I was working with the product manager and built a model for ourselves. We have two main use cases. Following on from the discussion about DFP and matching ads, we want to enhance some of our categories, for example taking the sensitive-subjects category and breaking it down into more granular ones, or, if there's a new sport that's not covered by the default categories, training our own. That's one use case.

The other use case, which I worked on here, is detecting evergreen content. Evergreen content is content that has a longer lifespan. For example, a review of a restaurant or a museum we'd call evergreen, but a story about an accident that happened at a mall only has a life of a few days or a few weeks. So you want to be able to differentiate content that's evergreen from content that's not.

Initially I tried an open dataset from StumbleUpon on Kaggle. Very early on I wrote my own TensorFlow code using an LSTM; I quickly prototyped it and felt that the data had predictive power. But I had a hunch that Google was working on something like this. So then we created our own dataset of labeled evergreen content. It
was internally crowdsourced through our CMS: editors tagged some of our content as evergreen. Using AutoML is really easy. You create a CSV file with the content as the first column and the label, evergreen or not evergreen, as the second column, then upload it to AutoML and have it learn from this data.

You can see we had about 3,000 articles for evergreen and another 3,000 for non-evergreen, and it has done a pretty good job of learning this. I was surprised that it could do so well. You can see the precision and recall scores, and it also shows the confusion matrix of what it predicted: 91% of the time it predicted non-evergreen content as non-evergreen, and only 9% of the time did it make a mistake on non-evergreen content. For evergreen it did a perfect job. And this is only on the test dataset.

So I took this and played around. In the AutoML console you can put in any text and see how your prediction model does. I tried it on our Hearst articles and on CNN articles, and it seemed to work very well. Then I even went to Wikipedia. I picked an article about New York, which you'd think is evergreen, and it correctly predicted, at 70%, that it is evergreen. Then, again from Wikipedia, I picked an article about the Mexican elections, and it predicted non-evergreen. So basically AutoML has learned the combined knowledge of our editors about what they think is evergreen and non-evergreen, and is able to apply it to text it hasn't seen before.

Ideally we want to use BigQuery for most of our analytic processing. Anything that can be formulated as SQL I tend to want to do in BigQuery, because it's distributed and more cost-effective. But there are a very few use cases where you have to do things outside of BigQuery. One of them is that we have duplicate content,
content where either the body looks very similar or the headline is tweaked, and essentially it's the same article. We don't want to recommend articles that are similar to each other in one list of recommendations. What we wanted to do initially, in BigQuery, was create a word vector for each article, do a cross join of the table with itself, and compute the distance between each pair of articles. Doing this in BigQuery was not possible because the query wouldn't return. So for this use case we used Spark: we basically spun up a cluster of 10 machines for a couple of hours and wrote a PySpark job to compute these vectors and the cosine distance between articles. Articles that were very close to each other we eliminated as duplicates. This helps our recommendation system remove duplicate content, and it is a use case for Spark and Dataproc in our data work.

I'm getting close to the end; this is actually my last slide. So what's the future for Hearst Newspapers? We want to build more predictive models using BigQuery ML, because it makes a lot of sense: since we have all the data there, it's very easy to use BigQuery ML. One example is propensity modeling, which is like the reverse of churn modeling: figuring out which visitors to our web pages are likely to subscribe. This could help our marketing systems focus on those people. In fact, yesterday's BQML demo was about identifying people who are likely to become future customers.

We want to productionize the taxonomy work I was talking about. We already have requirements for AutoML: we want to create our own datasets and train models to enhance the NLP taxonomy.

There's been lots of research on using deep neural networks for recommendation systems. There's actually one article that lists all the different research papers on this, and we've
been prototyping a few different approaches, such as hybrid approaches. There are all sorts of flavors of recommendation systems using deep learning. That research is continuing, and hopefully our version two will be an even more advanced recommendation system.

There's still more juice to be squeezed out of Google NLP that we haven't fully utilized. Some of the use cases are building personalized newsletters or topic pages using Google NLP, so we're thinking about these other use cases and how to productionize them. We also have a large corpus of images, but we haven't yet reached the point of building a product around it. So that's one of the things I have planned for the future.
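
As a rough illustration of the duplicate-detection idea described earlier in the talk (a word vector per article, then cosine distance between pairs), here is a minimal single-machine Python sketch. It is not the PySpark job from the talk: the helper names (`word_vector`, `cosine_similarity`, `find_duplicates`), the plain word-count vectors, the 0.9 threshold, and the sample headlines are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def word_vector(text: str) -> Counter:
    """Bag-of-words vector: lowercase word -> count (an assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(
    articles: Dict[str, str], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Return ID pairs whose similarity is at or above the threshold.

    This compares every pair, which is the same quadratic self-join that
    made the original query impractical at full scale.
    """
    vectors = {article_id: word_vector(text) for article_id, text in articles.items()}
    return [
        (first, second)
        for first, second in combinations(sorted(vectors), 2)
        if cosine_similarity(vectors[first], vectors[second]) >= threshold
    ]


# Made-up headlines: a2 is a lightly tweaked copy of a1.
articles = {
    "a1": "city council approves new downtown transit plan",
    "a2": "city council approves the new downtown transit plan",
    "a3": "local bakery wins national pastry award",
}
print(find_duplicates(articles))
```

Because the pairwise comparison grows with the square of the number of articles, a sketch like this only works for small batches; distributing the work across a cluster, as the talk describes, is one way to handle a full archive.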
