In Deanta's 2021 Survey - Trends in Academic Publishing 2021*, more than 50% of academic publishers stated that data science would have a "growing" or "significant" role in their business in the coming year. We talked to Anne Badinier from Deanta's own data science team to find out why.
Data science seems to be making waves in every industry it hits. Can you explain what it is and why publishing may be the perfect partner?
Well, data science is all about analysing the data that we have, discovering its properties and using that knowledge to either create or improve services.
In publishing that data is text. We have a huge amount of it, but it's complex. If we can categorize it and understand it, we can do a lot with it. That task led to a branch of data science called NLP (Natural Language Processing), which looks to extract information from texts including semantic information, the context and meaning, which is a key challenge for computers. If machines can get this task right then they can begin to generate new texts (like lay summaries) and perform simple editorial tasks.
Can you give me an example of the data you are working on and how it helps us to make data science-based predictions for our customers/publishers?
I’ve been working on a project for Pro-Editor, the real-time editing tool in our Cloud-based platform Lanstad. I have been looking through all of our data for books and journals (like metadata and raw Word data) and using data science techniques to break down each block of text to identify and categorise every component part.
We are using 2-3 million paragraphs of data (about 10,000 chapters) to find these features and then define them using rules. For instance, by analysing the variety of features in a text based on where it is, what it sits next to (the paragraph before and after), and how it is styled, we can begin to define each component.
Once this model is accurate, we can use the rules we have created to identify these elements in any new text, ultimately helping us to make data science-based predictions.
But how can you categorise texts you don’t know – there must be hundreds of variations?
It’s true! But we build machine learning (ML) models to help us compute the features and then group the categories. The more data we have, the more the machine learns.
So, right now I am using an ML model called a random forest, which is a set of several decision trees. A decision tree is a sequence of tests (or questions) on the available features, e.g., does my sentence have more than 10 words (a yes-or-no answer)? The path from the root to a leaf represents the classification rules. When combined with answers on other features, we can begin to define a category of data.
If we were to cut the leaves of the tree high, then we would have fewer, less defined categories – whereas the lower down you cut the tree the more detailed categories we have until we have all the labels we are looking for. So far, we have identified around 30 different key components like title, sub-title, lists, table.
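As a rough illustration of the idea (not Deanta's actual model), a decision tree can be written as nested yes/no tests on paragraph features, and a forest as a majority vote over several trees. The feature names (`word_count`, `is_bold`, `starts_with_bullet`), the thresholds and the labels below are hypothetical, and the trees are written by hand here, whereas a real random forest learns them from training data.

```python
from collections import Counter

# Each "tree" is a sequence of yes/no tests on hypothetical paragraph features.
def tree_a(p):
    # Root test: is the paragraph short?
    if p["word_count"] <= 10:
        return "title" if p["is_bold"] else "list_item"
    return "body"

def tree_b(p):
    if p["starts_with_bullet"]:
        return "list_item"
    return "title" if p["is_bold"] and p["word_count"] <= 15 else "body"

def tree_c(p):
    if p["is_bold"]:
        return "title"
    return "list_item" if p["starts_with_bullet"] else "body"

FOREST = [tree_a, tree_b, tree_c]

def forest_predict(p):
    """Majority vote over the trees, as a random forest does at prediction time."""
    votes = Counter(tree(p) for tree in FOREST)
    return votes.most_common(1)[0][0]

heading = {"word_count": 6, "is_bold": True, "starts_with_bullet": False}
bullet = {"word_count": 8, "is_bold": False, "starts_with_bullet": True}
print(forest_predict(heading))
print(forest_predict(bullet))
```

Each path through a function, from the first test to a returned label, corresponds to one root-to-leaf classification rule.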
What’s the end benefit of this?
A simple example would be the automation of the layout and design of specific texts. When a machine can read a brand-new text (like a raw Word document) and correctly identify all of the elements, it can then arrange them and style them according to the rules of a pre-existing style sheet. So, in a matter of moments after just one click, a raw Word document can be ingested, an accurate InDesign file can be created and every element styled to look like the finished book.
And this style sheet would apply to the most detailed elements like references, citations, titles etc., which must be created and styled in a very specific and consistent way?
Yes. Taking references as an example, once the ML model can identify a reference, we go deeper into the decision tree analysis to identify the individual elements of that reference so that we can make that specific component consistent at a granular level. We want to identify groups of words (e.g., author name, title, publisher, publication date) inside a single sentence.
So, the macro benefit is greater accuracy and speed?
Yes, this task would have taken an artworker hours to do and most likely would have been full of errors and inconsistencies.
So, will this ML remove the need for a human element altogether?
Not yet, and likely not ever! We build a degree of uncertainty into the model so that anything that doesn’t fit neatly into an ML category is highlighted and the human user is asked to accept or edit the machine’s suggestion. The machine learns from these fringe examples and continues to refine the categorisation rules.
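One simple way to picture this human-in-the-loop step is a confidence threshold: predictions the model is unsure about are flagged for review. This is only a sketch under assumptions; the function name, the probability input and the 0.8 cut-off are hypothetical and not taken from Deanta's system.

```python
def route_prediction(label_probs, threshold=0.8):
    """Return (label, needs_review) for one paragraph.

    label_probs: dict mapping label -> model probability (hypothetical input).
    threshold: assumed cut-off below which a human is asked to confirm.
    """
    label, prob = max(label_probs.items(), key=lambda kv: kv[1])
    return label, prob < threshold

print(route_prediction({"title": 0.95, "body": 0.05}))
print(route_prediction({"reference": 0.55, "body": 0.45}))
```

The first paragraph would be accepted automatically, while the second would be shown to the user, whose correction could later be added to the training data.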
So far, we’ve focused on editorial benefits. How else is data science supporting the publishing eco-system?
Publishers themselves are using data science to help evaluate and curate their own content. Machine models are being fed a huge array of journals and books to read and are using "topic modelling" to find bodies of related work. Ultimately this will help editorial decisions – as editors can see the gaps that they might have in their own lists – but this also has the power to accelerate learning as we make connections across fields of study.
Yes, data scientists might use a combination of topic modelling and knowledge graphs to look at this problem.
Topic modelling is when texts are mined to find abstract topics and semantic structures, whereas knowledge graphs are used to build representations of inter-connecting topics and keywords. You might think of knowledge graphs as a cloud of words, with the main interest in the middle, then a layer with related interests (in publishing this might be other articles with citations of the main paper), then another outer layer with more inter-related references. Machines can use these models to interlink descriptions of entities – objects, events or even concepts.
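The "layers" picture of a knowledge graph can be sketched with a breadth-first search that groups items by how many links separate them from the central one. The graph below and its node names are made up for illustration; real knowledge graphs also carry typed relationships, which this sketch leaves out.

```python
from collections import deque

def layers(graph, centre):
    """Group nodes by number of hops from `centre` using breadth-first search."""
    dist = {centre: 0}
    queue = deque([centre])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    result = {}
    for node, d in dist.items():
        result.setdefault(d, []).append(node)
    return result

# Hypothetical citation links: paper -> papers connected to it.
graph = {
    "main_paper": ["article_a", "article_b"],
    "article_a": ["article_c"],
    "article_b": ["article_c", "article_d"],
}
print(layers(graph, "main_paper"))
```

Layer 0 is the main interest, layer 1 holds the directly related articles, and layer 2 holds the outer ring of references.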
What can we learn from other areas of publishing using data science?
Trade publishers have been using data science very effectively for a while now, particularly with customer data. Sentiment analysis tools take customer reviews of the books and extract the words to determine whether the review was positive, negative or neutral. This analysis can then be a building block for creating lay summaries, reviews and ratings, which retailers (like Amazon) share with customers to make new recommendations.
At Deanta we are planning on doing something similar to help support the marketing of journals. We extract the keywords from a journal abstract then auto-generate tweets using the title and keyword hashtags. This will be a really useful discoverability tool for the publisher's sales and marketing teams.
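A toy version of this keyword-to-tweet idea might look like the sketch below. The interview does not say how the keywords are extracted, so the simple word-frequency approach, the stop-word list, the function names and the 280-character limit are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real project would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "for", "on",
             "is", "we", "that", "with", "this", "are"}

def keywords(abstract, n=3):
    """Pick the n most frequent words that are not stop words or very short."""
    words = re.findall(r"[a-z]+", abstract.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [w for w, _ in counts.most_common(n)]

def make_tweet(title, abstract, n=3, limit=280):
    """Build 'title #kw1 #kw2 ...', dropping hashtags until it fits the limit."""
    tags = ["#" + k for k in keywords(abstract, n)]
    while tags and len(title) + 1 + len(" ".join(tags)) > limit:
        tags.pop()
    return " ".join([title] + tags)

abstract = ("Random forests classify paragraphs. Forests of decision trees vote "
            "on paragraphs, and forests improve as paragraphs accumulate. "
            "Paragraphs vary.")
print(make_tweet("Classifying paragraphs", abstract, n=2))
```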
Can you use any data to get the result you need?
For publishing, we have XML for most data science projects which really helps us understand the structure of the text. If a project depends on us understanding only the content and the context of the text (like sentiment analysis or topic modelling) then any text format is good enough.
So, we’re using machine learning not creating algorithms?
Yes, all the machine learning models you might need have already been created. No one really creates their own now, as there is a lot of statistical theory behind each model. So, the task of a data scientist can be broken down into three key steps:
1. Choose which model works best for your data;
2. Pre-process the data in a way that will help the model predict accurate results;
3. Then fine-tune that model by changing and adding the parameters.
It’s this fascinating recipe of data, model and parameters which you need to research and experiment with to get good results. It's never a straight path to a right or wrong answer. We have to consider different combinations of parameter values, and we might research 100 values for the first parameter, then a hundred more for another one, and then combine the results to see which produce the most accurate (and consistent) results.
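Trying combinations of parameter values is often done with a grid search, sketched below. This is a generic illustration and not necessarily the procedure used at Deanta; the parameter names `n_trees` and `max_depth` and the scoring function are hypothetical stand-ins for "train the model and measure its accuracy".

```python
from itertools import product

def grid_search(score, grid):
    """Try every combination of parameter values and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Stand-in for training and evaluating a model; purely illustrative.
def toy_score(params):
    return -(params["n_trees"] - 50) ** 2 - (params["max_depth"] - 8) ** 2

grid = {"n_trees": [10, 50, 100], "max_depth": [4, 8, 16]}
print(grid_search(toy_score, grid))
```

With 100 candidate values for each of two parameters, the full grid would contain 10,000 combinations, which is why this stage involves so much experimentation.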
So, you’re working with pre-existing models?
Absolutely, we are not re-writing the methodology each time. But you need to understand the statistical background and theory of models before you apply them. That way you know which model to choose because you understand what the results will mean for your specific data.
So, data science really is the new name for statistics?
Yes, but data science covers areas way beyond statistics, like deep learning, which uses neural networks.
That said, neural network theory can be a bit of a black hole! There is lots of research done on neural networks where the models are mathematically validated, but many are beyond human interpretation - the results just work, as if by magic, and we don’t really understand why. So, this branch of data science goes beyond statistics, and we cannot easily explain it as yet.
You left your PhD to join Deanta I understand. What was your PhD focused on?
I was working on large graphs and trying to solve NP-hard problems like “the travelling salesman problem”.
The travelling salesman problem?
Yes! It’s a data science puzzle about a salesperson who is given a list of cities and the distances between each pair of cities. The puzzle is to work out the shortest possible route that enables the salesperson to visit each city once and return to the origin city. This is classified as an NP-hard problem, which means that no known method can find an exact solution in a “reasonable” time given a large amount of data. Industries use approximate solutions when encountering such problems. For this particular problem, the exact solution is found by enumerating all possible permutations and taking the one with the lowest distance, but that’s time-consuming and expensive, as the number of variations can be huge, and if you add another city you have to do the calculation all over again.
So, I was using machine learning to try and find exact solutions to this problem in new ways.
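The brute-force approach described above, enumerating every permutation and keeping the shortest, can be sketched in a few lines. The distance matrix is invented for illustration, and this is the textbook exhaustive search rather than the machine learning approach from the PhD work.

```python
from itertools import permutations

def shortest_tour(dist):
    """Exact TSP by brute force: dist[i][j] is the distance from city i to city j."""
    n = len(dist)
    best_len, best_tour = float("inf"), None
    # Fix city 0 as the origin and try every ordering of the remaining cities.
    for order in permutations(range(1, n)):
        tour = (0,) + order + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

# Hypothetical symmetric distances between four cities.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(shortest_tour(dist))
```

With n cities the loop runs (n - 1)! times, so four cities need only 6 orderings but twenty cities would need more than 10^17, which is why exact answers become impractical so quickly.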
HBR said data science was the sexiest job in the 21st century! Would you agree with that?
Well, I love the research part; you have to do a lot of U-turns and experimentation. There are three main stages in data modelling: data pre-processing, data modelling, and post-processing (where you analyse the results). Usually, people like the modelling part and fine-tuning the parameters, but to me, the most important stage is the pre-processing, because you want to have data that is clear and clean enough, otherwise your model will never work.
The goal, obviously, is to have the highest accuracy so you have to do the research to find the right models – that’s what I really like about it.
For further reading, check out our carousel, 5 things you didn't know about publishing and data science.
To find out how Deanta are using data science to support our customers, get in touch today at firstname.lastname@example.org
*Soon to be published (expected March 2021)