Text and Data Mining: Uncovering Hidden Data Points and Powering New Discoveries

By: Guest contributor, Tue Aug 9 2022

The path to innovation requires the systematic analysis of millions of documents. But completing this process manually takes considerable time and effort. Text and data mining (TDM) enables researchers to speed up and enhance this work, allowing them to make new discoveries faster. In this blog, we look at how TDM works, what it means for librarians, and what Springer Nature is doing to enable it.

The digital age has given us unprecedented access to information. Researchers can now obtain far more research in their subject areas than ever before. On the one hand, this is exciting: it creates opportunities to make new discoveries by building on a vast body of existing research. On the other hand, it presents an overwhelming challenge: analysing the findings from the millions of academic articles published every year.

Even within niche subject areas, the sheer volume of papers, pre-prints, and data published is far too great for an individual researcher to stay abreast of. Yet, within this wealth of research could lie the answers to some of our biggest societal challenges. So how can researchers best use the information available to them?

While there are many options, one of the most promising areas being explored to make new discoveries and identify important patterns is text and data mining (TDM). TDM was the subject of a recent webinar presented by Springer Nature’s Director for Data Solutions, Dr. Prathik Roy. Dr. Roy described in detail how TDM is being used in the research community and what Springer Nature is doing to support it.

What is TDM?

First, it’s worth taking a minute to explain exactly what TDM is. In short, it’s an automated process of selecting and analysing large amounts of text or data resources for purposes such as searching, finding patterns, discovering relationships, semantic analysis and more. This is done in a way that can provide valuable information needed for studies and further research.

The goal of TDM is to filter through information, identify pieces of data, and find the relationships and patterns among them. What is revolutionary is the ability of researchers to explore a dataset without knowing what specific questions to ask. Essentially, AI is now maturing from a role where it simply surfaces information to one where it can make recommendations and decisions, as well as generate content.

“Essentially what tends to happen is that these machine learning or AI algorithms go through the full text of articles and are able to classify the various aspects of each article,” explained Dr. Roy. “For instance, it will ask questions like, is it talking about a gene? Is it talking about a specific disease? Or is it talking about specific symptoms? And then it’s able to cluster the articles based on this.”

Once the algorithm has categorised articles in this way, it can score the relationship between two categories. For instance, it could assess the relationship between a symptom and a specific disease by analysing how often the symptom is mentioned in relation to that disease. A high score, where there is a clear correlation between mentions of the symptom and mentions of the disease, could help identify the best drug to treat that disease. And this is just one example: TDM has a variety of uses across all fields.
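To make the scoring idea concrete, here is a minimal Python sketch. It is not the method Dr. Roy described. It assumes a tiny invented corpus, hand-picked term lists, and a simple Jaccard overlap as the score, where a real TDM pipeline would use trained entity recognition and more robust statistics.

```python
from itertools import product

# Toy corpus: invented sentences standing in for article full text (assumption).
ARTICLES = [
    "Patients with influenza commonly present with fever and cough.",
    "Fever was reported in most influenza cases.",
    "Migraine is often accompanied by nausea.",
    "Cough and fever were frequent in the influenza cohort.",
    "Nausea was recorded in a minority of migraine patients.",
]

# Hand-picked category vocabularies (assumption; real systems learn these).
SYMPTOMS = ["fever", "cough", "nausea"]
DISEASES = ["influenza", "migraine"]


def mentions(term, text):
    """Return True if the term appears as a whole word in the text."""
    words = {w.strip(".,;").lower() for w in text.split()}
    return term in words


def cooccurrence_scores(articles, symptoms, diseases):
    """Score each (symptom, disease) pair by the Jaccard overlap of the
    sets of articles that mention each term."""
    scores = {}
    for symptom, disease in product(symptoms, diseases):
        with_symptom = {i for i, a in enumerate(articles) if mentions(symptom, a)}
        with_disease = {i for i, a in enumerate(articles) if mentions(disease, a)}
        union = with_symptom | with_disease
        overlap = with_symptom & with_disease
        scores[(symptom, disease)] = len(overlap) / len(union) if union else 0.0
    return scores


scores = cooccurrence_scores(ARTICLES, SYMPTOMS, DISEASES)
# Print pairs from strongest to weakest association.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In this toy data, fever and influenza always appear together and so receive the highest possible score, while pairs that never share an article score zero.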

Discoverability and pattern discernment

While TDM has a whole range of use cases, two of the most important right now are ‘discoverability’ and ‘pattern discernment’, as Dr. Roy described during the webinar.

The ultimate goal of discoverability, according to Dr. Roy, is to “match what you're looking for and then eliminate any irrelevant material from this discovery process.” It should mean that when you’re searching for particular keywords or phrases, only highly relevant articles are delivered back in that search.

For example, say you were searching for articles that showed a link between carcinogens from tobacco and a specific type of cancer such as lung cancer. A ‘traditional’ search could deliver you any number of articles that mention carcinogens, tobacco and/or lung cancer. Using TDM techniques, however, you could retrieve only those where specific carcinogens have an effect on the lungs.
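The difference between the two kinds of search can be sketched in a few lines of Python. This is an illustration only: the documents, the `relations` annotations and both helper functions are hypothetical, and the sketch assumes an earlier TDM step has already extracted (subject, relation, object) triples from each article.

```python
# Invented documents; "relations" holds triples that an earlier,
# hypothetical extraction step is assumed to have produced.
DOCS = [
    {
        "id": "a1",
        "text": "Tobacco smoke contains carcinogens such as benzo[a]pyrene "
                "that damage lung tissue.",
        "relations": [("benzo[a]pyrene", "affects", "lung")],
    },
    {
        "id": "a2",
        "text": "Tobacco farming economics and lung cancer awareness campaigns.",
        "relations": [],
    },
    {
        "id": "a3",
        "text": "Carcinogens in processed meat and colorectal cancer risk.",
        "relations": [("nitrosamines", "affects", "colon")],
    },
]


def keyword_search(docs, keywords):
    """'Traditional' search: ids of documents mentioning any keyword."""
    return [
        d["id"]
        for d in docs
        if any(k.lower() in d["text"].lower() for k in keywords)
    ]


def relation_search(docs, relation, target):
    """Relation-aware search: ids of documents whose extracted triples
    include the given relation pointing at the target."""
    return [
        d["id"]
        for d in docs
        if any(r == relation and o == target for _, r, o in d["relations"])
    ]


keyword_hits = keyword_search(DOCS, ["carcinogens", "tobacco", "lung cancer"])
relation_hits = relation_search(DOCS, "affects", "lung")
print(keyword_hits)
print(relation_hits)
```

The keyword search returns every document, including the ones about farming economics and colorectal cancer, whereas the relation-aware search returns only the article in which a carcinogen is recorded as affecting the lungs.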

The goal of pattern discernment, meanwhile, is to find patterns and trends across a dataset. The outcome of this will be hypotheses and predictions of likely prospects for therapy, material design, or strategy, as opposed to articles. For example, this technique could be used to match the biochemical properties of molecules to a viral protein's properties in order to identify a molecule likely to bind to the virus.

There are already many examples of where TDM can make (or has already made) a significant impact in speeding up research discoveries and making the previously impossible possible. Dr. Roy’s presentation touched on just a few of them.

There’s no doubt that there is huge potential for the future of TDM and what it could do to power new and innovative research.

What does this mean for librarians and information professionals?

As information professionals, knowledge workers and librarians, you have a long familiarity with managing and searching within large sets of information. You are likely responsible for evaluating and managing subscriptions to value-added online services, identifying and acquiring specialized datasets for researchers, and managing internal resources and collections and making them discoverable.

This knowledge means you can bring a unique perspective to TDM projects. After all, you understand how information is used within your organizations, and you know how to make that information more discoverable and hence more valuable.

The value of TDM depends on knowing what sources to include, what kinds of connections to monitor and what types of metadata are necessary for a particular project. Again, librarians and info professionals bring the ability to ask the right questions, which enables them to see the larger context and identify the specific sets of information that would provide the richest insights.

For lots more insight on this topic, take a look at our whitepaper on TDM for librarians and information professionals.

Springer Nature’s TDM tools

As the volume of scientific publications increases and TDM software tools improve, Springer Nature has created a formalized process to enable TDM, with the aim of making it as simple as possible for researchers.

A growing number of Springer Nature’s journal articles are published open access. TDM is usually allowed without restrictions on these publications since the majority of Springer Nature open access content is licensed under CC-BY.

Dr. Roy concluded his webinar presentation by giving an overview of the various tools Springer Nature has created to facilitate TDM of our content. The key ones you need to be aware of are:

Metadata API: Metadata and abstracts for online documents (journal articles, book chapters, protocols, etc.)
Meta API: New versioned metadata for online documents with additional fields and links to source content.
Fulltext API for Open Access content: Fulltext content (where available) for Springer Nature Open Access XML
Fulltext API for Open Access and pay-walled content (under license): Fulltext content (where available) for all Springer Nature XML
Journal header data API: "journal-level" API that provides XML based on the Journal ID
Citations API
SN SciGraph APIs: Linked Data API (using SciGraph URLs) or Redirect API.

You can access all the APIs mentioned above on. Springer Nature is also participating in the and we recommend Crossref services for pan-publisher TDM.
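As a rough illustration of how a request to a metadata-style API might be assembled, the Python sketch below builds a query URL without sending it. Everything specific here is an assumption: the base URL is a placeholder, and the parameter names (`q`, `s`, `p`, `api_key`) and the `build_query_url` helper are hypothetical. Check the official API documentation for the real endpoint, parameters and query syntax before relying on any of it.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholder endpoint (assumption): substitute the real base URL
# from the API documentation.
BASE_URL = "https://api.example.org/metadata/json"


def build_query_url(query, api_key, start=1, page_size=10):
    """Assemble a search URL with a URL-encoded query string.

    The parameter names below are hypothetical and may differ from
    those of any real API.
    """
    params = {
        "q": query,          # search expression
        "s": start,          # index of the first result to return
        "p": page_size,      # number of results per page
        "api_key": api_key,  # key obtained at sign-up
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url('keyword:"text mining"', api_key="YOUR_KEY")
print(url)

# Decode the query string again to confirm nothing was lost in encoding.
decoded = parse_qs(urlparse(url).query)
print(decoded["q"][0])
```

Building the URL separately from sending the request makes it easy to inspect and log queries before running them at scale, which matters when a mining job issues many thousands of calls.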

Helpful resources

Interested in finding out more about text and data mining? Here are some useful links:

Text and Data Mining at Springer Nature
Bringing Insight to Data: Info Pros’ Role in Text and Data Mining
(all our API offerings with key information, examples and API key sign-up)
Can AI help us manage information overload?
AI and science publishing: “cutting through the clutter has never been more important”

And don’t forget, you can also watch the webinar with Dr. Roy and download the presentation slides.

Author: Guest contributor

Guest Contributors include Springer Nature staff and authors, industry experts, society partners, and many others. If you are interested in being a Guest Contributor, please contact us via email: thesource@springernature.com.