
Text and Data Mining: Uncovering Hidden Data Points and Powering New Discoveries

By: Guest contributor, Tue Aug 9 2022

The path to innovation requires the systematic analysis of millions of documents. But completing this process manually takes considerable time and effort. Text and data mining (TDM) enables researchers to speed up and enhance this work, allowing them to make new discoveries faster. In this blog, we look at how TDM works, what it means for librarians, and what 哔哩传媒 is doing to enable it.

The digital age has given us unprecedented access to information. Researchers can now obtain far more research into their subject areas than ever before. On the one hand, this is incredibly exciting, providing opportunities to make new discoveries by building on the incredible wealth of existing research. But on the other hand, it presents an overwhelming challenge: trying to analyse findings from the millions of academic articles published every year.

Even within niche subject areas, the sheer volume of papers, pre-prints, and data published is far too great for an individual researcher to stay abreast of. Yet, within this wealth of research could lie the answers to some of our biggest societal challenges. So how can researchers best use the information available to them?

While there are many options, one of the most promising areas being explored to make new discoveries and identify important patterns is text and data mining (TDM). TDM was the subject of a recent webinar presented by 哔哩传媒's Director for Data Solutions, Dr. Prathik Roy. Dr. Roy described in detail how TDM is being used in the research community and what 哔哩传媒 is doing to support it.

What is TDM?

First, it's worth taking a minute to explain exactly what TDM is. In short, it's an automated process of selecting and analysing large amounts of text or data resources for purposes such as searching, finding patterns, discovering relationships, semantic analysis and more. This is done in a way that can provide valuable information needed for studies and further research.

The goal of TDM is to filter through information, identify pieces of data, and find the relationships and patterns among them. What is revolutionary is the ability of researchers to explore a dataset without knowing what specific questions to ask. Essentially, AI is now maturing from a role where it simply surfaces information to one where it can make recommendations and decisions, as well as generate content.

"Essentially what tends to happen is that these machine learning or AI algorithms go through the full text of articles and are able to classify the various aspects of each article," explained Dr. Roy. "For instance, it will ask questions like, is it talking about a gene? Is it talking about a specific disease? Or is it talking about specific symptoms? And then it's able to cluster the articles based on this."

Once the algorithm has categorised articles in this way, it can then score the relationship between two types of categories. For instance, it could be used to assess the relationship between symptoms and a specific disease, by analysing how often that symptom is mentioned in relation to a disease. A high score, where there is a clear correlation between mentions of the symptom and mentions of the disease, could help identify the best drug to treat that disease. And this is just one example. TDM has a variety of uses across all fields.
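The scoring step described above can be illustrated with a minimal sketch. It is not the method used in any particular TDM product: the term lists, the sample sentences and the choice of pointwise mutual information (PMI) as the score are all assumptions made for illustration, and a real pipeline would rely on trained entity recognisers rather than fixed word lists.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical term lists, for illustration only.
SYMPTOMS = {"cough", "fever", "fatigue"}
DISEASES = {"influenza", "asthma"}

def tag_document(text):
    """Return the sets of symptom and disease terms mentioned in one document."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return words & SYMPTOMS, words & DISEASES

def cooccurrence_scores(documents):
    """Score each (symptom, disease) pair seen together, using PMI over documents."""
    n = len(documents)
    symptom_counts, disease_counts, pair_counts = Counter(), Counter(), Counter()
    for text in documents:
        symptoms, diseases = tag_document(text)
        symptom_counts.update(symptoms)
        disease_counts.update(diseases)
        pair_counts.update(product(symptoms, diseases))
    # PMI = log2( P(s, d) / (P(s) * P(d)) ), with probabilities estimated
    # as document frequencies divided by the number of documents.
    return {
        (s, d): math.log2(joint * n / (symptom_counts[s] * disease_counts[d]))
        for (s, d), joint in pair_counts.items()
    }

# Made-up sample sentences standing in for article full text.
docs = [
    "Patients with influenza presented with fever and cough.",
    "Fever was the most common sign of influenza in this cohort.",
    "Asthma is often associated with a chronic cough.",
    "Fatigue was reported by patients with asthma.",
]
scores = cooccurrence_scores(docs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A positive score means the two terms appear together more often than their individual frequencies would suggest, and a score near zero means no more often than chance. On a corpus this small the numbers are not meaningful; the point is only the shape of the calculation.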

Discoverability and pattern discernment

While TDM has a whole range of use cases, two of the most important right now are 'discoverability' and 'pattern discernment', as Dr. Roy described during the webinar.

The ultimate goal of discoverability, according to Dr. Roy, is to "match what you're looking for and then eliminate any irrelevant material from this discovery process." It should mean that when you're searching for particular keywords or phrases, only highly relevant articles are delivered back in that search.

For example, say you were searching for articles that showed a link between carcinogens from tobacco and a specific type of cancer such as lung cancer. A 'traditional' search could deliver you any number of articles that mention carcinogens, tobacco and/or lung cancer. Using TDM techniques, however, you could retrieve only those where specific carcinogens have an effect on the lungs.
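The difference between the two kinds of search can be sketched in a few lines. This is a toy illustration under stated assumptions: the function names, the sample documents and the "same sentence" rule are hypothetical stand-ins, and real TDM systems would use language models or relation extraction rather than substring matching.

```python
import re

def keyword_search(documents, terms):
    """'Traditional' search: keep any document that mentions at least one term."""
    return [d for d in documents if any(t in d.lower() for t in terms)]

def relation_search(documents, agent_terms, target_terms):
    """Keep only documents where an agent term and a target term share a sentence."""
    hits = []
    for d in documents:
        # Naive sentence split on ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", d.lower())
        for s in sentences:
            if any(a in s for a in agent_terms) and any(t in s for t in target_terms):
                hits.append(d)
                break
    return hits

# Made-up sample documents.
docs = [
    "Carcinogens in tobacco smoke were linked to lung cancer in this cohort.",
    "Carcinogens are regulated in many countries. Lung cancer screening was expanded last year.",
    "Tobacco sales fell over the decade.",
]

broad = keyword_search(docs, ["carcinogen", "tobacco", "lung cancer"])
narrow = relation_search(docs, ["carcinogen"], ["lung"])
print(len(broad), "keyword hits;", len(narrow), "relation hits")
```

The keyword search returns every document, while the relation filter keeps only the one where a carcinogen and the lungs are discussed in the same sentence.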

The goal of pattern discernment, meanwhile, is to find patterns and trends across a dataset. The outcome of this will be hypotheses and predictions of likely prospects for therapy, material design, or strategy, as opposed to articles. For example, this technique could be used to match the biochemical properties of molecules to a viral protein's properties in order to identify a molecule likely to bind to the virus.

There are already many examples of where TDM can make (or has already made) a significant impact in speeding up research discoveries and making the previously impossible possible. Just a few were touched on in Dr. Roy's presentation.

There's no doubt that there is huge potential for the future of TDM and what it could do to power new and innovative research.

What does this mean for librarians and information professionals?

As information professionals, knowledge workers and librarians, you have a long familiarity with managing and searching within large sets of information. You are likely responsible for evaluating and managing subscriptions to value-added online services, identifying and acquiring specialized datasets for researchers, and managing internal resources and collections and making them discoverable.

This knowledge means you can bring a unique perspective to TDM projects. After all, you understand how information is used within your organizations, and you know how to make that information more discoverable and hence more valuable.

The value of TDM depends on knowing what sources to include, what kinds of connections to monitor and what types of metadata are necessary for a particular project. Again, librarians and info professionals bring the ability to ask the right questions, which enables them to see the larger context and identify the specific sets of information that would provide the richest insights.

For lots more insight on this topic, take a look at our whitepaper on TDM for librarians and information professionals.

哔哩传媒's TDM tools

As the volume of scientific publications increases and TDM software tools improve, 哔哩传媒 has created a formalized process to enable TDM, with the aim of making it as simple as possible for researchers.

A growing number of 哔哩传媒鈥檚 journal articles are published open access. TDM is usually allowed without restrictions on these publications since the majority of 哔哩传媒 open access content is licensed under CC-BY.

Dr. Roy concluded his webinar presentation by giving an overview of the various tools 哔哩传媒 has created to facilitate TDM of our content. The key ones you need to be aware of are:

  • Metadata API: Metadata and abstracts for online documents (journal articles, book chapters, protocols, etc.)
  • Meta API: New versioned metadata for online documents with additional fields and links to source content.
  • Fulltext API for Open Access content: Fulltext content (where available) for 哔哩传媒 Open Access XML
  • Fulltext API for Open Access and pay-walled content (under license): Fulltext content (where available) for all 哔哩传媒 XML
  • Journal header data API: "journal-level" API that provides XML based on the Journal ID
  • Citations API
  • SN SciGraph APIs: Linked Data API (using SciGraph URLs) or Redirect API.

You can access all the APIs mentioned above on. 哔哩传媒 is also participating in the and we recommend Crossref services for pan-publisher TDM.

Helpful resources

Interested in finding out more about text and data mining? Here are some useful links:

And don't forget, you can also watch the webinar with Dr. Roy and download the presentation slides.

Guest Contributors include 哔哩传媒 staff and authors, industry experts, society partners, and many others. If you are interested in being a Guest Contributor, please contact us via email: thesource@springernature.com.