◆ In the following descriptions, the letter following each "Task" (e.g., Task A) corresponds to the codes (A–Q) listed under "Outsourcing task(s)" in "Contact Us."
In both profit and non-profit activities, across all disciplines, project success hinges on whether one can generate more novel, unique, and competitive outcomes, faster and at lower cost, than competing efforts. To achieve this, it is essential to continuously collect and analyze comparable information from other projects with similar aims or scopes throughout the entire project lifecycle, from planning to completion. Based on such comparative insights, one must evaluate the validity of one's own project implementation and, when necessary, promptly and flexibly revise its direction.
Using A2K technology, comprehensive English texts related to user-specified focus terms (e.g., "extraterrestrial organic compounds," "BRCA1 tumor suppressor gene") are retrieved and automatically analyzed for fundamental sentence structures (such as SVO, SVC). Sentences containing target terms are then precisely and rapidly summarized into concise lists.
Because A2K processes large volumes of English text (from online content or digital files) in bulk, it enables exhaustive and efficient collection and aggregation of relevant information.
A2K outputs are formatted as “A2K Descriptions”, which consist of three core components: Subject, Action, and Process.
For instance, if an astrophysicist is investigating the terms “Hayabusa,” “Ryugu,” and “organic,” numerous documents may be retrieved by A2K analysis. One such document (link) spans *********** pages containing *********** words. A2K technology condenses this entire document into approximately 150 A2K Descriptions, significantly reducing the time and effort required for researchers compared to traditional manual reading.
Below is an example A2K Description derived from the following sentence:
“Hayabusa2 spacecraft collected surface regolith particles from Ryugu in two separate touchdown events, which were stored in collection chambers A and C of the spacecraft’s sample catcher.”
This yields the following A2K Description:
Subject: "Hayabusa2 spacecraft" »
Action: "collected" »
Process: "surface regolith particles from Ryugu in two separate touchdown events"
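The three-part format above can be pictured as a small record type. The following is a minimal sketch only: the class name `A2KDescription`, its field names, and the `render` method are illustrative assumptions, not part of the A2K software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class A2KDescription:
    """Hypothetical container for one Subject / Action / Process triple."""
    subject: str
    action: str
    process: str

    def render(self) -> str:
        # Mirror the 'Subject » Action » Process' layout used in the examples.
        return (f'Subject: "{self.subject}" » '
                f'Action: "{self.action}" » '
                f'Process: "{self.process}"')


example = A2KDescription(
    subject="Hayabusa2 spacecraft",
    action="collected",
    process="surface regolith particles from Ryugu in two separate touchdown events",
)
print(example.render())
```

Keeping the three components as separate fields, rather than as one string, is what makes later steps such as filtering by target term or deduplication straightforward.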
Since A2K/LA2K technology relies on computational processing, it cannot interpret texts with the same precision as human experts.
For example, a document (here) contains the following sentence:
“However, all CI chondrite samples show evidence of extensive aqueous alteration on their parent asteroid(s)10,11, and although the presence of extra-terrestrial organic molecules has been demonstrated in these meteorites12–14, the question of how much of this alteration may be due to terrestrial contamination and weathering has not been resolved15–17.”
When applying A2K analysis to this sentence, the following A2K Description is obtained:
Subject: "the presence of extra-terrestrial" »
Action: "has been demonstrated" »
Process: "in these meteorites12–14"
In this output, the Subject "the presence of extra-terrestrial" omits the key phrase “organic molecules”.
Furthermore, the Process includes the trailing string “12–14”, which refers to citation numbers in the original text and is unnecessary for summarization purposes.
Currently, the computational tool “A2K technology” has difficulty determining whether a numeric sequence is a citation reference, part of a gene name, or some other type of named entity.
Such structural parsing errors are expected to be resolved in the future through improvements in training data and A2K engine enhancements.
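The citation-number problem described above can be illustrated with a simple post-processing heuristic. This is a sketch under stated assumptions, not the A2K engine's method: the function `strip_citation_numbers`, the rule "four lowercase letters followed directly by digits", and the `protected` allowlist are all hypothetical. It also shows why the problem is hard: without the allowlist, the same rule would damage a name such as "Hayabusa2".

```python
import re

# Assumed heuristic: digits (optionally a range or list such as 12–14 or 10,11)
# glued directly onto at least four lowercase letters are citation markers.
_TRAILING_CITATION = re.compile(r"(?<=[a-z]{4})\d+(?:[–\-,]\d+)*(?=\W|$)")


def strip_citation_numbers(text, protected=frozenset()):
    """Remove citation-like digit runs, leaving allowlisted tokens untouched."""
    cleaned = []
    for token in text.split(" "):
        if token.strip('.,;:()"') in protected:
            cleaned.append(token)  # known entity name: keep as-is
        else:
            cleaned.append(_TRAILING_CITATION.sub("", token))
    return " ".join(cleaned)


print(strip_citation_numbers("in these meteorites12–14"))
print(strip_citation_numbers("Hayabusa2 samples", protected={"Hayabusa2"}))
```

Uppercase identifiers such as "BRCA1" pass through unchanged because the rule requires lowercase letters before the digits, but lowercase entity names ending in digits still need the allowlist or expert review.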
Errors contained in the output list of A2K Descriptions can be corrected and refined through manual curation by domain experts (curators).
WGI’s skilled curators are proficient not only in manual review but also in applying efficient computational linguistic processing on Linux environments to handle massive output lists both quickly and accurately.
Therefore, WGI's manual curation service can deduplicate redundant descriptions and correct errors in the output more efficiently and precisely than manual effort alone, enabling delivery of high-quality analytical results.
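As a rough illustration of the deduplication step mentioned above, the sketch below removes descriptions that differ only in letter case or spacing. The function name `deduplicate` and the normalization rule are assumptions made for this example; the actual curation workflow is not specified here.

```python
def deduplicate(descriptions):
    """Keep the first occurrence of each (subject, action, process) triple,
    treating differences in case and whitespace as insignificant."""
    seen = set()
    unique = []
    for subject, action, process in descriptions:
        key = tuple(" ".join(part.lower().split())
                    for part in (subject, action, process))
        if key not in seen:
            seen.add(key)
            unique.append((subject, action, process))
    return unique


raw = [
    ("Hayabusa2 spacecraft", "collected", "surface regolith particles from Ryugu"),
    ("hayabusa2  spacecraft", "collected", "surface regolith particles from Ryugu"),
    ("RASSF1", "regulates", "cell cycle progression"),
]
curated = deduplicate(raw)
print(len(raw), "->", len(curated))
```

Exact-match deduplication of this kind only catches trivial repeats; near-duplicates with different wording still require a curator's judgment.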
In general, the process of gathering information relevant to individual projects involves collecting texts and documents through web searches, followed by manual literature reviews conducted by project members such as principal investigators (PIs), researchers, or technical staff.
With advancements in computing and analytical techniques, Natural Language Processing (NLP) technologies have become available, enabling computers to interpret textual data. Although many NLP tools can efficiently process large volumes of documents, they often suffer from limitations in both accuracy and output format.
For example, most current NLP tools are unable to extract concise summaries related to the target terms a user is interested in. In most cases, these tools merely return entire sentences containing the target term, making it difficult for users to quickly identify and understand the information relevant to those terms.
To overcome these limitations, WGI has developed an AI-driven text mining technology (A2K), specifically designed for the high-efficiency and high-precision extraction of summarized information on target terms—such as gene functions—from individual sentences, in a format referred to as the A2K Description.
The table below presents a comparison of manual review with A2K and LA2K technologies.
A2K technology analyzes the structure of English text and efficiently extracts basic English structures (SVO, SVC, etc.) that contain the terms of interest.
This makes it possible to quickly summarize only the information you need from vast amounts of text information.
Electronic files and website text, from which text is easy to extract, can be used as the analysis target, and multiple pieces of text information can be processed at once.
The output results of A2K are written in a format called "A2K Description" and consist of three elements: Subject, Action, and Process.
For example, if you focus on terms such as "hayabusa," "ryugu," and "organic," A2K automatically collects text containing these terms, performs structural analysis, and then summarizes and outputs the relevant sentences in the A2K Description format. The following is an example of an analysis from a document (here):
“Hayabusa2 spacecraft collected surface regolith particles from Ryugu in two separate touchdown events, which were stored in collection chambers A and C of the spacecraft’s sample catcher.”
For this statement, we get the following A2K Description:
Subject: "Hayabusa2 spacecraft" » Action: "collected" » Process: "surface regolith particles from Ryugu in two separate touchdown events"
Literature research is an extremely effective means of acquiring knowledge (findings) about research subjects such as genes, traits, and phenotypes, and is routinely performed by researchers. However, currently, the amount of information available in journals and papers is enormous, and the papers cover a wide range of fields of expertise. Reading and understanding all of these accessible papers is difficult due to the limited time and manpower available for literature research. WGI is developing and advancing A2K technology (see here for details), as well as developing and implementing AI technology that automatically identifies gene names, making it possible to comprehensively extract knowledge (findings) related to biological phenomena described in papers.
WGI is developing AI technology that automatically recognizes gene names that appear in text.
WGI is integrating the different names (synonyms) that researchers have given to the same gene and training its AI on them.
By integrating synonyms, WGI makes it possible to comprehensively search and collect information using synonyms, even when searching gene databases.
In addition, by using AI to learn synonyms, WGI is developing technology that can determine whether English words and phrases in a text are gene names.
This makes it possible to comprehensively collect information on genes described in academic literature.
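The idea of integrating synonyms can be sketched as a lookup from every known alias to one canonical gene symbol. This is a simplified dictionary lookup, not the AI-based recognition described above; the names `SYNONYM_TABLE`, `build_index`, and `canonical_gene` are hypothetical, and the alias entries are illustrative and should be checked against a curated gene database.

```python
# Illustrative alias table: canonical symbol -> other names for the same gene.
SYNONYM_TABLE = {
    "BRCA1": {"RNF53", "BRCC1"},
}


def build_index(synonym_table):
    """Map every name (canonical or alias, upper-cased) to the canonical symbol."""
    index = {}
    for canonical, aliases in synonym_table.items():
        for name in {canonical, *aliases}:
            index[name.upper()] = canonical
    return index


def canonical_gene(token, index):
    """Return the canonical symbol for a token, or None if it is not a known gene name."""
    return index.get(token.strip(".,;:()").upper())


index = build_index(SYNONYM_TABLE)
print(canonical_gene("RNF53", index))
print(canonical_gene("expression", index))
```

A fixed table like this only recognizes names it already contains, which is why a learned model is needed to judge unseen words and phrases.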
WGI has developed the LA2K (Life Science A2K) platform, which implements gene name identification technology in the A2K technology. By simply inputting the biological phenomenon that a researcher is studying (keywords and phrases such as genes, metabolism, traits, and environmental responses) into the LA2K platform, the platform automatically searches and collects academic literature, reads the papers, and extracts knowledge information (findings) about the genes involved in the biological phenomenon of interest and their biological functions.
The descriptions of the genes are summarized and output in the A2K Description format. For example, by inputting search phrases such as "Breast Cancer" and "Gene Expression Profiling" into the LA2K platform, functional information about breast cancer-related genes is output, such as: Subject: "RASSF1" » Action: "regulates" » Process: "cell cycle progression".
It is also possible to accumulate knowledge information that includes research target terms (e.g., chemical compounds, traits, environmental responses) in the Subject or Process of A2K Description.
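Selecting descriptions whose Subject or Process mentions a research target term could look like the sketch below. The helper `mentions` and the tuple layout are assumptions for illustration; plain substring matching is used here in place of whatever matching LA2K actually performs.

```python
def mentions(description, term):
    """True if the term occurs in the Subject or Process (case-insensitive)."""
    subject, _action, process = description
    needle = term.lower()
    return needle in subject.lower() or needle in process.lower()


descriptions = [
    ("RASSF1", "regulates", "cell cycle progression"),
    ("Hayabusa2 spacecraft", "collected",
     "surface regolith particles from Ryugu in two separate touchdown events"),
]

ryugu_hits = [d for d in descriptions if mentions(d, "Ryugu")]
print(ryugu_hits)
```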
There is no limit to the number of terms that can be searched, and a series of steps from literature search to knowledge information accumulation can be carried out seamlessly and in a short time.
By combining the results of LA2K analysis with information on transcription factors and cis-factors, gene expression information, and intra- and inter-species homologs (gene families) (described below), we can maximize the information base for elucidating the molecular mechanisms of various life activities and accelerate the achievement of our research goals.
Because the LA2K platform is based on AI analysis, it does not guarantee the same level of accuracy as when experts conduct literature research and collect gene function information. However, because it can also investigate papers in fields outside of a researcher's expertise that are difficult for researchers to understand, and because it can process a huge amount of literature in a short period of time, it will be an extremely powerful tool in many research settings.
LA2K results for genes, compounds, traits, etc. may contain errors introduced by computer processing, such as redundant A2K Descriptions or incorrect gene names. WGI therefore offers a manual curation service by experts with extensive curation experience to verify, edit, and improve the quality of the raw A2K Descriptions output by the computer.
WGI's unique analysis method for RNA-seq big data enables highly accurate gene discovery.
For RNA-seq data accumulated online, describing the RNA sampling conditions (experimental conditions) is left to the data submitter, so the quality and quantity of these descriptions vary widely.
In addition, even if the experimental conditions are the same or similar, different submitters often describe the experimental conditions in different terms (e.g., cold treatment, low temperature, etc.), which makes it difficult to search the database for experimental data or to compare downloaded data.
Furthermore, even for the same biological species, the experimental materials (experimental strains and varieties) used by researchers are not the same.
Given the differences in traits and environmental response phenotypes between experimental strains and varieties, it is easy to infer that gene expression control and function differ between materials.
Many studies now use RNA-seq data downloaded from databases to search for differentially expressed genes (DEGs), that is, genes whose expression levels differ significantly between experimental conditions. However, the majority of these studies ignore differences between experimental strains and varieties, which raises the risk of extracting DEGs incorrectly.
WGI collects online RNA-seq data and assigns an ontology term to each run based on its experimental conditions. At the same time, we curate metadata for RNA-seq big data, recording the experimental materials and RNA extraction conditions (treatment, growth stage, etc.).
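The terminology problem described above (different submitters writing "cold treatment" or "low temperature" for similar conditions) can be illustrated by mapping free-text descriptions onto one controlled label. This is a minimal sketch: the vocabulary `CONDITION_TERMS`, the label "cold stress", and the function `normalize_condition` are hypothetical and do not represent WGI's ontology.

```python
# Hypothetical controlled vocabulary: label -> phrases submitters might use.
CONDITION_TERMS = {
    "cold stress": ("cold treatment", "low temperature"),
}


def normalize_condition(description, vocabulary=CONDITION_TERMS):
    """Return the controlled label whose phrase appears in the description, else None."""
    text = description.lower()
    for label, phrases in vocabulary.items():
        if any(phrase in text for phrase in phrases):
            return label
    return None


print(normalize_condition("Seedlings after cold treatment for 24 h"))
print(normalize_condition("Leaves sampled under Low Temperature"))
```

Once both runs carry the same label, they can be found by a single query and compared directly.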
Once the RNA-seq runs to be analyzed have been selected, a gene expression matrix can be obtained, in which each row represents a gene, each column represents a run, and each element represents a gene expression level.
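The matrix layout described above (genes as rows, runs as columns) can be sketched as follows. The function `build_expression_matrix`, the run identifiers, the gene names, and the expression values are all invented for illustration, and filling missing measurements with 0.0 is an assumption of this example.

```python
def build_expression_matrix(records):
    """Turn (gene, run, expression) records into a gene-by-run matrix.

    Returns (genes, runs, matrix) where matrix[i][j] is the expression of
    genes[i] in runs[j]; missing combinations are filled with 0.0.
    """
    genes = sorted({gene for gene, _run, _value in records})
    runs = sorted({run for _gene, run, _value in records})
    lookup = {(gene, run): value for gene, run, value in records}
    matrix = [[lookup.get((gene, run), 0.0) for run in runs] for gene in genes]
    return genes, runs, matrix


# Toy records with made-up identifiers and values.
records = [
    ("geneA", "run1", 12.5), ("geneA", "run2", 3.0),
    ("geneB", "run1", 0.8), ("geneB", "run2", 7.1),
]
genes, runs, matrix = build_expression_matrix(records)
print(genes, runs, matrix)
```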
Gene groups with similar profiles can be inferred to be genes that are involved in the same biological process (flowering promotion, stress response, etc.) or genes whose expression is controlled by the same transcription factor.
Gene groups with opposing profiles can be inferred to be genes in a negative feedback relationship, or, if they are enzyme genes, genes in a relationship where a trade-off in expression levels (enzyme activity) occurs as metabolic pathways branch.
Traditionally, correlation analysis methods have been used to infer gene groups with similar or opposing profiles, but many genes have outliers in their expression profiles, making accurate identification impossible using correlation coefficient values, and inferred gene groups contain many false positives.
WGI uses cutting-edge data science methods to accurately extract gene groups with similar or opposing expression profiles from RNA-seq data, and by constructing gene networks to provide an overview of these, we provide an information infrastructure that will facilitate the inference of gene functions.
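The outlier problem with correlation coefficients can be seen in a toy example. In the sketch below, two invented expression profiles move in opposite directions in five runs but share one extreme value in a sixth run. The Pearson coefficient is pulled strongly positive by that single run, while a rank-based (Spearman) coefficient is not. Spearman correlation is used here only as a familiar, generic contrast; it is not presented as WGI's method, which the text does not specify. The simple `ranks` helper assumes there are no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented profiles: opposite trends in runs 1-5, a shared outlier in run 6.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 50.0]
gene_b = [5.0, 4.0, 3.0, 2.0, 1.0, 50.0]
print(round(pearson(gene_a, gene_b), 3), round(spearman(gene_a, gene_b), 3))
```

A threshold on the Pearson value alone would report these two genes as having similar profiles, which is the kind of false positive described above.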
Homologous genes often have similar gene expression control and biological functions, which provides useful information for gene discovery. Traditionally, gene families have been constructed by sequence homology searches, but the commonality of protein functional domains has been largely ignored. WGI has constructed an index that represents domain conservation by combining protein functional domain combinations and domain location information, and has carried out large-scale family classification of protein-coding genes within and between species based on these indices.
Cross-referencing knowledge information on the expression control mechanisms, expression profiles, and biological functions of genes belonging to the same family accelerates gene function estimation and gene discovery. Network analysis, which allows for efficient handling of vast amounts of information, is an extremely effective means of utilizing this information. WGI constructs a network that integrates all omics information and knowledge information, and provides network information in a format that can be edited with general-purpose free software.
In addition to collecting information on transcription factors and cis-factors using LA2K technology and manual curation, WGI is also developing AI technology to identify transcription factors and cis-factors (its accuracy depends on the species). Using this information on gene expression regulation makes it easier, for example, to design gRNA sequences that efficiently create knockout strains by genome editing.
WGI has developed an analysis pipeline for gRNA design with no or very few off-targets. Furthermore, when there are multiple genes to be modified, it is possible to design gRNAs that induce mutations simultaneously and have no (or few) off-targets.
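To make the notion of an off-target concrete, the sketch below counts sites in a sequence that match a guide with at most a given number of mismatches, excluding the intended site. It is a deliberately simplified illustration and not WGI's pipeline: it ignores PAM requirements, the reverse strand, and insertions or deletions, and the function names, the toy guide, and the toy sequence are all invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_off_targets(guide, sequence, on_target_start, max_mismatches):
    """Count windows (other than the intended site) within max_mismatches of the guide."""
    size = len(guide)
    hits = 0
    for start in range(len(sequence) - size + 1):
        if start == on_target_start:
            continue  # skip the intended target site
        if hamming(guide, sequence[start:start + size]) <= max_mismatches:
            hits += 1
    return hits


# Toy data: the intended site starts at index 4; a one-mismatch copy follows later.
guide = "GATTACAGGC"
sequence = "TTTT" + guide + "CCCC" + "GATTACAGGA" + "TTTT"
print(count_off_targets(guide, sequence, on_target_start=4, max_mismatches=1))
```

A guide would be preferred when this count is zero, or as low as possible, at the mismatch tolerance chosen for the experiment.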
We provide all kinds of consultations and analysis related to text mining and bioinformatics at low cost.
Even if you have no specialized knowledge of text mining or bioinformatics and do not know where to start or which analysis method to use, we can help solve your problem.
We support all your requests and research promotion related to text mining and bioinformatics, including providing ideas and skills regarding analysis policies when planning and promoting research, one-off information analysis, building databases and knowledge bases, building LIMS/FIMS, supporting the introduction of bioinformatics analysis platforms within laboratories, training personnel, advising (monthly to yearly, etc.), and giving lectures.
We support students in acquiring specialized knowledge and preparing for graduate school in fields that specialize in or make use of bioinformatics.
The languages used will be Japanese and English.
We will provide the highest quality service possible according to your requests regarding delivery dates, research plans, budgets, etc.
Analysis results will be provided online as electronic files.
Payment will be made by bank transfer.