A visual language is a form of communication that relies on visual symbols beyond text to convey information. It is ubiquitous in our digital lives in the form of iconography, infographics, charts, plots, and diagrams, and it extends into the physical world in street signs, comic books, food labels, and more. Enabling computers to better understand these types of media can therefore aid scientific communication and discovery, accessibility, and data transparency.
Although computer vision models have made tremendous progress using learning-based solutions since the advent of ImageNet, the focus has been on natural images, where all kinds of tasks, such as classification, visual question answering (VQA), captioning, detection, and segmentation, have been defined, studied, and in some cases advanced to reach human performance. Visual language, however, has not received this level of attention, perhaps because of the lack of large-scale training sets in this space. But over the last few years, new academic datasets have been created with the goal of evaluating question-answering systems on visual language images, such as PlotQA, InfographicsVQA, and ChartQA.
|Example from ChartQA. Answering the question requires reading the information and calculating the sum and difference.|
Existing models built for these tasks relied on integrating optical character recognition (OCR) information and their coordinates into larger pipelines, but the process is error-prone, slow, and generalizes poorly. These methods were prevalent because existing computer vision models based on convolutional neural networks (CNNs) or transformers pre-trained on natural images cannot be easily adapted to visual language. But existing models are ill-prepared for the challenges of answering questions about charts, which include reading the relative heights of bars or the angles of segments in pie charts, understanding axis scales, correctly mapping pictograms to their legend values for color, size, and texture, and finally performing numerical operations on the extracted numbers.
In light of these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. MatCha, which stands for math and charts, is a pixels-to-text foundation model (a pre-trained model with built-in inductive biases that can be fine-tuned for multiple applications) trained on two complementary tasks: (a) chart derendering and (b) math reasoning. In chart derendering, given a plot or chart, the image-to-text model is required to generate its underlying data table or the code used to render it. For math reasoning pre-training, we pick textual numerical reasoning datasets and render the inputs into images, which the image-to-text model needs to decode to produce answers. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables. With these methods we surpass the previous state of the art on ChartQA by more than 20%, and match the best summarization systems that have 1000x more parameters. Both papers will be presented at ACL 2023.
Chart derendering
Plots and charts are usually generated with an underlying data table and a piece of code. The code defines the overall layout of the figure (e.g., chart type, orientation, color/shape scheme), and the underlying data table defines the actual numbers and their groupings. Both the data and the code are sent to a compiler/rendering engine to create the final image. Understanding a chart requires discovering the visual patterns in the image and effectively parsing and grouping them to extract the key information. Reversing the plot-rendering process demands all of these capabilities and can thus serve as an ideal pre-training task.
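To make the forward rendering process concrete, here is a minimal sketch in matplotlib; the data values and plotting options are made up for illustration and are not from the actual pre-training corpus:

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# The underlying data table defines the actual numbers and groupings
# (illustrative values only).
years = [2007, 2008, 2009, 2010]
orders = [33, 9, 4, 32]

# The code defines the overall layout: chart type, orientation, colors.
fig, ax = plt.subplots()
ax.bar(years, orders, color="steelblue")
ax.set_xlabel("Year")
ax.set_ylabel("Orders")

# The rendering engine compiles data + code into the final image.
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
print(png_bytes[:8] == b"\x89PNG\r\n\x1a\n")  # True: a valid PNG image
```

Derendering inverts exactly this step: given only the rendered image, the model must recover the table (or the code) that produced it.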
|A chart created from a table on the Airbus A380 Wikipedia page using random plotting options. MatCha’s pre-training task consists of recovering the source table or the source code from the image.|
In practice, it is difficult to simultaneously obtain charts, their underlying data tables, and their rendering code. To collect sufficient pre-training data, we independently accumulate [chart, code] and [chart, table] pairs. For [chart, code], we crawl all GitHub IPython notebooks with appropriate licenses and extract blocks that produce figures; the figure and the code block right before it are saved as a [chart, code] pair. For [chart, table] pairs, we explored two sources. The first source is synthetic data: we write code by hand to convert web-crawled Wikipedia tables from the TaPas codebase into charts, sampling and combining several plotting options depending on the column types. In addition, we also add [chart, table] pairs generated in PlotQA to diversify the pre-training corpus. The second source is web-crawled [chart, table] pairs: we directly use the [chart, table] pairs crawled in the ChartQA training set, which contains around 20k pairs in total from four websites: Statista, Pew, Our World in Data, and the OECD.
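The synthetic-pair idea can be sketched as follows: given a table, sample random plotting options conditioned on the column types to produce a plotting spec, which together with the table defines one [chart, table] training pair. The option names and sampling logic below are hypothetical assumptions, not the actual codebase:

```python
import random

def sample_plot_spec(header, rows, rng):
    """Sample random plotting options for a table, conditioned on column types.

    Returns a (spec, table) pair; rendering spec + table with a plotting
    library would yield the image half of a [chart, table] pair.
    """
    numeric_cols = [i for i in range(len(header))
                    if all(isinstance(r[i], (int, float)) for r in rows)]
    # A pie chart only makes sense when there is a single numeric column.
    chart_types = ["bar", "line"] if len(numeric_cols) > 1 else ["bar", "pie"]
    spec = {
        "type": rng.choice(chart_types),
        "orientation": rng.choice(["vertical", "horizontal"]),
        "palette": rng.choice(["muted", "pastel", "dark"]),
    }
    return spec, {"header": header, "rows": rows}

rng = random.Random(0)  # seeded for reproducibility
spec, table = sample_plot_spec(["Year", "Orders"], [[2007, 33], [2008, 9]], rng)
print(spec["type"] in {"bar", "line", "pie"})  # True
```

Sampling the options independently per table is what diversifies the corpus: the same underlying numbers appear under many different visual layouts.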
We incorporate numerical reasoning knowledge into MatCha by learning math reasoning skills from textual math datasets. We use two existing textual math reasoning datasets for pre-training: MATH and DROP. MATH is synthetically created, containing two million training examples per module (type) of question. DROP is a reading-comprehension-style QA dataset where the input is a paragraph of context and a question.
To solve questions in DROP, the model needs to read the passage, extract relevant numbers, and perform numerical computation. We found the two datasets to be complementary: MATH contains large amounts of questions across categories, which helps us identify the math operations that need to be explicitly injected into the model, while DROP’s reading-comprehension format resembles the typical QA format, in which models simultaneously perform information extraction and reasoning. In practice, we render the inputs of both datasets into images, and the model is trained to decode the answer.
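Rendering a textual example into an image can be sketched with Pillow; the font, layout, and canvas size here are illustrative assumptions, not the actual pre-training pipeline:

```python
from PIL import Image, ImageDraw

def render_text_as_image(text, width=512, height=256):
    """Rasterize a textual example so an image-to-text model can consume it."""
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill="black")
    return img

# A DROP-style input; the model is trained to decode the answer ("34")
# from the rendered image alone.
example = ("Passage: The home team scored 21 points in the first half "
           "and 13 in the second.\n"
           "Question: How many points did they score in total?")
img = render_text_as_image(example)
print(img.size)  # (512, 256)
```

Because both the chart-derendering and math-reasoning inputs are images, a single pixels-to-text model can be pre-trained on all of them with one interface.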
|To improve MatCha’s mathematical reasoning skills, we include examples from MATH and DROP in the pre-training objective by representing the input text as an image.|
We use the Pix2Struct model backbone, an image-to-text transformer tailored for website understanding, and pre-train it with the two tasks described above. We demonstrate the strengths of MatCha by fine-tuning it on several visual language tasks: question answering and summarization tasks involving charts and plots where no access to the underlying data table is possible. MatCha surpasses previous models by a large margin and also outperforms the prior state of the art, which assumes access to the underlying data tables.
In the figure below, we first evaluate two baseline models that incorporate information from an OCR pipeline, which until recently was the standard approach for working with charts. The first is based on T5, the second on VisionTaPas. We also compare with PaLI-17B, a large (~1000x larger than the other models) image-plus-text-to-text transformer trained on a diverse set of tasks but with limited capabilities for reading text and other forms of visual language. Finally, we report the Pix2Struct and MatCha model results.
|Experimental results on two chart QA benchmarks, ChartQA and PlotQA (using relaxed accuracy), and a chart summarization benchmark, Chart-to-Text (using BLEU4). MatCha surpasses the state of the art by a large margin in QA and matches the performance of much larger models on summarization.|
For QA datasets, we use the official relaxed accuracy metric that allows for small relative errors in numerical outputs. For the chart-to-text summarization task, we report BLEU scores. MatCha achieves noticeably improved results for question answering, and comparable results to PaLI for summarization, where PaLI’s large size and extensive pre-training on long text/caption generation are favorable for this type of long-form generation.
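The relaxed accuracy convention can be sketched as follows; the 5% tolerance is the commonly used setting, but treat this as a simplified sketch rather than the exact official scorer:

```python
def relaxed_accuracy(prediction, target, tolerance=0.05):
    """Exact match for text; numeric answers may deviate by up to 5% (relative)."""
    try:
        pred, tgt = float(prediction), float(target)
    except ValueError:
        # Non-numeric answers fall back to (case-insensitive) exact match.
        return prediction.strip().lower() == target.strip().lower()
    if tgt == 0.0:
        return pred == tgt
    return abs(pred - tgt) / abs(tgt) <= tolerance

print(relaxed_accuracy("102", "100"))    # True: within 5% of the target
print(relaxed_accuracy("110", "100"))    # False: 10% off
print(relaxed_accuracy("Asia", "asia"))  # True: exact string match
```

The relative tolerance matters because many chart answers are read off a pixel grid, so small extraction errors should not be scored as outright failures.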
Derendering plus large language model chains
While extremely performant for their parameter counts, particularly on extractive tasks, we observed that fine-tuned MatCha models can still struggle with end-to-end complex reasoning (e.g., mathematical operations involving large numbers or many steps). We thus also propose a two-step method to tackle this: (1) a model reads a chart, then outputs the underlying table; (2) a large language model (LLM) reads this output and then tries to answer the question based solely on the textual input.
For the first model, we fine-tuned MatCha solely on the chart-to-table task, increasing the output sequence length to guarantee that it can recover all or most of the information in the chart. DePlot is the resulting model. In the second stage, any LLM (such as FlanPaLM or Codex) can be used for the task, and we can rely on standard methods to increase performance on LLMs, such as chain of thought and self-consistency. We also experimented with program of thoughts, where the model produces executable Python code to offload complex computations.
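The interface between the two stages can be sketched as plain text: the first stage emits a linearized table, and the second stage wraps it in an LLM prompt. The table format and prompt wording below are illustrative assumptions:

```python
def linearize_table(header, rows):
    """Serialize a table as text, one row per line with '|' separators —
    the kind of flat output a plot-to-table model can decode token by token."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def build_qa_prompt(table_text, question):
    """Stage two: the LLM answers from the textual table alone."""
    return ("Read the table below and answer the question.\n\n"
            f"{table_text}\n\n"
            f"Question: {question}\nAnswer:")

# Hypothetical table a plot-to-table model might recover from a bar chart.
table_text = linearize_table(["Year", "Orders"], [[2007, 33], [2008, 9]])
prompt = build_qa_prompt(table_text,
                         "How many more orders were placed in 2007 than in 2008?")
print(prompt.splitlines()[2])  # "Year | Orders"
```

Because the second stage consumes only text, the LLM can be swapped freely, and prompting techniques such as chain of thought apply unchanged.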
|Illustration of the DePlot+LLM method. This is a real example using FlanPaLM and Codex. The blue boxes are inputs to the LLMs and the red boxes contain the answers generated by the LLMs. In each answer, we highlight some of the key reasoning steps.|
As shown in the example above, the DePlot model combined with LLMs outperforms fine-tuned models by a significant margin, especially in the human-sourced portion of ChartQA, where the questions are more natural but demand more complex reasoning. Furthermore, DePlot+LLM can do so without access to any training data.
We’ve released the new models and code in our GitHub repo, where you can try them out yourself in Colab. Check out the MatCha and DePlot papers for more details on the experimental results. We hope that our results can benefit the research community and make the information in charts and plots more accessible to everyone.
This work was carried out by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen and Yasemin Altun from our language team as part of Fangyu’s internship project. Nigel Collier from Cambridge was also a collaborator. We would like to thank Joshua Howland, Alex Polozov, Shrestha Basu Mallick, Massimo Nicosia and William Cohen for their valuable comments and suggestions.