
Datasets

Predicting Media Memorability – MediaEval 2020

The data is composed of 6,000 short videos retrieved from the TRECVid 2019 Video to Text dataset. Each video forms a coherent unit in terms of meaning and is associated with two memorability scores, which reflect its probability of being remembered after two different durations of memory retention. As in previous editions of the task, memorability has been measured using recognition tests, i.e., through an objective measure, taken a few minutes after memorization of the videos (short term) and then again 24 to 72 hours later (long term).

A subset of the dataset is now available, comprising 590 videos as part of the training set. The ground truth of the development data will be enhanced with more annotators per video when the test data is released. This will make it possible to examine whether increasing annotator agreement has a direct influence on prediction quality. Nevertheless, methods should cope with lower annotator agreement, which is inherent to such subjective tasks.

The videos are shared under Creative Commons licenses that allow their redistribution. They come with a set of pre-extracted features, such as Aesthetic Features, C3D, Captions, Colour Histograms, HMP, HoG, the Fc7 layer from InceptionV3, LBP, and ORP. Compared with the videos used in this task in 2018 and 2019, the TRECVid videos contain much more action and are therefore more engaging for subjects to view.
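Since each video carries a short-term and a long-term memorability score, systems are typically compared by how well their predicted scores rank the videos against the ground truth. Below is a minimal sketch of such a ranking-based comparison using Spearman's rank correlation, implemented in plain Python; it is an illustration of the general idea, not the task's official evaluation script.

```python
def ranks(values):
    """Return average 1-based ranks for a list of scores (ties share a rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend over a group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(pred, truth):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rp, rt = ranks(pred), ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    st = sum((b - mt) ** 2 for b in rt) ** 0.5
    return cov / (sp * st)

# Predictions that order the videos exactly like the ground truth score 1.0.
print(spearman([0.4, 0.9, 0.7], [0.5, 0.8, 0.6]))  # → 1.0
```

A perfect score of 1.0 only requires the correct ordering of videos, not the exact memorability values, which is why rank correlation suits this kind of subjective annotation.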

You can find more details here.

ImageCLEFcoral 2020

The data for this task originates from a growing, large-scale collection of images taken from coral reefs around the world as part of a coral reef monitoring project with the Marine Technology Research Unit at the University of Essex.
Substrates of the same type can have very different morphologies, color variation and patterns. Some of the images contain a white line (scientific measurement tape) that may occlude part of the entity. The quality of the images is variable: some are blurry, and some have poor color balance. This is representative of the Marine Technology Research Unit dataset, and all images are useful for data analysis. The images contain annotations of the following 13 types of substrate: Hard Coral – Branching, Hard Coral – Submassive, Hard Coral – Boulder, Hard Coral – Encrusting, Hard Coral – Table, Hard Coral – Foliose, Hard Coral – Mushroom, Soft Coral, Soft Coral – Gorgonian, Sponge, Sponge – Barrel, Fire Coral – Millepora and Algae – Macro or Leaves.
The test data contains images from four different locations:

  • the same location as the training set
  • a location similar to the training set
  • a location geographically similar to the training set
  • a location geographically distinct from the training set

You can find more details here.

ImageCLEFcaption 2020

From the PubMed Open Access subset containing 1,828,575 archives, a total of 6,031,814 image–caption pairs were extracted. To focus on radiology images and non-compound figures, automatic filtering with deep learning systems as well as manual revision was applied. In ImageCLEF 2020, additional information regarding the modalities of all 80,747 images will be distributed.

You can find more details here.

ImageCLEFcoral 2019

The data for this task originates from a growing, large-scale collection of images taken from coral reefs around the world as part of a coral reef monitoring project with the Marine Technology Research Unit at the University of Essex.
Substrates of the same type can have very different morphologies, color variation and patterns. Some of the images contain a white line (scientific measurement tape) that may occlude part of the entity. The quality of the images is variable: some are blurry, and some have poor color balance. This is representative of the Marine Technology Research Unit dataset, and all images are useful for data analysis. The images contain annotations of the following 13 types of substrate: Hard Coral – Branching, Hard Coral – Submassive, Hard Coral – Boulder, Hard Coral – Encrusting, Hard Coral – Table, Hard Coral – Foliose, Hard Coral – Mushroom, Soft Coral, Soft Coral – Gorgonian, Sponge, Sponge – Barrel, Fire Coral – Millepora and Algae – Macro or Leaves.

The training set contains 240 images with 6,670 annotated substrates. Two files with ground-truth annotations are provided: one based on bounding boxes, “imageCLEFcoral2019_annotations_training_task_1”, and a more detailed one based on bounding polygons, “imageCLEFcoral2019_annotations_training_task_2”. The test set contains 200 images.
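The two annotation styles are related: every bounding polygon implies an axis-aligned bounding box, so polygon annotations can be reduced to box annotations when needed. A minimal sketch, assuming a polygon is available as a list of (x, y) points (the actual file layout is documented on the task page):

```python
def polygon_to_bbox(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a polygon.

    `points` is an assumed in-memory representation: a list of (x, y)
    vertex tuples parsed from the annotation file.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

# Example: a triangular substrate annotation collapses to its enclosing box.
print(polygon_to_bbox([(10, 40), (60, 15), (35, 80)]))  # → (10, 15, 60, 80)
```

The reverse is not possible, which is why the polygon file is described as the more detailed of the two.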

You can find more details here.

ImageCLEFcaption 2019

From the PubMed Open Access subset containing 1,828,575 archives, a total of 6,031,814 image–caption pairs were extracted. To focus on radiology images and non-compound figures, automatic filtering with deep learning systems as well as manual revision was applied, reducing the dataset to 70,786 radiology images spanning several medical imaging modalities.
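A caption-prediction system ultimately consumes these pairs as a mapping from image identifier to caption text. As a minimal sketch, assuming the pairs are distributed as tab-separated "figure_id TAB caption" lines (the exact layout is specified in the task's distribution files, so treat this format as a placeholder):

```python
import csv
import io

def load_pairs(text):
    """Parse assumed tab-separated image-caption lines into an id->caption dict."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}

# Hypothetical sample rows in the assumed format.
sample = "PMC0000_fig1\tChest X-ray, frontal view.\nPMC0000_fig2\tAxial CT slice."
pairs = load_pairs(sample)
print(len(pairs))  # → 2
```

Using `csv.reader` rather than naive `split("\t")` keeps the parsing robust if captions are quoted in the source file.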

You can find more details here.

A corpus of violence acts in Arabic social media (LREC 2016)

In this paper, Alhelbawy, Kruschwitz and Poesio present a new corpus of Arabic tweets mentioning some form of violent event, developed to support the automatic identification of human rights abuses and other violent acts. The dataset was manually labelled for seven classes of violence using crowdsourcing. Only tweets classified with a high degree of agreement were included in the final dataset.