Each folder in the SEWA database may contain the following types of data / annotations:

  1. Metadata: metadata of the folder, saved in a file named ‘metadata.ini’. The main purpose of the file is to provide an index of the data files saved in the folder. The file usually also provides contextual information (e.g., session number, subject number, folder category) and the subject’s demographic information. When the folder contains a segment cut from a full recording, the metadata also includes fields specifying the segment’s location (given by its first and last frame numbers) in the original recording. A parsing sketch is given after this list.
  2. Audio: audio data recorded during the experiment. The data is saved in a ‘*.wav’ file in most cases, or in a file named ‘*-Audio.zip’ when the folder contains multiple episodes / templates.
  3. Video: video data recorded during the experiment. The data is saved in a ‘*.avi’ file in most cases, or in a file named ‘*-Video.zip’ when the folder contains multiple episodes / templates. The video data has an effective resolution of either 320x240 (for video-chat recordings) or 480x360 (for advert-watching recordings) pixels and an effective frame rate of 20 to 30 fps. The video is compressed using the Xvid codec (https://www.xvid.com/). A reading sketch is given after this list.
  4. Log: event log generated by the audio-visual recording website. The log mainly contains timestamps that may be used for synchronisation purposes.
  5. Registration Information: the subject’s answers to the questions shown on the recording website’s registration page, saved in a file named ‘Registration_*.json’. The file mainly contains the subject’s demographic information and his / her answers to the 5-question personality test.
  6. Answers to the Questionnaire: the subject’s answers to the questionnaire shown after the advert or the video-chat session. These answers are saved in a file named ‘*_questionnaire.json’.
  7. Facial Landmarks: the per-frame facial landmark locations, saved in a file named ‘*-Landmarks.zip’. Within the zip file, there is one ‘*.txt’ file corresponding to each frame in the video (note that frame numbers start from 1). Each text file contains three lines of numbers. The first line gives the face orientation in terms of pitch, yaw and roll (in degrees). The second and the third lines give the coordinates (in the order ‘x1 y1 x2 y2 x3 y3 …’) of the eye points and the 49 facial landmarks, respectively. All numbers are set to -1 when no face is present in the frame. A parsing sketch is given after this list.
  8. LLD Features: the low-level descriptor (LLD) features extracted from the audio data, saved in a file named ‘*-LLD.zip’. Two files can be found in the zip file: the one ending with ‘-ComPareELLD.arff’ contains 65 LLDs per frame with a frame step of 10 ms, and the one ending with ‘-GeMAPSv01aLLD’ contains 18 LLDs. Details about the LLDs can be found in the openSMILE documentation for the ComParE and GeMAPS feature sets. A loading sketch is given after this list.
  9. Hand Gesture: hand gesture annotations, saved in a file named ‘*-HandGesture.csv’. The hand gestures are annotated in 5-frame steps. For each annotated frame (note that frame numbers start from 1), we give information about whether the hand is visible (‘hand_not_visible’), whether the hand is static (‘static_hand’), whether the subject is touching his / her head (‘static_hand_at_head’), and whether the subject is performing a gesture (‘dynamic_gesturing_hand’ and ‘dynamic_no_handgesture’). A parsing sketch is given after this list.
  10. Head Gesture: head gesture annotations, saved in a file named ‘*-HeadGesture.csv’. Only unambiguously displayed head nods and head shakes are annotated. In the CSV file, each head gesture is marked by the frame numbers (starting from 1) of its first and last frames (both inclusive). The CSV file may be empty if no head nod / shake is present in the whole duration of the video. An expansion sketch is given after this list.
  11. Transcript: the audio transcript, saved in a file named ‘*-Transcript.csv’. The file marks every sentence’s start and end times (in seconds), the speaker’s subject ID, and the verbal content. Some non-verbal utterances, such as laughter, coughing, and breathing sounds, are also marked.
  12. Valence, Arousal, and Liking / Disliking: the subject’s continuous-valued valence, arousal and liking / disliking levels, annotated based on the audio / video data. The annotations were provided by 5 annotators from the same culture as the recorded subject. The annotation was performed in real time using a joystick: the annotators were asked to push / pull the joystick according to their perception of the subject’s level of valence, arousal, or liking / disliking (toward the advert) while watching the recording. The joystick position (a value between -1000 and 1000) was sampled at 66 Hz and saved to the result file. For the segments included in the basic SEWA dataset, the annotation task was repeated three times: first on audio data only, then on video data only, and finally on audio-visual data. The annotation results on valence, arousal, and liking / disliking are saved in files named ‘*-Valence.zip’, ‘*-Arousal.zip’ and ‘*-Liking.zip’, respectively. Every zip file contains multiple CSV files, each storing the result produced by one annotator. The suffix of the CSV file name marks the annotator number and whether the annotation was performed on audio data (A), video data (V) or audio-visual data (AV). For instance, the suffix ‘_AV3’ indicates that the CSV file stores the annotation result on audio-visual data given by annotator #3. A sketch of parsing these suffixes is given after this list.
  13. Template Behaviours: template behaviours of low / high valence, low / high arousal and liking / disliking displayed by subjects from different cultures. These templates were selected by annotators from the same culture as the subjects. The template annotations are saved in files named ‘-Catalogue.csv’, which specify the source recordings of the templates and the templates’ locations in them.
  14. Episodes of Agreement / Disagreement: episodes of subjects showing low / mid / high-intensity agreement or disagreement during video-chat. The episodes were selected by annotators from the same culture as the recorded subjects. The agreement / disagreement annotations are saved in files named ‘-Catalogue.csv’, which specify the source recordings of the episodes and the episodes’ locations in them.
  15. Episodes of Mimicry: episodes selected from the video-chat recordings in which one subject mimics the behaviour of the other. The annotations are saved in files named ‘-Catalogue.csv’. Each entry gives the source recording of the episode, the location of the episode in the source recording, and the point at which subject2 starts to mimic subject1.
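
The ‘metadata.ini’ files (item 1) use standard INI syntax, so they can be read with Python’s configparser. A minimal sketch; the actual section and key names vary between folders, so the code below simply enumerates whatever is present:

```python
import configparser

# Read a folder's metadata.ini (standard INI syntax).
config = configparser.ConfigParser()
config.read('metadata.ini')

# Enumerate all sections and keys; inspect config.sections() to
# discover the actual layout of a given folder.
for section in config.sections():
    for key, value in config[section].items():
        print(f'{section}.{key} = {value}')
```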
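For the video files (item 3), any FFmpeg-backed reader can decode the Xvid-compressed AVIs. A minimal sketch using OpenCV, with a placeholder file name:

```python
import cv2

# Open an Xvid-compressed AVI (decoded via OpenCV's FFmpeg backend).
cap = cv2.VideoCapture('recording.avi')  # placeholder file name
print('resolution:', int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), 'x',
      int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
print('fps:', cap.get(cv2.CAP_PROP_FPS))

frame_number = 0  # the SEWA annotations count frames from 1
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_number += 1  # aligns with, e.g., the landmark file for this frame
cap.release()
print('frames read:', frame_number)
```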
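For the facial landmarks (item 7), each per-frame ‘*.txt’ file can be parsed line by line once extracted from ‘*-Landmarks.zip’. A minimal sketch following the three-line layout described above:

```python
def read_landmark_file(path):
    """Parse one per-frame '*.txt' file from a '*-Landmarks.zip' archive."""
    with open(path) as f:
        lines = [line.split() for line in f]
    # Line 1: face orientation as pitch, yaw, roll (in degrees).
    pitch, yaw, roll = (float(v) for v in lines[0])
    # Lines 2 and 3: coordinates in the order 'x1 y1 x2 y2 ...' for the
    # eye points and the 49 facial landmarks, respectively.
    eyes = [float(v) for v in lines[1]]
    marks = [float(v) for v in lines[2]]
    eye_points = list(zip(eyes[0::2], eyes[1::2]))
    landmarks = list(zip(marks[0::2], marks[1::2]))
    # All numbers are set to -1 when no face is present in the frame.
    face_present = any(v != -1 for v in [pitch, yaw, roll] + eyes + marks)
    return face_present, (pitch, yaw, roll), eye_points, landmarks
```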
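For the LLD features (item 8), the ARFF files can be loaded straight from the archive, e.g. with scipy’s ARFF reader. A minimal sketch; the archive name is a placeholder:

```python
import io
import zipfile
from scipy.io import arff

# Open an ARFF file inside the LLD archive without extracting it first.
with zipfile.ZipFile('Recording-LLD.zip') as zf:   # placeholder archive name
    member = zf.namelist()[0]                      # e.g. the ComParE LLD file
    with zf.open(member) as raw:
        data, meta = arff.loadarff(io.TextIOWrapper(raw, encoding='utf-8'))

print(meta.names())  # attribute (LLD) names
print(len(data))     # number of frames (10 ms step for the ComParE LLDs)
```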
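For the hand gesture annotations (item 9), the CSV can be read with the standard csv module. A minimal sketch that assumes the labels listed above appear as column names alongside a ‘frame’ column, with 0 / 1 flag values; the exact header layout is an assumption:

```python
import csv

# Read '*-HandGesture.csv'; the column names below are the annotation
# labels listed above, but the header layout and the 0 / 1 encoding of
# the flags are assumptions.
with open('Recording-HandGesture.csv', newline='') as f:  # placeholder name
    for row in csv.DictReader(f):
        if row.get('hand_not_visible') == '1':
            continue  # nothing to analyse for this 5-frame step
        if row.get('dynamic_gesturing_hand') == '1':
            print('gesture at frame', row.get('frame'))   # frames count from 1
```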
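For the head gesture annotations (item 10), each gesture is stored as an inclusive [first, last] frame range, so a common preprocessing step is to expand the CSV into per-frame labels. A minimal sketch that assumes each row holds the gesture type followed by the first and last frame numbers, with no header row; the column order is an assumption:

```python
import csv

def per_frame_head_gestures(path, n_frames):
    """Expand '*-HeadGesture.csv' into one label per frame."""
    labels = [None] * (n_frames + 1)  # index 0 unused; frames count from 1
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue              # the file may be empty
            # Assumed layout (no header): gesture type, first frame,
            # last frame (both inclusive).
            gesture, first, last = row[0], int(row[1]), int(row[2])
            for frame in range(first, last + 1):
                labels[frame] = gesture
    return labels
```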
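For the valence / arousal / liking annotations (item 12), the annotator number and modality encoded in the file-name suffix (e.g. ‘_AV3’) can be recovered with a regular expression. A minimal sketch that groups the members of one annotation archive by modality; the archive name is a placeholder:

```python
import re
import zipfile

# The suffix before '.csv' encodes modality (A / V / AV) and annotator
# number, e.g. '..._AV3.csv' = audio-visual annotation by annotator #3.
SUFFIX = re.compile(r'_(AV|A|V)(\d+)\.csv$')

by_modality = {'A': [], 'V': [], 'AV': []}
with zipfile.ZipFile('Recording-Valence.zip') as zf:  # placeholder archive name
    for name in zf.namelist():
        m = SUFFIX.search(name)
        if m:
            modality, annotator = m.group(1), int(m.group(2))
            by_modality[modality].append((annotator, name))

# Each CSV stores the joystick trace (values in [-1000, 1000], 66 Hz).
print(by_modality['AV'])
```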