TagTell: Automatic tag detection & extraction in large scale text corpora

[fusion_builder_container type=”flex” hundred_percent=”no” hundred_percent_height=”no” min_height_medium=”” min_height_small=”” min_height=”” hundred_percent_height_scroll=”no” align_content=”stretch” flex_align_items=”flex-start” flex_justify_content=”center” flex_column_spacing=”” hundred_percent_height_center_content=”yes” equal_height_columns=”no” container_tag=”div” menu_anchor=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” status=”published” publish_date=”” class=”” id=”” margin_top_medium=”” margin_bottom_medium=”” margin_top_small=”” margin_bottom_small=”” margin_top=”” margin_bottom=”” padding_top_medium=”” padding_right_medium=”” padding_bottom_medium=”” padding_left_medium=”” padding_top_small=”” padding_right_small=”” padding_bottom_small=”” padding_left_small=”” padding_top=”” padding_right=”” padding_bottom=”” padding_left=”” link_color=”” hue=”” saturation=”” lightness=”” alpha=”” link_hover_color=”” border_sizes_top=”” border_sizes_right=”” border_sizes_bottom=”” border_sizes_left=”” border_color=”” border_style=”solid” box_shadow=”no” box_shadow_vertical=”” box_shadow_horizontal=”” box_shadow_blur=”0″ box_shadow_spread=”0″ box_shadow_color=”” box_shadow_style=”” z_index=”” overflow=”” gradient_start_color=”” gradient_end_color=”” gradient_start_position=”0″ gradient_end_position=”100″ gradient_type=”linear” radial_direction=”center center” linear_angle=”180″ background_color=”” background_image=”” skip_lazy_load=”” background_position=”center center” background_repeat=”no-repeat” fade=”no” background_parallax=”none” enable_mobile=”no” parallax_speed=”0.3″ background_blend_mode=”none” video_mp4=”” video_webm=”” video_ogv=”” video_url=”” video_aspect_ratio=”16:9″ video_loop=”yes” video_mute=”yes” video_preview_image=”” render_logics=”” absolute=”off” absolute_devices=”small,medium,large” sticky=”off” sticky_devices=”small-visibility,medium-visibility,large-visibility” sticky_background_color=”” sticky_height=”” sticky_offset=”” sticky_transition_offset=”0″ scroll_offset=”0″ animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” filter_hue=”0″ filter_saturation=”100″ filter_brightness=”100″ filter_contrast=”100″ filter_invert=”0″ filter_sepia=”0″ filter_opacity=”100″ filter_blur=”0″ filter_hue_hover=”0″ filter_saturation_hover=”100″ filter_brightness_hover=”100″ filter_contrast_hover=”100″ filter_invert_hover=”0″ filter_sepia_hover=”0″ filter_opacity_hover=”100″ filter_blur_hover=”0″][fusion_builder_row][fusion_builder_column type=”1_1″ layout=”1_1″ align_self=”auto” content_layout=”column” align_content=”flex-start” valign_content=”flex-start” content_wrap=”wrap” spacing=”” center_content=”no” column_tag=”div” link=”” target=”_self” link_description=”” min_height=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” sticky_display=”normal,sticky” class=”” id=”” type_medium=”” type_small=”” order_medium=”0″ order_small=”0″ dimension_spacing_medium=”” dimension_spacing_small=”” dimension_spacing=”” dimension_margin_medium=”” dimension_margin_small=”” margin_top=”” margin_bottom=”” padding_medium=”” padding_small=”” padding_top=”” padding_right=”” padding_bottom=”” padding_left=”” hover_type=”none” border_sizes=”” border_color=”” border_style=”solid” border_radius=”” box_shadow=”no” dimension_box_shadow=”” box_shadow_blur=”0″ box_shadow_spread=”0″ box_shadow_color=”” box_shadow_style=”” overflow=”” background_type=”single” gradient_start_color=”” gradient_end_color=”” gradient_start_position=”0″ gradient_end_position=”100″ gradient_type=”linear” radial_direction=”center center” linear_angle=”180″ background_color=”” background_image=”” background_image_id=”” lazy_load=”avada” skip_lazy_load=”” background_position=”left top” background_repeat=”no-repeat” background_blend_mode=”none” render_logics=”” filter_type=”regular” filter_hue=”0″ filter_saturation=”100″ filter_brightness=”100″ filter_contrast=”100″ filter_invert=”0″ filter_sepia=”0″ filter_opacity=”100″ filter_blur=”0″ filter_hue_hover=”0″ filter_saturation_hover=”100″ filter_brightness_hover=”100″ filter_contrast_hover=”100″ filter_invert_hover=”0″ filter_sepia_hover=”0″ filter_opacity_hover=”100″ filter_blur_hover=”0″ animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=”” last=”true” border_position=”all” first=”true”][fusion_youtube id=”xoRx556foYE” alignment=”center” width=”” height=”” autoplay=”false” api_params=”” title_attribute=”” video_facade=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” class=”” css_id=”” /][fusion_text columns=”” column_min_width=”” column_spacing=”” rule_style=”” rule_size=”” rule_color=”” hue=”” saturation=”” lightness=”” alpha=”” content_alignment_medium=”” content_alignment_small=”” content_alignment=”” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” sticky_display=”normal,sticky” class=”” id=”” margin_top=”” margin_right=”” margin_bottom=”” margin_left=”” fusion_font_family_text_font=”” fusion_font_variant_text_font=”” font_size=”” line_height=”” letter_spacing=”” text_transform=”” text_color=”” animation_type=”” animation_direction=”left” animation_speed=”0.3″ animation_offset=””]

MARKET NEED

Many companies face an increase in data volume of unstructured text documents from customer feedback (surveys, customer support, chatlines, correspondence) or from company-specific documentation.

There is a need to quickly understand the content of such data and its main topics. User-tagging of large volume of text data entries is expensive, time-consuming and user-dependent.

A solution is to automatically assign tags to such datasets. Tags can be extracted directly from the text data or can represent text content at a semantic level.

Tags can have multiple uses:

Re-structuring data collections to support classification of data;
Information retrieval based on new and more relevant attributes;
Finding documents linked through common tags.

TagTell visualization illustrating most frequent tags extracted from a data set of financial news articles

TECHNOLOGY SOLUTION

TagTell is an interactive web application based on unsupervised key-phrase extraction methods and Natural Language Processing.

Three domain-independent, unsupervised approaches are implemented:

Key-phrase extraction from text & various ranking mechanisms (term frequency, heuristics and graph-based co-occurrence);
Similarity-based clustering of word embeddings;
Using external resources for tag refinement and augmentation (Wordnet lexicon hierarchy).

TagTell approaches for automatic tag extraction using various key-phrase extraction methods.

KEY FEATURES

The TagTell demonstrator provides an interactive platform for automatic tag extraction enabling:

Tag extraction, ranking and visualization as a network graph;
Tags updates (select, delete, replace tags with user-defined tags);
Drill-down to find all documents with a certain tag;
Download the data with added tags.

RESEARCH TEAM

Jayadeep Kumar Sasikumar, DIT
Jimmy Doré , DIT
Dr. Tamara Matthews, DIT
Prof. Sarah Jane Delany, DIT

[/fusion_text][/fusion_builder_column][/fusion_builder_row][/fusion_builder_container]