Cocoon - Data Modeling

We display the source tables from the data warehouses to model.

comment (first 100 rows)

	id	_fivetran_synced	author_id	body	created	is_public	issue_id	update_author_id	updated
0	10001	2020-11-12 12:20:53.148	557058:a9572f6a-3041-435a-96ed-5286e3811b33	Hello	2020-11-10 19:19:41.224	True	10025	557058:a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-10 19:19:41.224
1	10013	2020-11-10 19:21:48.619	557058:a9572f6a-3041-435a-96ed-5286e3811b33	To Do to In Progress 6 days 22 hours 26 minutes ago In Progress to Done 3 days 16 hours 34 minutes ago	2020-11-07 02:45:38.717	True	10035	557058:a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717
2	10012	2020-11-10 19:21:48.618	557058:a9572f6a-3041-435a-96ed-5286e3811b33	Joined Sample Sprint 2 7 days 9 hours 10 minutes ago	2020-11-07 02:45:38.717	True	10035	557058:a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717

component (first 100 rows)

	id	_fivetran_synced	description	name	project_id
0	10004	2020-11-23 12:21:02.159	NaN	Component 5	10001
1	10003	2020-11-23 12:21:02.159	NaN	Component 4	10001
2	10002	2020-11-23 12:21:02.158	NaN	Component 3	10001
3	10019	2020-11-23 12:21:02.157	NaN	Component 2	10001

epic (first 100 rows)

	id	_fivetran_synced	done	key_	name	summary
0	10014	2020-11-17 12:20:54.934	False	TP-11	Test Epic	just for testing

field (first 100 rows)

	id	_fivetran_synced	is_array	is_custom	name
0	timeoriginalestimate	2020-11-23 22:20:39.685	False	False	Original Estimate
1	creator	2020-11-23 22:20:39.711	False	False	Creator
2	issuerestriction	2020-11-23 22:20:39.643	False	False	Restrict to
3	components	2020-11-23 22:20:40.643	False	False	Components
4	summary	2020-11-23 22:20:41.553	False	False	Summary
5	epiclink	2020-11-23 22:20:41.553	False	False	Epic-Link
6	epic	2020-11-23 22:20:41.553	False	False	Epic

field_option (first 100 rows)

	id	_fivetran_synced	name
0	10104	2020-11-23 12:21:00.979	opt 2
1	10016	2020-11-17 12:20:52.552	To Do
2	10106	2020-11-17 12:20:52.227	opt 4
3	10019	2020-11-19 12:20:53.110	Impediment
4	10026	2020-11-20 12:20:54.221	This is the summary of the issue.

issue (first 100 rows)

	id	_fivetran_deleted	_fivetran_synced	_original_estimate	_remaining_estimate	_time_spent	assignee	created	creator	description	due_date	environment	issue_type	key_	last_viewed	original_estimate	parent_id	priority	project	remaining_estimate	reporter	resolution	resolved	status	status_category_changed	summary	time_spent	updated	work_ratio
0	10011	False	2020-11-06 22:23:38	NaN	NaN	NaN	NaN	2020-04-01 08:33:58	5c3326c24b248c315badf3ee	Yes, this is an issue to test permissions	NaN	NaN	10103	TP-8	None	NaN	NaN	3	10001	NaN	5c3326c24b248c315badf3ee	NaN	2020-05-29 11:33:53	3	2020-04-01 08:34:02	We want to test permissions	NaN	2020-05-29 11:33:53	-1
1	10015	False	2020-11-06 22:23:41	NaN	NaN	NaN	NaN	2020-06-11 07:38:03	5ed0edbe2999b60c23255b4f	None	NaN	NaN	10103	TP-12	None	NaN	NaN	5	10001	NaN	5ed0edbe2999b60c23255b4f	NaN	2020-06-11 07:38:03	10003	2020-06-11 07:38:03	Part of test epic	NaN	2020-06-11 07:38:03	-1
2	10018	False	2020-11-12 12:20:50	NaN	NaN	NaN	NaN	2020-11-10 19:11:45	557058:a9572f6a-3041-435a-96ed-5286e3811b33	None	NaN	NaN	10103	TCP-1	2020-11-10 19:11:46	NaN	NaN	3	10005	NaN	557058:a9572f6a-3041-435a-96ed-5286e3811b33	10000.0	2020-11-28 19:11:45	10003	2020-11-28 19:11:46	this is a story	NaN	2020-11-28 19:11:45	-1

issue_field_history (first 100 rows)

	field_id	issue_id	time_	_fivetran_synced	value_
0	created	10027	2020-11-10 19:19:41.472	2020-11-12 12:20:53.478	2020-11-10T19:19:41.472Z
1	customfield_10104	10027	2020-11-10 19:19:41.472	2020-11-12 12:20:53.472	3.0
2	summary	10027	2020-11-10 19:19:41.472	2020-11-12 12:20:53.500	As a developer, I'd like to update story status during the sprint >> Click the Active sprints link at the top right of the screen to go to the Active sprints where the current Sprint's items can be updated
3	components	10018	2020-11-10 19:19:41.472	2020-11-19 12:20:53.500	10019
4	summary	10018	2020-11-10 19:19:41.472	2020-11-19 12:20:53.500	10026
5	epic	10015	2020-11-10 19:19:41.472	2020-11-19 12:20:53.500	1111111
6	epiclink	10018	2020-11-10 19:19:41.472	2020-11-12 12:20:53.500	https://ulr-here/epic-key

issue_link (first 100 rows)

	issue_id	related_issue_id	_fivetran_synced	relationship
0	10002	10004	2020-11-06 22:23:39.405	duplicates
1	10004	10006	2020-11-23 12:21:01.461	duplicates
2	10004	10005	2020-11-23 12:21:01.454	duplicates

issue_multiselect_history (first 100 rows)

	_fivetran_id	time_	_fivetran_synced	field_id	issue_id
0	w4UiT+hPMxJp3RhL/YFJm3uWL5U=	2020-11-10 19:19:41.472	2020-11-12 12:20:53.506	subtasks	10027
1	4pVgGn0qSqR2hCmMdo4wWHXmgew=	2020-11-10 19:19:41.472	2020-11-12 12:20:53.479	customfield_10021	10027
2	/zrY8m6q0VMW6ia1jGIerXqLIgQ=	2020-11-10 19:19:41.472	2020-11-12 12:20:53.479	customfield_10020	10027
3	3p3gGn0qSqR2hCmMdo4wWHXa32m=	2020-11-10 19:19:41.472	2020-11-12 12:20:53.479	components	10027

issue_type (first 100 rows)

	id	_fivetran_synced	description	name	subtask
0	10117	2020-11-23 22:20:40.176	Bugs track problems or errors.	Bug	False
1	10106	2020-11-23 22:20:40.171	A problem or error.	Bug	False
2	10101	2020-11-23 22:20:40.177	Track large pieces of work.	Epic	False

priority (first 100 rows)

	id	_fivetran_synced	description	name
0	2	2020-11-06 22:23:39.764	Serious problem that could block progress.	High
1	5	2020-11-06 22:23:41.009	Trivial problem with little or no impact on progress.	Lowest
2	1	2020-11-20 12:20:54.303	This problem will block progress.	Highest
3	3	2020-11-25 12:20:42.395	Has the potential to affect progress.	Medium
4	4	2020-11-17 12:20:52.382	Minor problem or easily worked around.	Low

project (first 100 rows)

	id	_fivetran_synced	description	key_	lead_id	name	permission_scheme_id	project_category_id	project_type_key
0	10008	2020-11-17 12:20:50.283	NaN	TBTP	557058:a9572f6a-3041-435a-96ed-5286e3811b33	test bug tracking project	10001	NaN	software
1	10005	2020-11-12 12:20:49.801	NaN	TCP	557058:a9572f6a-3041-435a-96ed-5286e3811b33	test classique project	10001	NaN	software
2	10001	2020-11-23 12:21:02.150	NaN	TP	5c3326c24b248c315badf3ee	Test Project	10001	10000.0	software

project_board (first 100 rows)

	board_id	project_id	_fivetran_synced
0	7	10006	2020-11-22 12:20:57.685
1	5	10004	2020-11-22 12:20:59.875

project_category (first 100 rows)

	id	_fivetran_synced	description	name
0	10000	2020-11-23 12:21:02.157	will this create a project_category table?	test-category

resolution (first 100 rows)

	id	_fivetran_synced	description	name
0	10000	2020-11-22 12:20:59.338	Work has been completed on this issue.	Done

sprint (first 100 rows)

	id	_fivetran_synced	board_id	complete_date	end_date	name	start_date
0	4	2020-11-22 12:20:59.513	5	None	2020-11-26 21:25:12	TNGP Sprint 1	2020-11-12 21:25:17.778
1	1	2020-11-22 12:20:57.012	0	2020-11-10 20:52:38.262	2020-11-17 10:29:44.653	Sample Sprint 2	2020-11-03 10:09:44.653
2	3	2020-11-22 12:20:57.298	7	None	2020-11-26 21:55:00	Sample Sprint 2	2020-11-12 21:55:42.220

status (first 100 rows)

	id	_fivetran_synced	description	name	status_category_id
0	10010	2020-11-23 22:20:40.409	None	In Progress	4
1	3	2020-11-23 22:20:40.391	This issue is being actively worked on at the moment by the assignee.	In Progress	4
2	10000	2020-11-23 22:20:40.404	None	To Do	2

status_category (first 100 rows)

	id	_fivetran_synced	name
0	3	2020-11-23 22:20:40.410	Done
1	2	2020-11-23 22:20:40.411	To Do
2	4	2020-11-23 22:20:40.411	In Progress

user (first 100 rows)

	id	_fivetran_synced	email	locale	name	time_zone	username
0	557058:a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-23 12:21:02.355	NaN	en_US	Scooby Doo	America/Los_Angeles	NaN
1	5ed0edbe2999b60c23255b4f	2020-11-23 12:21:02.562	NaN	en_US	Scrappy Cornelius Doo	America/Los_Angeles	NaN
2	5c3326c24b248c315badf3ee	2020-11-23 12:21:02.757	NaN	en_US	Norville Shaggy Rogers	America/Los_Angeles	NaN

user_group (first 100 rows)

	group_name	user_id	_fivetran_synced
0	jira-software-users	5faf114ad74f69007579568f	2020-11-20 12:20:54.925
1	site-admins	5d577670dbb98c0d9c232bb2	2020-11-23 12:21:01.943
2	confluence-users	557058:a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-23 12:21:02.355

version (first 100 rows)

	id	_fivetran_synced	archived	description	name	overdue	project_id	release_date	released	start_date
0	10001	2021-06-17 14:20:59.271	False	second version upd	v2.0	True	10001	2020-06-01	False	2020-05-01
1	10003	2021-04-13 16:31:23.435	False	None	Version 2.0	True	10006	2020-11-17	False	None
2	10004	2021-04-13 16:31:23.435	False	None	Version 3.0	False	10006	None	False	None

Source tables may have typos, unclear names, incorrect column types, etc. We clean these tables.

stg_field (first 100 rows)

	field_id	is_multi_value	is_custom_field	field_name
0	timeoriginalestimate	False	False	Original Estimate
1	creator	False	False	Creator
2	issuerestriction	False	False	Restrict to
3	components	False	False	Components
4	summary	False	False	Summary
5	epiclink	False	False	Epic-Link
6	epic	False	False	Epic

stg_field.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"field_projected" AS (
    -- Projection: Selecting 4 out of 5 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "is_array",
        "is_custom",
        "name"
    FROM "field"
),

"field_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> field_id
    -- is_array -> is_multi_value
    -- is_custom -> is_custom_field
    -- name -> field_name
    SELECT 
        "id" AS "field_id",
        "is_array" AS "is_multi_value",
        "is_custom" AS "is_custom_field",
        "name" AS "field_name"
    FROM "field_projected"
)

-- COCOON BLOCK END
SELECT * FROM "field_projected_renamed"

stg_field.yml (Document the table)

version: 2
models:
- name: stg_field
  description: The table is about fields in an issue tracking system. It contains
    information on field IDs, names, and properties. Each row represents a different
    field. The "is_array" and "is_custom" columns are boolean flags indicating field
    characteristics. The table includes standard fields like summary, components,
    and creator.
  columns:
  - name: field_id
    description: Unique identifier for the field
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each field in the
        issue tracking system. For this table, each row represents a different field.
        The field_id is likely to be unique across rows as it's designed to be an
        identifier.
  - name: is_multi_value
    description: Boolean flag indicating if field can have multiple values
    tests:
    - not_null
  - name: is_custom_field
    description: Boolean flag indicating if field is custom-defined
    tests:
    - not_null
  - name: field_name
    description: Human-readable name of the field
    tests:
    - not_null

stg_project (first 100 rows)

	project_id	project_key	project_name	permission_scheme_id	project_type	category_id
0	10008	TBTP	Test Project	10001	software	None
1	10005	TCP	Test Project	10001	software	None
2	10001	TP	Test Project	10001	software	10000.0

stg_project.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"project_projected" AS (
    -- Projection: Selecting 8 out of 9 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "description",
        "key_",
        "lead_id",
        "name",
        "permission_scheme_id",
        "project_category_id",
        "project_type_key"
    FROM "project"
),

"project_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> project_id
    -- description -> project_description
    -- key_ -> project_key
    -- name -> project_name
    -- project_category_id -> category_id
    -- project_type_key -> project_type
    SELECT 
        "id" AS "project_id",
        "description" AS "project_description",
        "key_" AS "project_key",
        "lead_id",
        "name" AS "project_name",
        "permission_scheme_id",
        "project_category_id" AS "category_id",
        "project_type_key" AS "project_type"
    FROM "project_projected"
),

"project_projected_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- project_name: The problem is inconsistent capitalization and specificity in project names. "Test Project" is capitalized and generic, while "test bug tracking project" and "test classique project" are lowercase and more specific. The most frequent and generic form is "Test Project", which should be used for consistency. 
    SELECT
        "project_id",
        "project_description",
        "project_key",
        "lead_id",
        CASE
            WHEN "project_name" = 'test bug tracking project' THEN 'Test Project'
            WHEN "project_name" = 'test classique project' THEN 'Test Project'
            ELSE "project_name"
        END AS "project_name",
        "permission_scheme_id",
        "category_id",
        "project_type"
    FROM "project_projected_renamed"
),

"project_projected_renamed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- category_id: from DECIMAL to VARCHAR
    -- lead_id: from VARCHAR to UUID
    -- project_description: from DECIMAL to VARCHAR
    SELECT
        "project_id",
        "project_key",
        "project_name",
        "permission_scheme_id",
        "project_type",
        CAST("category_id" AS VARCHAR) AS "category_id",
        CASE 
            WHEN "lead_id" ~ ':[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
            THEN CAST(REGEXP_EXTRACT("lead_id", '([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$') AS UUID)
            ELSE TRY_CAST("lead_id" AS UUID)
        END AS "lead_id",
        CAST("project_description" AS VARCHAR) AS "project_description"
    FROM "project_projected_renamed_cleaned"
),

"project_projected_renamed_cleaned_casted_missing_handled" AS (
    -- Handling missing values: There are 2 columns with unacceptable missing values
    -- lead_id has 100.0 percent missing. Strategy: 🗑️ Drop Column
    -- project_description has 100.0 percent missing. Strategy: 🗑️ Drop Column
    SELECT
        "project_id",
        "project_key",
        "project_name",
        "permission_scheme_id",
        "project_type",
        "category_id"
    FROM "project_projected_renamed_cleaned_casted"
)

-- COCOON BLOCK END
SELECT * FROM "project_projected_renamed_cleaned_casted_missing_handled"

stg_project.yml (Document the table)

version: 2
models:
- name: stg_project
  description: The table is about projects. Each project has an ID, description, key,
    lead ID, name, permission scheme ID, project category ID, and project type key.
    The table includes details like project names, keys, and associated IDs. All projects
    listed are software projects with different names and keys.
  columns:
  - name: project_id
    description: Unique identifier for the project
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each project. For
        this table, each row represents a distinct project, and project_id is unique
        across rows.
  - name: project_key
    description: Short code or abbreviation for the project
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a short code or abbreviation for the project.
        For this table, each row is a distinct project, and the project_key appears
        to be unique across rows based on the sample data.
  - name: project_name
    description: Full name of the project
    tests:
    - not_null
  - name: permission_scheme_id
    description: ID of the permission scheme applied to the project
    tests:
    - not_null
  - name: project_type
    description: Key indicating the type of project
    tests:
    - not_null
    - accepted_values:
        values:
        - software
        - hardware
        - web
        - mobile
        - desktop
        - embedded
        - cloud
        - AI/ML
        - data science
        - IoT
        - cybersecurity
        - blockchain
        - game development
        - enterprise
        - e-commerce
        - social media
        - education
        - healthcare
        - fintech
        - research
        - open source
  - name: category_id
    description: ID of the category the project belongs to
    cocoon_meta:
      missing_acceptable: Not all projects require categorization.

stg_version (first 100 rows)

	version_id	is_archived	version_description	version_name	is_overdue	project_id	is_released	release_date	start_date
0	10001	False	second version upd	Version 2.0	True	10001	False	2020-06-01	2020-05-01
1	10003	False	None	Version 2.0	True	10006	False	2020-11-17	NaT
2	10004	False	None	Version 3.0	False	10006	False	NaT	NaT

stg_version.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"version_projected" AS (
    -- Projection: Selecting 9 out of 10 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "archived",
        "description",
        "name",
        "overdue",
        "project_id",
        "release_date",
        "released",
        "start_date"
    FROM "version"
),

"version_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> version_id
    -- archived -> is_archived
    -- description -> version_description
    -- name -> version_name
    -- overdue -> is_overdue
    -- released -> is_released
    SELECT 
        "id" AS "version_id",
        "archived" AS "is_archived",
        "description" AS "version_description",
        "name" AS "version_name",
        "overdue" AS "is_overdue",
        "project_id",
        "release_date",
        "released" AS "is_released",
        "start_date"
    FROM "version_projected"
),

"version_projected_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- version_name: The problem is inconsistent formatting for version 2.0. It appears as both "Version 2.0" and "v2.0". The correct values should follow the most common format, which is "Version X.0". "Version 2.0" is already in the correct format and is the most frequent, so "v2.0" should be changed to match it. 
    SELECT
        "version_id",
        "is_archived",
        "version_description",
        CASE
            WHEN "version_name" = 'v2.0' THEN 'Version 2.0'
            ELSE "version_name"
        END AS "version_name",
        "is_overdue",
        "project_id",
        "release_date",
        "is_released",
        "start_date"
    FROM "version_projected_renamed"
),

"version_projected_renamed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- release_date: from VARCHAR to DATE
    -- start_date: from VARCHAR to DATE
    SELECT
        "version_id",
        "is_archived",
        "version_description",
        "version_name",
        "is_overdue",
        "project_id",
        "is_released",
        CAST("release_date" AS DATE) AS "release_date",
        CAST("start_date" AS DATE) AS "start_date"
    FROM "version_projected_renamed_cleaned"
)

-- COCOON BLOCK END
SELECT * FROM "version_projected_renamed_cleaned_casted"

stg_version.yml (Document the table)

version: 2
models:
- name: stg_version
  description: The table is about software versions. It includes details like version
    ID, name, description, project ID, release dates, and status flags. Each row represents
    a specific version with its attributes. The table tracks whether versions are
    archived, overdue, or released, along with their start and release dates.
  columns:
  - name: version_id
    description: Unique identifier for the version
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each software version.
        For this table, each row represents a specific version of software. The version_id
        is likely to be unique across rows as it's designed to distinctly identify
        each version.
  - name: is_archived
    description: Indicates if the version is archived
    tests:
    - not_null
  - name: version_description
    description: Brief description of the version
    tests:
    - not_null
  - name: version_name
    description: Name or number of the version
    tests:
    - not_null
  - name: is_overdue
    description: Indicates if the version is past due date
    tests:
    - not_null
  - name: project_id
    description: Identifier of the project the version belongs to
    tests:
    - not_null
  - name: is_released
    description: Indicates if the version has been released
    tests:
    - not_null
  - name: release_date
    description: Planned or actual date of version release
    cocoon_meta:
      missing_acceptable: Not applicable for unreleased versions.
  - name: start_date
    description: Date when work on the version began
    tests:
    - not_null

stg_status_category (first 100 rows)

	status_id	status_name
0	3	Done
1	2	To Do
2	4	In Progress

stg_status_category.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"status_category_projected" AS (
    -- Projection: Selecting 2 out of 3 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "name"
    FROM "status_category"
),

"status_category_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> status_id
    -- name -> status_name
    SELECT 
        "id" AS "status_id",
        "name" AS "status_name"
    FROM "status_category_projected"
)

-- COCOON BLOCK END
SELECT * FROM "status_category_projected_renamed"

stg_status_category.yml (Document the table)

version: 2
models:
- name: stg_status_category
  description: The table is about status categories. It contains different stages
    of task completion. Each category has an ID and a name. The categories include
    "Done", "To Do", and "In Progress". This table likely serves as a lookup for task
    management systems to assign status to tasks or projects.
  columns:
  - name: status_id
    description: Unique identifier for each status category
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each status category.
        For this table, each row represents a distinct status category. The status_id
        is likely to be unique across rows as it serves as an identifier.
  - name: status_name
    description: Descriptive name of the status category
    tests:
    - not_null
    - unique
    - accepted_values:
        values:
        - Done
        - In Progress
        - To Do
        - Not Started
        - Blocked
        - Cancelled
        - On Hold
        - Deferred
        - Under Review
        - Completed
        - Pending Approval
    cocoon_meta:
      uniqueness: This column contains the descriptive name of the status category.
        For this table, each row represents a distinct status category. While status
        names are likely to be unique in this context, it's possible (though unlikely)
        that two different statuses could have the same name in some systems.

stg_sprint (first 100 rows)

	sprint_id	board_id	sprint_name	completion_date	end_date	start_date
0	4	5	TNGP Sprint 1	NaT	2020-11-26 21:25:12	2020-11-12 21:25:17.778
1	1	0	Sample Sprint 2	2020-11-10 20:52:38.262	2020-11-17 10:29:44	2020-11-03 10:09:44.653
2	3	7	Sample Sprint 2	NaT	2020-11-26 21:55:00	2020-11-12 21:55:42.220

stg_sprint.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"sprint_projected" AS (
    -- Projection: Selecting 6 out of 7 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "board_id",
        "complete_date",
        "end_date",
        "name",
        "start_date"
    FROM "sprint"
),

"sprint_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> sprint_id
    -- complete_date -> completion_date
    -- name -> sprint_name
    SELECT 
        "id" AS "sprint_id",
        "board_id",
        "complete_date" AS "completion_date",
        "end_date",
        "name" AS "sprint_name",
        "start_date"
    FROM "sprint_projected"
),

"sprint_projected_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- end_date: The problem is that one of the date-time values ('2020-11-17 10:29:44.653') includes milliseconds, while the others don't. The correct values should all follow the same format without milliseconds, which is the most common format in the dataset. 
    SELECT
        "sprint_id",
        "board_id",
        "completion_date",
        CASE
            WHEN "end_date" = '2020-11-17 10:29:44.653' THEN '2020-11-17 10:29:44'
            ELSE "end_date"
        END AS "end_date",
        "sprint_name",
        "start_date"
    FROM "sprint_projected_renamed"
),

"sprint_projected_renamed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- completion_date: from VARCHAR to TIMESTAMP
    -- end_date: from VARCHAR to TIMESTAMP
    -- start_date: from VARCHAR to TIMESTAMP
    SELECT
        "sprint_id",
        "board_id",
        "sprint_name",
        CAST("completion_date" AS TIMESTAMP) AS "completion_date",
        CAST("end_date" AS TIMESTAMP) AS "end_date",
        CAST("start_date" AS TIMESTAMP) AS "start_date"
    FROM "sprint_projected_renamed_cleaned"
)

-- COCOON BLOCK END
SELECT * FROM "sprint_projected_renamed_cleaned_casted"

stg_sprint.yml (Document the table)

version: 2
models:
- name: stg_sprint
  description: The table is about sprints. It contains details of each sprint, including
    an ID, associated board ID, start date, end date, and name. Some sprints have
    a complete date, while others do not. The table tracks the timeline and status
    of different sprints across various project boards.
  columns:
  - name: sprint_id
    description: Unique identifier for the sprint
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column is a unique identifier for each sprint. For this table,
        each row represents a distinct sprint, and sprint_id is unique across rows.
  - name: board_id
    description: Identifier for the associated project board
    tests:
    - not_null
  - name: sprint_name
    description: Name or title of the sprint
    tests:
    - not_null
  - name: completion_date
    description: Date when the sprint was finished
    cocoon_meta:
      missing_acceptable: Sprint may not be completed yet.
  - name: end_date
    description: Planned end date of the sprint
    tests:
    - not_null
  - name: start_date
    description: Date when the sprint began
    tests:
    - not_null

stg_project_category (first 100 rows)

	category_id
0	10000

stg_project_category.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"project_category_projected" AS (
    -- Projection: Selecting 3 out of 4 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "description",
        "name"
    FROM "project_category"
),

"project_category_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> category_id
    -- description -> category_description
    -- name -> category_name
    SELECT 
        "id" AS "category_id",
        "description" AS "category_description",
        "name" AS "category_name"
    FROM "project_category_projected"
),

"project_category_projected_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- category_description: The problem is that the category_description column contains a question about database structure instead of an actual category description. This is completely inappropriate for a category description field. The correct value should be an empty string or NULL, as we don't have the actual category description information. 
    -- category_name: The problem is that 'test-category' appears to be a placeholder value rather than a genuine category name. It's likely used as a temporary or default value during testing or development, and doesn't represent an actual product category. In a real dataset, we would expect to see meaningful category names that describe the types of products being sold. 
    SELECT
        "category_id",
        CASE
            WHEN "category_description" = 'will this create a project_category table?' THEN ''
            ELSE "category_description"
        END AS "category_description",
        CASE
            WHEN "category_name" = 'test-category' THEN ''
            ELSE "category_name"
        END AS "category_name"
    FROM "project_category_projected_renamed"
),

"project_category_projected_renamed_cleaned_null" AS (
    -- NULL Imputation: Impute Null to Disguised Missing Values
    -- category_description: ['']
    -- category_name: ['']
    SELECT 
        CASE
            WHEN "category_description" = '' THEN NULL
            ELSE "category_description"
        END AS "category_description",
        CASE
            WHEN "category_name" = '' THEN NULL
            ELSE "category_name"
        END AS "category_name",
        "category_id"
    FROM "project_category_projected_renamed_cleaned"
),

"project_category_projected_renamed_cleaned_null_missing_handled" AS (
    -- Handling missing values: There are 2 columns with unacceptable missing values
    -- category_description has 100.0 percent missing. Strategy: 🗑️ Drop Column
    -- category_name has 100.0 percent missing. Strategy: 🗑️ Drop Column
    SELECT
        "category_id"
    FROM "project_category_projected_renamed_cleaned_null"
)

-- COCOON BLOCK END
SELECT * FROM "project_category_projected_renamed_cleaned_null_missing_handled"

stg_project_category.yml (Document the table)

version: 2
models:
- name: stg_project_category
  description: The table is about project categories. It contains unique identifiers
    for each category. It stores the name of each category. It includes a description
    for each category. The table likely serves to organize and classify different
    types of projects within a system.
  columns:
  - name: category_id
    description: Unique identifier for each project category
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each project category.
        For this table, each row corresponds to a distinct project category. The category_id
        is unique across rows, as it's described as a "Unique identifier for each
        project category".

stg_issue_field_history (first 100 rows)

	field_identifier	new_field_value	change_timestamp	issue_identifier
0	created	2020-11-10T19:19:41.472Z	2020-11-10 19:19:41.472	10027
1	customfield_10104	3.0	2020-11-10 19:19:41.472	10027
2	summary	As a developer, I'd like to update story status during the sprint >> Click the Active sprints link at the top right of the screen to go to the Active sprints where the current Sprint's items can be updated	2020-11-10 19:19:41.472	10027
3	components	10019	2020-11-10 19:19:41.472	10018
4	summary	10026	2020-11-10 19:19:41.472	10018
5	epic	1111111	2020-11-10 19:19:41.472	10015
6	epiclink	https://ulr-here/epic-key	2020-11-10 19:19:41.472	10018

stg_issue_field_history.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"issue_field_history_projected" AS (
    -- Projection: Selecting 4 out of 5 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "field_id",
        "issue_id",
        "time_",
        "value_"
    FROM "issue_field_history"
),

"issue_field_history_projected_renamed" AS (
    -- Rename: Renaming columns
    -- field_id -> field_identifier
    -- issue_id -> issue_identifier
    -- time_ -> change_timestamp
    -- value_ -> new_field_value
    SELECT 
        "field_id" AS "field_identifier",
        "issue_id" AS "issue_identifier",
        "time_" AS "change_timestamp",
        "value_" AS "new_field_value"
    FROM "issue_field_history_projected"
),

"issue_field_history_projected_renamed_casted" AS (
    -- Column Type Casting: 
    -- change_timestamp: from VARCHAR to TIMESTAMP
    -- issue_identifier: from INT to VARCHAR
    SELECT
        "field_identifier",
        "new_field_value",
        CAST("change_timestamp" AS TIMESTAMP) AS "change_timestamp",
        CAST("issue_identifier" AS VARCHAR) AS "issue_identifier"
    FROM "issue_field_history_projected_renamed"
)

-- COCOON BLOCK END
SELECT * FROM "issue_field_history_projected_renamed_casted"

stg_issue_field_history.yml (Document the table)

version: 2
models:
- name: stg_issue_field_history
  description: The table is about issue field history. It tracks changes to various
    fields of issues in a tracking system. Each row represents a change, including
    the field ID, issue ID, timestamp, and new value. Fields can be system fields
    like "created" or custom fields. The table allows for tracking the evolution of
    issues over time.
  columns:
  - name: field_identifier
    description: Identifier for the issue field
    tests:
    - not_null
  - name: new_field_value
    description: New value of the changed field
    tests:
    - not_null
  - name: change_timestamp
    description: Timestamp of the field change
    tests:
    - not_null
  - name: issue_identifier
    description: Unique identifier for the issue
    tests:
    - not_null

stg_field_option (first 100 rows)

	field_id	option_name
0	10104	Option 2
1	10016	To Do
2	10106	Option 4
3	10019	Impediment
4	10026	Summary

stg_field_option.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"field_option_projected" AS (
    -- Projection: Selecting 2 out of 3 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "name"
    FROM "field_option"
),

"field_option_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> field_id
    -- name -> option_name
    SELECT 
        "id" AS "field_id",
        "name" AS "option_name"
    FROM "field_option_projected"
),

"field_option_projected_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- option_name: The problem is that the option_name column contains inconsistent values. 'This is the summary of the issue.' is a full sentence among shorter labels, which is unusual. 'opt 2' and 'opt 4' are inconsistently named compared to the other options. The correct values should be concise labels that describe the option or status. 'Impediment' and 'To Do' are already in the correct format. 
    SELECT
        "field_id",
        CASE
            WHEN "option_name" = 'This is the summary of the issue.' THEN 'Summary'
            WHEN "option_name" = 'opt 2' THEN 'Option 2'
            WHEN "option_name" = 'opt 4' THEN 'Option 4'
            ELSE "option_name"
        END AS "option_name"
    FROM "field_option_projected_renamed"
)

-- COCOON BLOCK END
SELECT * FROM "field_option_projected_renamed_cleaned"

stg_field_option.yml (Document the table)

version: 2
models:
- name: stg_field_option
  description: The table is about field options. It contains IDs and names for various
    options. These options appear to be status labels or descriptors, possibly used
    in a task management or issue tracking system. Each option has a unique numeric
    ID and a corresponding text name.
  columns:
  - name: field_id
    description: Unique numeric identifier for each field
    tests:
    - not_null
  - name: option_name
    description: Text label or description of the option
    tests:
    - not_null
    - accepted_values:
        values:
        - Impediment
        - Option 2
        - Option 4
        - Summary
        - To Do
        - In Progress
        - Done
        - Blocked
        - Backlog
        - Ready for Review
        - Under Review
        - Approved
        - Rejected
        - Deferred

stg_issue_link (first 100 rows)

	issue_id	related_issue_id	relationship_type
0	10002	10004	duplicates
1	10004	10006	duplicates
2	10004	10005	duplicates

stg_issue_link.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"issue_link_projected" AS (
    -- Projection: Selecting 3 out of 4 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "issue_id",
        "related_issue_id",
        "relationship"
    FROM "issue_link"
),

"issue_link_projected_renamed" AS (
    -- Rename: Renaming columns
    -- relationship -> relationship_type
    SELECT 
        "issue_id",
        "related_issue_id",
        "relationship" AS "relationship_type"
    FROM "issue_link_projected"
)

-- COCOON BLOCK END
SELECT * FROM "issue_link_projected_renamed"

stg_issue_link.yml (Document the table)

version: 2
models:
- name: stg_issue_link
  description: The table is about issue relationships. It shows how different issues
    are linked. Each row connects two issues with an "issue_id" and a "related_issue_id".
    The "relationship" column describes how they are connected, such as "duplicates".
    This structure allows tracking of issue dependencies and similarities across a
    project or system.
  columns:
  - name: issue_id
    description: Identifier for the primary issue
    tests:
    - not_null
  - name: related_issue_id
    description: Identifier for the related issue
    tests:
    - not_null
  - name: relationship_type
    description: Type of connection between the issues
    tests:
    - not_null
    - accepted_values:
        values:
        - duplicates
        - blocks
        - is blocked by
        - relates to
        - parent of
        - child of
        - clones
        - is cloned by
        - precedes
        - follows
        - causes
        - is caused by
        - similar to
        - depends on
        - is depended on by

stg_issue_multiselect_history (first 100 rows)

	record_id	field_id	issue_id	change_timestamp
0	w4UiT+hPMxJp3RhL/YFJm3uWL5U=	subtasks	10027	2020-11-10 19:19:41.472
1	4pVgGn0qSqR2hCmMdo4wWHXmgew=	customfield_10021	10027	2020-11-10 19:19:41.472
2	/zrY8m6q0VMW6ia1jGIerXqLIgQ=	customfield_10020	10027	2020-11-10 19:19:41.472
3	3p3gGn0qSqR2hCmMdo4wWHXa32m=	components	10027	2020-11-10 19:19:41.472

stg_issue_multiselect_history.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"issue_multiselect_history_projected" AS (
    -- Projection: Selecting 5 out of 6 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "_fivetran_id",
        "time_",
        "field_id",
        "issue_id",
        "value_"
    FROM "issue_multiselect_history"
),

"issue_multiselect_history_projected_renamed" AS (
    -- Rename: Renaming columns
    -- _fivetran_id -> record_id
    -- time_ -> change_timestamp
    -- value_ -> new_field_value
    SELECT 
        "_fivetran_id" AS "record_id",
        "time_" AS "change_timestamp",
        "field_id",
        "issue_id",
        "value_" AS "new_field_value"
    FROM "issue_multiselect_history_projected"
),

"issue_multiselect_history_projected_renamed_casted" AS (
    -- Column Type Casting: 
    -- change_timestamp: from VARCHAR to TIMESTAMP
    -- new_field_value: from INT to VARCHAR
    SELECT
        "record_id",
        "field_id",
        "issue_id",
        CAST("change_timestamp" AS TIMESTAMP) AS "change_timestamp",
        CAST("new_field_value" AS VARCHAR) AS "new_field_value"
    FROM "issue_multiselect_history_projected_renamed"
)

-- COCOON BLOCK END
SELECT * FROM "issue_multiselect_history_projected_renamed_casted"

stg_issue_multiselect_history.yml (Document the table)

version: 2
models:
- name: stg_issue_multiselect_history
  description: The table is about issue field value history. It tracks changes to
    various fields of an issue over time. Each row represents a change, recording
    the field ID, issue ID, new value, and timestamp. The _fivetran_id appears to
    be a unique identifier for each record.
  columns:
  - name: record_id
    description: Unique identifier for each record
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column appears to be a unique identifier for each change record.
        For this table, each row represents a single change to an issue field. The
        record_id seems to be unique across rows, as it's described as a unique identifier
        for each record.
  - name: field_id
    description: Identifier of the issue field that changed
    tests:
    - not_null
  - name: issue_id
    description: Identifier of the issue being tracked
    tests:
    - not_null
  - name: change_timestamp
    description: Timestamp of when the field value changed
    tests:
    - not_null
  - name: new_field_value
    description: New value of the field after the change
    tests:
    - not_null

stg_user_group (first 100 rows)

	group_name	user_identifier
0	jira-software-users	5faf114ad74f69007579568f
1	site-admins	5d577670dbb98c0d9c232bb2
2	confluence-users	557058:a9572f6a-3041-435a-96ed-5286e3811b33

stg_user_group.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"user_group_projected" AS (
    -- Projection: Selecting 2 out of 3 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "group_name",
        "user_id"
    FROM "user_group"
),

"user_group_projected_renamed" AS (
    -- Rename: Renaming columns
    -- user_id -> user_identifier
    SELECT 
        "group_name",
        "user_id" AS "user_identifier"
    FROM "user_group_projected"
)

-- COCOON BLOCK END
SELECT * FROM "user_group_projected_renamed"

stg_user_group.yml (Document the table)

version: 2
models:
- name: stg_user_group
  description: The table represents a many-to-many relationship between user groups
    and users. It shows which users belong to which groups. The 'group_name' column
    contains the names of different user groups. The 'user_id' column holds unique
    identifiers for individual users. This structure allows users to be part of multiple
    groups and groups to have multiple users.
  columns:
  - name: group_name
    description: Name of the user group
    tests:
    - not_null
    - accepted_values:
        values:
        - confluence-users
        - jira-software-users
        - site-admins
        - bitbucket-users
        - bamboo-users
        - fisheye-users
        - crucible-users
        - crowd-users
        - jira-core-users
        - jira-service-desk-users
        - confluence-administrators
        - jira-administrators
        - bitbucket-administrators
        - bamboo-administrators
        - fisheye-administrators
        - crucible-administrators
        - crowd-administrators
        - system-administrators
        - project-administrators
        - read-only-users
        - developers
        - testers
        - product-managers
        - support-team
        - marketing-team
        - sales-team
        - hr-team
  - name: user_identifier
    description: Unique identifier for individual users
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column contains unique identifiers for individual users. For
        this table, each row represents a relationship between a user group and a
        user. The user_identifier is likely to be unique across rows, as it's described
        as a unique identifier for users.

stg_priority (first 100 rows)

	priority_id	priority_impact	priority_level
0	2	Serious problem that could block progress.	High
1	5	Trivial problem with little or no impact on progress.	Lowest
2	1	This problem will block progress.	Highest
3	3	Has the potential to affect progress.	Medium
4	4	Minor problem or easily worked around.	Low

stg_priority.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"priority_projected" AS (
    -- Projection: Selecting 3 out of 4 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "description",
        "name"
    FROM "priority"
),

"priority_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> priority_id
    -- description -> priority_impact
    -- name -> priority_level
    SELECT 
        "id" AS "priority_id",
        "description" AS "priority_impact",
        "name" AS "priority_level"
    FROM "priority_projected"
)

-- COCOON BLOCK END
SELECT * FROM "priority_projected_renamed"

stg_priority.yml (Document the table)

version: 2
models:
- name: stg_priority
  description: The table is about priority levels. It contains five priority levels,
    each with an ID, description, and name. The levels range from Lowest to Highest.
    Descriptions explain the impact of each priority level on progress. The table
    helps categorize and understand the severity of problems or issues.
  columns:
  - name: priority_id
    description: Unique identifier for each priority level
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each priority level.
        For this table, each row corresponds to a distinct priority level. The priority_id
        is unique across rows as it's designed to be an identifier.
  - name: priority_impact
    description: Explains the impact of the priority level
    tests:
    - not_null
    - accepted_values:
        values:
        - Has the potential to affect progress.
        - Minor problem or easily worked around.
        - Serious problem that could block progress.
        - This problem will block progress.
        - Trivial problem with little or no impact on progress.
  - name: priority_level
    description: Short label for the priority level
    tests:
    - not_null
    - accepted_values:
        values:
        - High
        - Highest
        - Low
        - Lowest
        - Medium

stg_resolution (first 100 rows)

	resolution_id	status_description	status_name
0	10000	Work has been completed on this issue.	Done

stg_resolution.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"resolution_projected" AS (
    -- Projection: Selecting 3 out of 4 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "description",
        "name"
    FROM "resolution"
),

"resolution_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> resolution_id
    -- description -> status_description
    -- name -> status_name
    SELECT 
        "id" AS "resolution_id",
        "description" AS "status_description",
        "name" AS "status_name"
    FROM "resolution_projected"
)

-- COCOON BLOCK END
SELECT * FROM "resolution_projected_renamed"

stg_resolution.yml (Document the table)

version: 2
models:
- name: stg_resolution
  description: The table is about resolution statuses for issues or tasks. It contains
    details of different resolution states. Each status has an ID, a description explaining
    what it means, and a short name. The example shows the "Done" status, indicating
    when work on an issue is completed.
  columns:
  - name: resolution_id
    description: Unique identifier for the resolution status
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each resolution status.
        For this table, each row represents a distinct resolution status. The resolution_id
        is designed to be unique across rows, as it's typically used to uniquely identify
        each status.
  - name: status_description
    description: Detailed explanation of the resolution status
    tests:
    - not_null
  - name: status_name
    description: Short name of the resolution status
    tests:
    - not_null
    - accepted_values:
        values:
        - To Do
        - In Progress
        - Done
        - Blocked
        - On Hold
        - Cancelled
        - Rejected
        - Reopened
        - Resolved
        - Closed

stg_issue_type (first 100 rows)

	issue_description	issue_type	is_subtask	issue_type_id
0	A problem or error.	Bug	False	10117
1	A problem or error.	Bug	False	10106
2	Track large pieces of work.	Epic	False	10101

stg_issue_type.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"issue_type_projected" AS (
    -- Projection: Selecting 4 out of 5 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "description",
        "name",
        "subtask"
    FROM "issue_type"
),

"issue_type_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> issue_type_id
    -- description -> issue_description
    -- name -> issue_type
    -- subtask -> is_subtask
    SELECT 
        "id" AS "issue_type_id",
        "description" AS "issue_description",
        "name" AS "issue_type",
        "subtask" AS "is_subtask"
    FROM "issue_type_projected"
),

"issue_type_projected_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- issue_description: The problem is that 'A problem or error.' and 'Bugs track problems or errors.' are redundant representations of the same concept. They both refer to issues or bugs in a system. The correct values should be distinct and represent different types of issues. 'Track large pieces of work.' is a distinct value and doesn't need to be changed. 
    SELECT
        "issue_type_id",
        CASE
            WHEN "issue_description" = 'Bugs track problems or errors.' THEN 'A problem or error.'
            ELSE "issue_description"
        END AS "issue_description",
        "issue_type",
        "is_subtask"
    FROM "issue_type_projected_renamed"
),

"issue_type_projected_renamed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- issue_type_id: from INT to VARCHAR
    SELECT
        "issue_description",
        "issue_type",
        "is_subtask",
        CAST("issue_type_id" AS VARCHAR) AS "issue_type_id"
    FROM "issue_type_projected_renamed_cleaned"
)

-- COCOON BLOCK END
SELECT * FROM "issue_type_projected_renamed_cleaned_casted"

stg_issue_type.yml (Document the table)

version: 2
models:
- name: stg_issue_type
  description: The table is about issue types. It contains details of different issue
    categories. Each issue type has an ID, description, name, and a flag indicating
    if it's a subtask. The table includes examples like Bug and Epic. It provides
    a brief explanation of each issue type's purpose or characteristics.
  columns:
  - name: issue_description
    description: Brief explanation of the issue type
    tests:
    - not_null
    - accepted_values:
        values:
        - A problem or error
        - Track large pieces of work
        - A new feature request
        - An improvement suggestion
        - A task to be completed
        - A question or inquiry
        - Documentation update
        - Security vulnerability
        - Performance issue
        - User interface/experience issue
        - Testing/QA task
        - Maintenance task
        - Integration issue
        - Configuration change
        - Data-related issue
        - A problem or error.
        - Track large pieces of work.
  - name: issue_type
    description: Short name of the issue type
    tests:
    - not_null
    - accepted_values:
        values:
        - Bug
        - Epic
        - Task
        - Story
        - Subtask
        - Improvement
        - New Feature
        - Problem
        - Incident
        - Change Request
        - Defect
        - Risk
        - Spike
  - name: is_subtask
    description: Indicates if the issue type is a subtask
    tests:
    - not_null
  - name: issue_type_id
    description: Unique identifier for the issue type
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each issue type.
        For this table, each row represents a distinct issue type. The issue_type_id
        is likely to be unique across rows as it's described as a "Unique identifier
        for the issue type".

stg_issue (first 100 rows)

	issue_id	is_deleted	creator_id	description	issue_type_id	issue_key	priority_id	project_id	reporter_id	status_id	summary	assignee	created_at	due_date	environment	last_viewed_at	original_estimate	original_estimate_seconds	parent_issue_id	remaining_estimate	remaining_estimate_seconds	resolution_id	resolved_at	status_category_changed_at	time_spent	time_spent_seconds	updated_at	work_ratio
0	10011	False	5c3326c24b248c315badf3ee	Yes, this is an issue to test permissions	10103	TCP-8	3	10001	5c3326c24b248c315badf3ee	3	We want to test permissions	None	2020-04-01 08:33:58	NaT	None	NaT	None	NaN	NaN	None	NaN	NaN	2020-05-29 11:33:53	2020-04-01 08:34:02	None	NaN	2020-05-29 11:33:53	-1.0
1	10015	False	5ed0edbe2999b60c23255b4f	None	10103	TCP-12	5	10001	5ed0edbe2999b60c23255b4f	10003	Part of test epic	None	2020-06-11 07:38:03	NaT	None	NaT	None	NaN	NaN	None	NaN	NaN	2020-06-11 07:38:03	2020-06-11 07:38:03	None	NaN	2020-06-11 07:38:03	-1.0
2	10018	False	557058:a9572f6a-3041-435a-96ed-5286e3811b33	None	10103	TCP-1	3	10005	557058:a9572f6a-3041-435a-96ed-5286e3811b33	10003	this is a story	None	2020-11-10 19:11:45	NaT	None	2020-11-10 19:11:46	None	NaN	NaN	None	NaN	10000.0	2020-11-28 19:11:45	2020-11-28 19:11:46	None	NaN	2020-11-28 19:11:45	-1.0

stg_issue.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"issue_projected" AS (
    -- Projection: Selecting 28 out of 29 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "_fivetran_deleted",
        "_original_estimate",
        "_remaining_estimate",
        "_time_spent",
        "assignee",
        "created",
        "creator",
        "description",
        "due_date",
        "environment",
        "issue_type",
        "key_",
        "last_viewed",
        "original_estimate",
        "parent_id",
        "priority",
        "project",
        "remaining_estimate",
        "reporter",
        "resolution",
        "resolved",
        "status",
        "status_category_changed",
        "summary",
        "time_spent",
        "updated",
        "work_ratio"
    FROM "issue"
),

"issue_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> issue_id
    -- _fivetran_deleted -> is_deleted
    -- _original_estimate -> original_estimate_seconds
    -- _remaining_estimate -> remaining_estimate_seconds
    -- _time_spent -> time_spent_seconds
    -- created -> created_at
    -- creator -> creator_id
    -- issue_type -> issue_type_id
    -- key_ -> issue_key
    -- last_viewed -> last_viewed_at
    -- parent_id -> parent_issue_id
    -- priority -> priority_id
    -- project -> project_id
    -- reporter -> reporter_id
    -- resolution -> resolution_id
    -- resolved -> resolved_at
    -- status -> status_id
    -- status_category_changed -> status_category_changed_at
    -- updated -> updated_at
    SELECT 
        "id" AS "issue_id",
        "_fivetran_deleted" AS "is_deleted",
        "_original_estimate" AS "original_estimate_seconds",
        "_remaining_estimate" AS "remaining_estimate_seconds",
        "_time_spent" AS "time_spent_seconds",
        "assignee",
        "created" AS "created_at",
        "creator" AS "creator_id",
        "description",
        "due_date",
        "environment",
        "issue_type" AS "issue_type_id",
        "key_" AS "issue_key",
        "last_viewed" AS "last_viewed_at",
        "original_estimate",
        "parent_id" AS "parent_issue_id",
        "priority" AS "priority_id",
        "project" AS "project_id",
        "remaining_estimate",
        "reporter" AS "reporter_id",
        "resolution" AS "resolution_id",
        "resolved" AS "resolved_at",
        "status" AS "status_id",
        "status_category_changed" AS "status_category_changed_at",
        "summary",
        "time_spent",
        "updated" AS "updated_at",
        "work_ratio"
    FROM "issue_projected"
),

"issue_projected_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- issue_key: The problem is inconsistent prefixes in the issue_key column. The prefixes 'TCP' and 'TP' are likely meant to represent the same concept but are inconsistently applied. Given that 'TCP' appears in the most frequent value ('TCP-1'), it's likely the intended prefix. The correct values should all use the 'TCP' prefix for consistency. 
    SELECT
        "issue_id",
        "is_deleted",
        "original_estimate_seconds",
        "remaining_estimate_seconds",
        "time_spent_seconds",
        "assignee",
        "created_at",
        "creator_id",
        "description",
        "due_date",
        "environment",
        "issue_type_id",
        CASE
            WHEN "issue_key" = 'TP-12' THEN 'TCP-12'
            WHEN "issue_key" = 'TP-8' THEN 'TCP-8'
            ELSE "issue_key"
        END AS "issue_key",
        "last_viewed_at",
        "original_estimate",
        "parent_issue_id",
        "priority_id",
        "project_id",
        "remaining_estimate",
        "reporter_id",
        "resolution_id",
        "resolved_at",
        "status_id",
        "status_category_changed_at",
        "summary",
        "time_spent",
        "updated_at",
        "work_ratio"
    FROM "issue_projected_renamed"
),

"issue_projected_renamed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- assignee: from DECIMAL to VARCHAR
    -- created_at: from VARCHAR to TIMESTAMP
    -- due_date: from DECIMAL to DATE
    -- environment: from DECIMAL to VARCHAR
    -- last_viewed_at: from VARCHAR to TIMESTAMP
    -- original_estimate: from DECIMAL to VARCHAR
    -- original_estimate_seconds: from DECIMAL to INT
    -- parent_issue_id: from DECIMAL to INT
    -- remaining_estimate: from DECIMAL to VARCHAR
    -- remaining_estimate_seconds: from DECIMAL to INT
    -- resolution_id: from DECIMAL to INT
    -- resolved_at: from VARCHAR to TIMESTAMP
    -- status_category_changed_at: from VARCHAR to TIMESTAMP
    -- time_spent: from DECIMAL to VARCHAR
    -- time_spent_seconds: from DECIMAL to INT
    -- updated_at: from VARCHAR to TIMESTAMP
    -- work_ratio: from INT to DECIMAL
    SELECT
        "issue_id",
        "is_deleted",
        "creator_id",
        "description",
        "issue_type_id",
        "issue_key",
        "priority_id",
        "project_id",
        "reporter_id",
        "status_id",
        "summary",
        CAST("assignee" AS VARCHAR) AS "assignee",
        CAST("created_at" AS TIMESTAMP) AS "created_at",
        CAST("due_date" AS DATE) AS "due_date",
        CAST("environment" AS VARCHAR) AS "environment",
        CAST("last_viewed_at" AS TIMESTAMP) AS "last_viewed_at",
        CAST("original_estimate" AS VARCHAR) AS "original_estimate",
        CAST("original_estimate_seconds" AS INT) AS "original_estimate_seconds",
        CAST("parent_issue_id" AS INT) AS "parent_issue_id",
        CAST("remaining_estimate" AS VARCHAR) AS "remaining_estimate",
        CAST("remaining_estimate_seconds" AS INT) AS "remaining_estimate_seconds",
        CAST("resolution_id" AS INT) AS "resolution_id",
        CAST("resolved_at" AS TIMESTAMP) AS "resolved_at",
        CAST("status_category_changed_at" AS TIMESTAMP) AS "status_category_changed_at",
        CAST("time_spent" AS VARCHAR) AS "time_spent",
        CAST("time_spent_seconds" AS INT) AS "time_spent_seconds",
        CAST("updated_at" AS TIMESTAMP) AS "updated_at",
        CAST("work_ratio" AS DECIMAL) AS "work_ratio"
    FROM "issue_projected_renamed_cleaned"
)

-- COCOON BLOCK END
SELECT * FROM "issue_projected_renamed_cleaned_casted"

stg_issue.yml (Document the table)

version: 2
models:
- name: stg_issue
  description: The table is about issue tracking. It contains details like issue ID,
    type, status, priority, and project. It also includes information on who created
    and assigned the issue, when it was created and updated, and its resolution status.
    The table tracks time estimates and actual time spent on issues.
  columns:
  - name: issue_id
    description: Unique identifier for the issue
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each issue. For this
        table, each row represents a distinct issue, and issue_id is unique across
        rows.
  - name: is_deleted
    description: Indicates if the record was deleted
    tests:
    - not_null
  - name: creator_id
    description: User who created the issue
    tests:
    - not_null
  - name: description
    description: Detailed description of the issue
    tests:
    - not_null
  - name: issue_type_id
    description: Type of the issue (e.g., bug, task)
    tests:
    - not_null
  - name: issue_key
    description: Unique key for the issue
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique key for the issue. It appears to
        be a combination of project code and issue number, which should be unique
        for each issue.
  - name: priority_id
    description: Priority level of the issue
    tests:
    - not_null
  - name: project_id
    description: Project to which the issue belongs
    tests:
    - not_null
  - name: reporter_id
    description: User who reported the issue
    tests:
    - not_null
  - name: status_id
    description: Current status of the issue
    tests:
    - not_null
  - name: summary
    description: Brief summary of the issue
    tests:
    - not_null
  - name: assignee
    description: User assigned to the issue
    cocoon_meta:
      missing_acceptable: Issue may not be assigned to anyone yet
  - name: created_at
    description: Date and time when the issue was created
    tests:
    - not_null
  - name: due_date
    description: Deadline for the issue
    cocoon_meta:
      missing_acceptable: Issue may not have a set deadline
  - name: environment
    description: Environment where the issue occurs
    cocoon_meta:
      missing_acceptable: Issue may not be environment-specific
  - name: last_viewed_at
    description: Date and time when the issue was last viewed
    tests:
    - not_null
  - name: original_estimate
    description: Original time estimate (human-readable format)
    cocoon_meta:
      missing_acceptable: Time estimate may not be required for all issues
  - name: original_estimate_seconds
    description: Original time estimate for the issue
    cocoon_meta:
      missing_acceptable: Time estimate may not be required for all issues
  - name: parent_issue_id
    description: ID of the parent issue
    cocoon_meta:
      missing_acceptable: Issue may not be a sub-task of another issue
  - name: remaining_estimate
    description: Remaining time estimate (human-readable format)
    cocoon_meta:
      missing_acceptable: Time estimate may not be required for all issues
  - name: remaining_estimate_seconds
    description: Remaining time estimate for the issue
    cocoon_meta:
      missing_acceptable: Time estimate may not be required for all issues
  - name: resolution_id
    description: Resolution identifier
    tests:
    - not_null
  - name: resolved_at
    description: Date and time when the issue was resolved
    tests:
    - not_null
  - name: status_category_changed_at
    description: Date and time when status category last changed
    tests:
    - not_null
  - name: time_spent
    description: Time spent on the issue (human-readable format)
    cocoon_meta:
      missing_acceptable: No time may have been logged on the issue
  - name: time_spent_seconds
    description: Actual time spent on the issue
    cocoon_meta:
      missing_acceptable: No time may have been logged on the issue
  - name: updated_at
    description: Date and time when the issue was last updated
    tests:
    - not_null
  - name: work_ratio
    description: Ratio of work done to total work
    tests:
    - not_null

stg_comment (first 100 rows)

	comment_id	comment_text	is_public	issue_id	author_id	creation_timestamp	last_update_author_id	last_update_timestamp
0	10001	Hello	True	10025	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-10 19:19:41.224	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-10 19:19:41.224
1	10013	To Do to In Progress 6 days 22 hours 26 minutes ago In Progress to Done 3 days 16 hours 34 minutes ago	True	10035	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717
2	10012	Joined Sample Sprint 2 7 days 9 hours 10 minutes ago	True	10035	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717

stg_comment.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"comment_projected" AS (
    -- Projection: Selecting 8 out of 9 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "author_id",
        "body",
        "created",
        "is_public",
        "issue_id",
        "update_author_id",
        "updated"
    FROM "comment"
),

"comment_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> comment_id
    -- body -> comment_text
    -- created -> creation_timestamp
    -- update_author_id -> last_update_author_id
    -- updated -> last_update_timestamp
    SELECT 
        "id" AS "comment_id",
        "author_id",
        "body" AS "comment_text",
        "created" AS "creation_timestamp",
        "is_public",
        "issue_id",
        "update_author_id" AS "last_update_author_id",
        "updated" AS "last_update_timestamp"
    FROM "comment_projected"
),

"comment_projected_renamed_casted" AS (
    -- Column Type Casting: 
    -- author_id: from VARCHAR to UUID
    -- creation_timestamp: from VARCHAR to TIMESTAMP
    -- last_update_author_id: from VARCHAR to UUID
    -- last_update_timestamp: from VARCHAR to TIMESTAMP
    SELECT
        "comment_id",
        "comment_text",
        "is_public",
        "issue_id",
        CAST(SPLIT_PART("author_id", ':', 2) AS UUID) AS "author_id",
        CAST("creation_timestamp" AS TIMESTAMP) AS "creation_timestamp",
        CAST(SUBSTRING("last_update_author_id" FROM POSITION(':' IN "last_update_author_id") + 1) AS UUID) AS "last_update_author_id",
        CAST("last_update_timestamp" AS TIMESTAMP) AS "last_update_timestamp"
    FROM "comment_projected_renamed"
)

-- COCOON BLOCK END
SELECT * FROM "comment_projected_renamed_casted"

stg_comment.yml (Document the table)

version: 2
models:
- name: stg_comment
  description: The table is about comments on issues. Each comment has an ID, author,
    body text, creation date, public visibility status, associated issue ID, update
    author, and update date. Comments are linked to specific issues and can be updated.
    The table tracks both the original creation and any subsequent updates to the
    comment.
  columns:
  - name: comment_id
    description: Unique identifier for the comment
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each comment. For
        this table, each row represents a distinct comment, and comment_id is designed
        to be unique across rows.
  - name: comment_text
    description: Text content of the comment
    tests:
    - not_null
  - name: is_public
    description: Visibility status of the comment
    tests:
    - not_null
  - name: issue_id
    description: ID of the associated issue
    tests:
    - not_null
  - name: author_id
    description: ID of the comment author
    tests:
    - not_null
  - name: creation_timestamp
    description: Timestamp of comment creation
    tests:
    - not_null
  - name: last_update_author_id
    description: ID of the author who last updated the comment
    tests:
    - not_null
  - name: last_update_timestamp
    description: Timestamp of the last update
    tests:
    - not_null

stg_component (first 100 rows)

	component_id	component_name	project_id	component_description
0	10004	Component 5	10001	None
1	10003	Component 4	10001	None
2	10002	Component 3	10001	None
3	10019	Component 2	10001	None

stg_component.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"component_projected" AS (
    -- Projection: Selecting 4 out of 5 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "description",
        "name",
        "project_id"
    FROM "component"
),

"component_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> component_id
    -- description -> component_description
    -- name -> component_name
    SELECT 
        "id" AS "component_id",
        "description" AS "component_description",
        "name" AS "component_name",
        "project_id"
    FROM "component_projected"
),

"component_projected_renamed_casted" AS (
    -- Column Type Casting: 
    -- component_description: from DECIMAL to VARCHAR
    SELECT
        "component_id",
        "component_name",
        "project_id",
        CAST("component_description" AS VARCHAR) AS "component_description"
    FROM "component_projected_renamed"
)

-- COCOON BLOCK END
SELECT * FROM "component_projected_renamed_casted"

stg_component.yml (Document the table)

version: 2
models:
- name: stg_component
  description: The table is about components. Each component has an ID, name, and
    description. Components are associated with projects through a project_id. The
    description field appears to be empty for these samples. The table likely represents
    a list of components within a project management or software development system.
  columns:
  - name: component_id
    description: Unique identifier for the component
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each component. For
        this table, each row represents a distinct component, and component_id appears
        to be unique across rows as it's an ID field.
  - name: component_name
    description: Name of the component
    tests:
    - not_null
  - name: project_id
    description: Identifier of the project the component belongs to
    tests:
    - not_null
  - name: component_description
    description: Brief description of the component
    cocoon_meta:
      missing_acceptable: Optional field, not all components require a description.

stg_status (first 100 rows)

	status_description	status_name	category_id	status_id
0	None	In Progress	4	10010
1	This issue is being actively worked on at the moment by the assignee.	In Progress	4	3
2	None	To Do	2	10000

stg_status.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"status_projected" AS (
    -- Projection: Selecting 4 out of 5 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "description",
        "name",
        "status_category_id"
    FROM "status"
),

"status_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> status_id
    -- description -> status_description
    -- name -> status_name
    -- status_category_id -> category_id
    SELECT 
        "id" AS "status_id",
        "description" AS "status_description",
        "name" AS "status_name",
        "status_category_id" AS "category_id"
    FROM "status_projected"
),

"status_projected_renamed_casted" AS (
    -- Column Type Casting: 
    -- status_id: from INT to VARCHAR
    SELECT
        "status_description",
        "status_name",
        "category_id",
        CAST("status_id" AS VARCHAR) AS "status_id"
    FROM "status_projected_renamed"
)

-- COCOON BLOCK END
SELECT * FROM "status_projected_renamed_casted"

stg_status.yml (Document the table)

version: 2
models:
- name: stg_status
  description: The table is about issue statuses. It contains unique identifiers,
    descriptions, names, and category IDs for various statuses. The statuses include
    "In Progress" and "To Do". Some entries have detailed descriptions, while others
    are left blank. Each status is associated with a specific category through the
    status_category_id field.
  columns:
  - name: status_description
    description: Detailed explanation of the status
    cocoon_meta:
      missing_acceptable: Not needed for standard/predefined statuses.
  - name: status_name
    description: Name of the status
    tests:
    - not_null
    - accepted_values:
        values:
        - To Do
        - In Progress
        - Blocked
        - In Review
        - Done
        - Cancelled
  - name: category_id
    description: ID of the category the status belongs to
    tests:
    - not_null
  - name: status_id
    description: Unique identifier for the status
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column is described as a unique identifier for the status.
        For this table, each row represents a unique status. The status_id is designed
        to be unique across all rows, ensuring that each status has a distinct identifier.

stg_user (first 100 rows)

	user_id	language_preference	full_name	time_zone
0	557058:a9572f6a-3041-435a-96ed-5286e3811b33	en_US	Scooby Doo	America/Los_Angeles
1	5ed0edbe2999b60c23255b4f	en_US	Scrappy Cornelius Doo	America/Los_Angeles
2	5c3326c24b248c315badf3ee	en_US	Norville Shaggy Rogers	America/Los_Angeles

stg_user.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"user_projected" AS (
    -- Projection: Selecting 6 out of 7 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "id",
        "email",
        "locale",
        "name",
        "time_zone",
        "username"
    FROM "user"
),

"user_projected_renamed" AS (
    -- Rename: Renaming columns
    -- id -> user_id
    -- locale -> language_preference
    -- name -> full_name
    SELECT 
        "id" AS "user_id",
        "email",
        "locale" AS "language_preference",
        "name" AS "full_name",
        "time_zone",
        "username"
    FROM "user_projected"
),

"user_projected_renamed_casted" AS (
    -- Column Type Casting: 
    -- email: from DECIMAL to VARCHAR
    -- username: from DECIMAL to VARCHAR
    SELECT
        "user_id",
        "language_preference",
        "full_name",
        "time_zone",
        CAST("email" AS VARCHAR) AS "email",
        CAST("username" AS VARCHAR) AS "username"
    FROM "user_projected_renamed"
),

"user_projected_renamed_casted_missing_handled" AS (
    -- Handling missing values: There are 2 columns with unacceptable missing values
    -- email has 100.0 percent missing. Strategy: 🗑️ Drop Column
    -- username has 100.0 percent missing. Strategy: 🗑️ Drop Column
    SELECT
        "user_id",
        "language_preference",
        "full_name",
        "time_zone"
    FROM "user_projected_renamed_casted"
)

-- COCOON BLOCK END
SELECT * FROM "user_projected_renamed_casted_missing_handled"

stg_user.yml (Document the table)

version: 2
models:
- name: stg_user
  description: The table is about user profiles. It contains unique identifiers, email
    addresses, language preferences, full names, time zones, and usernames. The samples
    show fictional characters like Scooby Doo and Shaggy Rogers. Some fields are empty,
    suggesting optional information. The table likely represents user account data
    for a system or application.
  columns:
  - name: user_id
    description: Unique identifier for the user account
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each user account.
        For this table, each row represents a distinct user profile. The user_id appears
        to be unique across rows, as it contains both numeric and UUID-style identifiers.
  - name: language_preference
    description: User's language and region preference
    tests:
    - not_null
  - name: full_name
    description: User's full name
    tests:
    - not_null
  - name: time_zone
    description: User's preferred time zone
    tests:
    - not_null

stg_project_board (first 100 rows)

	board_id	project_id
0	7	10006
1	5	10004

stg_project_board.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
WITH 
"project_board_projected" AS (
    -- Projection: Selecting 2 out of 3 columns
    -- Columns projected out: ['_fivetran_synced']
    SELECT 
        "board_id",
        "project_id"
    FROM "project_board"
)

-- COCOON BLOCK END
SELECT * FROM "project_board_projected"

stg_project_board.yml (Document the table)

version: 2
models:
- name: stg_project_board
  description: The table 'project_board' represents a relation between project boards
    and projects. It links each board to a specific project. The 'board_id' column
    identifies unique boards. The 'project_id' column associates each board with a
    particular project. This structure allows multiple boards to be organized within
    different projects.
  columns:
  - name: board_id
    description: Unique identifier for each project board
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column represents a unique identifier for each project board.
        For this table, each row represents a distinct project board. The board_id
        is unique across rows as it's described as a unique identifier.
  - name: project_id
    description: Identifier of the project associated with the board
    tests:
    - not_null

Some tables log change events, which may be redundant to query. Instead, we take a snapshot of the latest.

snapshot_issue_multiselect_history (first 100 rows)

	record_id	field_id	issue_id
0	/zrY8m6q0VMW6ia1jGIerXqLIgQ=	customfield_10020	10027
1	w4UiT+hPMxJp3RhL/YFJm3uWL5U=	subtasks	10027
2	4pVgGn0qSqR2hCmMdo4wWHXmgew=	customfield_10021	10027
3	3p3gGn0qSqR2hCmMdo4wWHXa32m=	components	10027

snapshot_issue_multiselect_history.sql (clean the table)

-- Slowly Changing Dimension: Dimension keys are "issue_id", "field_id"
-- Effective date columns are "change_timestamp"
-- We will create Type 1 SCD (latest snapshot)
SELECT 
    "record_id",
    "field_id",
    "issue_id",
    "new_field_value"
FROM (
     SELECT 
            "record_id",
            "field_id",
            "issue_id",
            "new_field_value",
            ROW_NUMBER() OVER (
                PARTITION BY "issue_id", "field_id" 
                ORDER BY "change_timestamp" 
            DESC) AS "cocoon_rn"
    FROM "stg_issue_multiselect_history"
) ranked
WHERE "cocoon_rn" = 1

snapshot_issue_multiselect_history.yml (Document the table)

version: 2
models:
- name: snapshot_issue_multiselect_history
  description: The table is about current issue field values. It tracks the most recent
    state of various fields for each issue. Each row represents the latest value for
    a specific field of an issue. The table includes the issue ID, field ID, and the
    current field value. It provides a snapshot of the current state of issues without
    historical changes or timestamps.
  columns:
  - name: record_id
    description: Unique identifier for each record
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: This column appears to be a unique identifier for each change record.
        For this table, each row represents a single change to an issue field. The
        record_id seems to be unique across rows, as it's described as a unique identifier
        for each record.
  - name: field_id
    description: Identifier of the issue field that changed
    tests:
    - not_null
  - name: issue_id
    description: Identifier of the issue being tracked
    tests:
    - not_null
  - name: new_field_value
    description: New value of the field after the change
    tests:
    - not_null
cocoon_meta:
  scd_base_table: stg_issue_multiselect_history

snapshot_comment (first 100 rows)

	comment_id	comment_text	is_public	issue_id	author_id	creation_timestamp	last_update_author_id
0	10012	Joined Sample Sprint 2 7 days 9 hours 10 minutes ago	True	10035	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717	a9572f6a-3041-435a-96ed-5286e3811b33
1	10001	Hello	True	10025	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-10 19:19:41.224	a9572f6a-3041-435a-96ed-5286e3811b33
2	10013	To Do to In Progress 6 days 22 hours 26 minutes ago In Progress to Done 3 days 16 hours 34 minutes ago	True	10035	a9572f6a-3041-435a-96ed-5286e3811b33	2020-11-07 02:45:38.717	a9572f6a-3041-435a-96ed-5286e3811b33

snapshot_comment.sql (clean the table)

-- Slowly Changing Dimension: Dimension keys are "comment_id"
-- Effective date columns are "last_update_timestamp"
-- We will create Type 1 SCD (latest snapshot)
SELECT 
    "comment_id",
    "comment_text",
    "is_public",
    "issue_id",
    "author_id",
    "creation_timestamp",
    "last_update_author_id"
FROM (
     SELECT 
            "comment_id",
            "comment_text",
            "is_public",
            "issue_id",
            "author_id",
            "creation_timestamp",
            "last_update_author_id",
            ROW_NUMBER() OVER (
                PARTITION BY "comment_id" 
                ORDER BY "last_update_timestamp" 
            DESC) AS "cocoon_rn"
    FROM "stg_comment"
) ranked
WHERE "cocoon_rn" = 1

snapshot_comment.yml (Document the table)

version: 2
models:
- name: snapshot_comment
  description: The table is about the latest version of comments on issues. It tracks
    the most recent comment text, public visibility status, associated issue ID, original
    author, and creation timestamp for each unique comment. Each comment is linked
    to a specific issue and author. The table provides a current snapshot of all comments
    without version history.
  columns:
  - name: comment_id
    description: Unique identifier for the comment
    tests:
    - not_null
    - unique
    cocoon_meta:
      uniqueness: Unique dimension key, derived from the slowly changing dimension
  - name: comment_text
    description: Text content of the comment
    tests:
    - not_null
  - name: is_public
    description: Visibility status of the comment
    tests:
    - not_null
  - name: issue_id
    description: ID of the associated issue
    tests:
    - not_null
  - name: author_id
    description: ID of the comment author
    tests:
    - not_null
  - name: creation_timestamp
    description: Timestamp of comment creation
    tests:
    - not_null
  - name: last_update_author_id
    description: ID of the author who last updated the comment
    tests:
    - not_null
cocoon_meta:
  scd_base_table: stg_comment

We identify the primary key (PK) and foreign key (FK) from tables. We build a join graph that connects FK to PK.

Join Graph (FK to PK)

cocoon_join.yml (Document the joins)

join_graph:
- table_name: stg_field
  primary_key: field_id
  foreign_keys: []
- table_name: stg_field_option
  foreign_keys:
  - column: field_id
    reference:
      table_name: stg_field
      column: field_id
- table_name: stg_issue_field_history
  foreign_keys:
  - column: field_identifier
    reference:
      table_name: stg_field
      column: field_id
  - column: issue_identifier
    reference:
      table_name: stg_issue
      column: issue_id
- table_name: stg_issue
  primary_key: issue_id
  foreign_keys:
  - column: issue_type_id
    reference:
      table_name: stg_issue_type
      column: issue_type_id
  - column: priority_id
    reference:
      table_name: stg_priority
      column: priority_id
  - column: project_id
    reference:
      table_name: stg_project
      column: project_id
  - column: resolution_id
    reference:
      table_name: stg_resolution
      column: resolution_id
  - column: status_id
    reference:
      table_name: stg_status
      column: status_id
  - column: creator_id
    reference:
      table_name: stg_user
      column: user_id
  - column: reporter_id
    reference:
      table_name: stg_user
      column: user_id
- table_name: stg_issue_link
  foreign_keys:
  - column: issue_id
    reference:
      table_name: stg_issue
      column: issue_id
  - column: related_issue_id
    reference:
      table_name: stg_issue
      column: issue_id
- table_name: snapshot_comment
  foreign_keys:
  - column: issue_id
    reference:
      table_name: stg_issue
      column: issue_id
  - column: author_id
    reference:
      table_name: stg_user
      column: user_id
  - column: last_update_author_id
    reference:
      table_name: stg_user
      column: user_id
- table_name: snapshot_issue_multiselect_history
  foreign_keys:
  - column: issue_id
    reference:
      table_name: stg_issue
      column: issue_id
- table_name: stg_issue_type
  primary_key: issue_type_id
  foreign_keys: []
- table_name: stg_priority
  primary_key: priority_id
  foreign_keys: []
- table_name: stg_project
  primary_key: project_id
  foreign_keys:
  - column: category_id
    reference:
      table_name: stg_project_category
      column: category_id
- table_name: stg_component
  foreign_keys:
  - column: project_id
    reference:
      table_name: stg_project
      column: project_id
- table_name: stg_project_board
  foreign_keys:
  - column: project_id
    reference:
      table_name: stg_project
      column: project_id
  primary_key: board_id
- table_name: stg_version
  foreign_keys:
  - column: project_id
    reference:
      table_name: stg_project
      column: project_id
- table_name: stg_sprint
  foreign_keys:
  - column: board_id
    reference:
      table_name: stg_project_board
      column: board_id
- table_name: stg_project_category
  primary_key: category_id
  foreign_keys: []
- table_name: stg_status
  foreign_keys:
  - column: category_id
    reference:
      table_name: stg_project_category
      column: category_id
  primary_key: status_id
- table_name: stg_resolution
  primary_key: resolution_id
  foreign_keys: []
- table_name: stg_status_category
  foreign_keys:
  - column: status_id
    reference:
      table_name: stg_status
      column: status_id
- table_name: stg_user
  primary_key: user_id
  foreign_keys: []

We identify the entities and relationships behind the tables, and tell the story among these relationships.

Prev Next