We display the source tables from the data warehouses to model.

customer

Table Preview(first 5 rows)

C_CUSTKEY C_NAME C_ADDRESS C_CITY C_NATION C_REGION C_PHONE C_MKTSEGMENT
0 1 Customer#000000001 j5JsirBM9P MOROCCO 4 MOROCCO AFRICA 25-214-662-5265 BUILDING
1 2 Customer#000000002 487LW1dovn6Q4dMVym JORDAN 5 JORDAN MIDDLE EAST 23-159-880-1131 AUTOMOBILE
2 4 Customer#000000004 4u58h f EGYPT 6 EGYPT MIDDLE EAST 14-108-116-4678 MACHINERY
3 5 Customer#000000005 hwBtxkoBF qSW4KrI CANADA 2 CANADA AMERICA 13-360-101-6891 HOUSEHOLD
4 7 Customer#000000007 8OkMVLQ1dK6Mbu6WG9 CHINA 3 CHINA ASIA 28-777-614-8279 AUTOMOBILE

date

Table Preview(first 5 rows)

D_DATEKEY D_DATE D_DAYOFWEEK D_MONTH D_YEAR D_YEARMONTHNUM D_YEARMONTH D_DAYNUMINWEEK D_DAYNUMINMONTH D_DAYNUMINYEAR D_MONTHNUMINYEAR D_WEEKNUMINYEAR D_SELLINGSEASON D_LASTDAYINWEEKFL D_LASTDAYINMONTHFL D_HOLIDAYFL D_WEEKDAYFL
0 19920101 January 1, 1992 Thursday January 1992 199201 Jan1992 5 1 1 1 1 Winter 0 1 1 1
1 19920102 January 2, 1992 Friday January 1992 199201 Jan1992 6 2 2 1 1 Winter 0 1 0 1
2 19920103 January 3, 1992 Saturday January 1992 199201 Jan1992 7 3 3 1 1 Winter 1 1 0 0
3 19920104 January 4, 1992 Sunday January 1992 199201 Jan1992 1 4 4 1 1 Winter 0 1 0 0
4 19920105 January 5, 1992 Monday January 1992 199201 Jan1992 2 5 5 1 1 Winter 0 1 0 1

lineorder

Table Preview(first 5 rows)

LO_ORDERKEY LO_LINENUMBER LO_CUSTKEY LO_PARTKEY LO_SUPPKEY LO_ORDERDATE LO_ORDERPRIORITY LO_SHIPPRIORITY LO_QUANTITY LO_EXTENDEDPRICE LO_ORDTOTALPRICE LO_DISCOUNT LO_REVENUE LO_SUPPLYCOST LO_TAX LO_COMMITDATE LO_SHIPMODE
0 277316 7 19190 99546 1170 19950228 4-NOT SPECIFIED 0 13 2009202 23455557 0 2009202 92732 3 19950525 REG AIR
1 5929219 4 28511 48214 1427 19980702 1-URGENT 0 14 1627094 20688659 4 1562010 69732 0 19980922 SHIP
2 3693318 4 15811 100325 1170 19920807 2-HIGH 0 4 530128 19688727 9 482416 79519 0 19921013 FOB
3 2765669 2 2258 126666 35 19920504 4-NOT SPECIFIED 0 15 2538990 3523267 8 2335870 101559 8 19920703 MAIL
4 686017 1 23836 172615 804 19951014 1-URGENT 0 17 2868937 4014677 7 2668111 101256 8 19951213 TRUCK

part

Table Preview(first 5 rows)

P_PARTKEY P_NAME P_MFGR P_CATEGORY P_BRAND1 P_COLOR P_TYPE P_SIZE P_CONTAINER
0 10 moccasin royal MFGR#2 MFGR#21 MFGR#2128 floral LARGE BURNISHED STEEL 44 LG CAN
1 15 dark sky MFGR#3 MFGR#34 MFGR#3438 navajo LARGE ANODIZED BRASS 45 LG CASE
2 16 brown turquoise MFGR#5 MFGR#54 MFGR#5415 deep PROMO PLATED TIN 2 MED PACK
3 22 floral beige MFGR#4 MFGR#44 MFGR#4421 medium PROMO POLISHED BRASS 19 LG DRUM
4 25 forest magenta MFGR#2 MFGR#23 MFGR#2336 blush STANDARD BRUSHED COPPER 3 JUMBO BAG

supplier

Table Preview(first 5 rows)

S_SUPPKEY S_NAME S_ADDRESS S_CITY S_NATION S_REGION S_PHONE
0 1 Supplier#000000001 sdrGnXCDRcfriBvY0KL,i PERU 7 PERU AMERICA 27-989-741-2988
1 2 Supplier#000000002 TRMhVHz3XiFu ETHIOPIA 2 ETHIOPIA AFRICA 15-768-687-3665
2 3 Supplier#000000003 BZ0kXcHUcHjx62L7CjZS ARGENTINA8 ARGENTINA AMERICA 11-719-748-3364
3 4 Supplier#000000004 qGTQJXogS83a7MB MOROCCO 5 MOROCCO AFRICA 25-128-190-5944
4 5 Supplier#000000005 lONEYAh9sF IRAQ 1 IRAQ MIDDLE EAST 21-750-942-6364
We display the stage tables, their SQL transformations, and their configurations.

stg_date

Table Preview (first 5 rows)

day_of_week month_name year_ year_month_abbr day_number_in_week day_number_in_month day_number_in_year month_number week_number selling_season date_key full_date is_holiday is_last_day_of_month is_last_day_of_week is_weekday year_month_numeric
0 Thursday January 1992 Jan1992 5 1 1 1 1 Winter 19920101 1992-01-01 True True False True 199201
1 Friday January 1992 Jan1992 6 2 2 1 1 Winter 19920102 1992-01-02 False True False True 199201
2 Saturday January 1992 Jan1992 7 3 3 1 1 Winter 19920103 1992-01-03 False True True False 199201
3 Sunday January 1992 Jan1992 1 4 4 1 1 Winter 19920104 1992-01-04 False True False False 199201
4 Monday January 1992 Jan1992 2 5 5 1 1 Winter 19920105 1992-01-05 False True False True 199201

stg_date.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:48:59.029137+00:00
WITH 
"date_renamed" AS (
    -- Rename: Renaming columns
    -- D_DATEKEY -> date_key
    -- D_DATE -> full_date
    -- D_DAYOFWEEK -> day_of_week
    -- D_MONTH -> month_name
    -- D_YEAR -> year_
    -- D_YEARMONTHNUM -> year_month_numeric
    -- D_YEARMONTH -> year_month_abbr
    -- D_DAYNUMINWEEK -> day_number_in_week
    -- D_DAYNUMINMONTH -> day_number_in_month
    -- D_DAYNUMINYEAR -> day_number_in_year
    -- D_MONTHNUMINYEAR -> month_number
    -- D_WEEKNUMINYEAR -> week_number
    -- D_SELLINGSEASON -> selling_season
    -- D_LASTDAYINWEEKFL -> is_last_day_of_week
    -- D_LASTDAYINMONTHFL -> is_last_day_of_month
    -- D_HOLIDAYFL -> is_holiday
    -- D_WEEKDAYFL -> is_weekday
    SELECT 
        "D_DATEKEY" AS "date_key",
        "D_DATE" AS "full_date",
        "D_DAYOFWEEK" AS "day_of_week",
        "D_MONTH" AS "month_name",
        "D_YEAR" AS "year_",
        "D_YEARMONTHNUM" AS "year_month_numeric",
        "D_YEARMONTH" AS "year_month_abbr",
        "D_DAYNUMINWEEK" AS "day_number_in_week",
        "D_DAYNUMINMONTH" AS "day_number_in_month",
        "D_DAYNUMINYEAR" AS "day_number_in_year",
        "D_MONTHNUMINYEAR" AS "month_number",
        "D_WEEKNUMINYEAR" AS "week_number",
        "D_SELLINGSEASON" AS "selling_season",
        "D_LASTDAYINWEEKFL" AS "is_last_day_of_week",
        "D_LASTDAYINMONTHFL" AS "is_last_day_of_month",
        "D_HOLIDAYFL" AS "is_holiday",
        "D_WEEKDAYFL" AS "is_weekday"
    FROM "memory"."main"."date"
),

"date_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- month_name: The problem is that there are two misspelled month names in the dataset. 'Augest' is a misspelling of 'August', and 'Octorber' is a misspelling of 'October'. These are typos that need to be corrected to ensure accurate data representation and analysis. The correct values should be the proper spellings of these month names.
    SELECT
        "date_key",
        "full_date",
        "day_of_week",
        CASE
            WHEN "month_name" = 'Augest' THEN 'August'
            WHEN "month_name" = 'Octorber' THEN 'October'
            ELSE "month_name"
        END AS "month_name",
        "year_",
        "year_month_numeric",
        "year_month_abbr",
        "day_number_in_week",
        "day_number_in_month",
        "day_number_in_year",
        "month_number",
        "week_number",
        "selling_season",
        "is_last_day_of_week",
        "is_last_day_of_month",
        "is_holiday",
        "is_weekday"
    FROM "date_renamed"
),

"date_renamed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- date_key: from INT to VARCHAR
    -- full_date: from VARCHAR to DATE
    -- is_holiday: from INT to BOOLEAN
    -- is_last_day_of_month: from INT to BOOLEAN
    -- is_last_day_of_week: from INT to BOOLEAN
    -- is_weekday: from INT to BOOLEAN
    -- year_month_numeric: from INT to VARCHAR
    SELECT
        "day_of_week",
        "month_name",
        "year_",
        "year_month_abbr",
        "day_number_in_week",
        "day_number_in_month",
        "day_number_in_year",
        "month_number",
        "week_number",
        "selling_season",
        CAST("date_key" AS VARCHAR) 
        AS "date_key",
        CASE
            WHEN regexp_full_match("full_date", 'April \d{1,2}, \d{4}') THEN CAST(strptime("full_date", '%B %-d, %Y') AS DATE)
            WHEN regexp_full_match("full_date", 'Augest \d{1,2}, \d{4}') THEN CAST(strptime(replace("full_date", 'Augest', 'August'), '%B %-d, %Y') AS DATE)
            WHEN regexp_full_match("full_date", 'December \d{1,2}, \d{4}') THEN CAST(strptime("full_date", '%B %-d, %Y') AS DATE)
            WHEN regexp_full_match("full_date", '(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}') THEN CAST(strptime("full_date", '%B %d, %Y') AS DATE)
            WHEN regexp_full_match("full_date", 'Octorber \d{1,2}, \d{4}') THEN CAST(strptime(replace("full_date", 'Octorber', 'October'), '%B %d, %Y') AS DATE)
        END 
        AS "full_date",
        CAST("is_holiday" AS BOOLEAN) 
        AS "is_holiday",
        CAST("is_last_day_of_month" AS BOOLEAN) 
        AS "is_last_day_of_month",
        CAST("is_last_day_of_week" AS BOOLEAN) 
        AS "is_last_day_of_week",
        CAST("is_weekday" AS BOOLEAN) 
        AS "is_weekday",
        CAST("year_month_numeric" AS VARCHAR) 
        AS "year_month_numeric"
    FROM "date_renamed_cleaned"
)

-- COCOON BLOCK END
SELECT *
FROM "date_renamed_cleaned_casted"

stg_date.yml (Document the table)

version: 2
models:
- name: stg_date
  description: The table is about calendar dates. It contains various attributes of
    dates including the date key, full date, day of week, month, year, and numeric
    representations. It also includes flags for special days like holidays, last day
    of week/month, and selling seasons. The table provides a comprehensive breakdown
    of date information for data analysis and reporting purposes.
  columns:
  - name: day_of_week
    description: Name of the day of the week
    tests:
    - not_null
    - accepted_values:
        values:
        - Saturday
        - Thursday
        - Friday
        - Monday
        - Sunday
        - Tuesday
        - Wednesday
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: month_name
    description: Name of the month
    tests:
    - not_null
    - accepted_values:
        values:
        - August
        - January
        - July
        - March
        - May
        - October
        - April
        - June
        - September
        - February
        - December
        - November
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: year_
    description: Year in YYYY format
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: year_month_abbr
    description: Year and month abbreviation
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: day_number_in_week
    description: Day number within the week (1-7)
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: day_number_in_month
    description: Day number within the month
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: day_number_in_year
    description: Day number within the year
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: month_number
    description: Month number within the year (1-12)
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: week_number
    description: Week number within the year
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: selling_season
    description: Current selling season (e.g., Winter)
    tests:
    - not_null
    - accepted_values:
        values:
        - Summer
        - Winter
        - Fall
        - Christmas
        - Spring
    cocoon_meta:
      future_accepted_values:
      - Back to School
      - Halloween
      - Valentine's Day
      - Easter
      data_type:
        current_data_type: VARCHAR
  - name: date_key
    description: Unique identifier for each date
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column is a unique identifier for each date. For this table,
        each row represents a specific date. The date_key appears to be a concatenation
        of year, month, and day, which would be unique across rows.
      data_type:
        current_data_type: VARCHAR
  - name: full_date
    description: Full date in 'Month Day, Year' format
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents the full date in 'YYYY-MM-DD' format.
        For this table, each row represents a specific date. The full_date would be
        unique across rows as it represents a distinct calendar date.
      data_type:
        current_data_type: DATE
  - name: is_holiday
    description: Flag for holiday
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: BOOLEAN
  - name: is_last_day_of_month
    description: Flag for last day of the month
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: BOOLEAN
  - name: is_last_day_of_week
    description: Flag for last day of the week
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: BOOLEAN
  - name: is_weekday
    description: Flag for weekday (Monday-Friday)
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: BOOLEAN
  - name: year_month_numeric
    description: Year and month in YYYYMM format
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_part

Table Preview (first 5 rows)

part_id part_name manufacturer_code category_code brand_code color type_and_material size_ container_type
0 10 moccasin royal MFGR#2 MFGR#21 MFGR#2128 floral LARGE BURNISHED STEEL 44 LG CAN
1 15 dark sky MFGR#3 MFGR#34 MFGR#3438 navajo LARGE ANODIZED BRASS 45 LG CASE
2 16 brown turquoise MFGR#5 MFGR#54 MFGR#5415 deep PROMO PLATED TIN 2 MED PACK
3 22 floral beige MFGR#4 MFGR#44 MFGR#4421 medium PROMO POLISHED BRASS 19 LG DRUM
4 25 forest magenta MFGR#2 MFGR#23 MFGR#2336 blush STANDARD BRUSHED COPPER 3 JUMBO BAG

stg_part.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:51:15.538935+00:00
WITH 
"part_renamed" AS (
    -- Rename: Renaming columns
    -- P_PARTKEY -> part_id
    -- P_NAME -> part_name
    -- P_MFGR -> manufacturer_code
    -- P_CATEGORY -> category_code
    -- P_BRAND1 -> brand_code
    -- P_COLOR -> color
    -- P_TYPE -> type_and_material
    -- P_SIZE -> size_
    -- P_CONTAINER -> container_type
    SELECT 
        "P_PARTKEY" AS "part_id",
        "P_NAME" AS "part_name",
        "P_MFGR" AS "manufacturer_code",
        "P_CATEGORY" AS "category_code",
        "P_BRAND1" AS "brand_code",
        "P_COLOR" AS "color",
        "P_TYPE" AS "type_and_material",
        "P_SIZE" AS "size_",
        "P_CONTAINER" AS "container_type"
    FROM "memory"."main"."part"
)

-- COCOON BLOCK END
SELECT *
FROM "part_renamed"

stg_part.yml (Document the table)

version: 2
models:
- name: stg_part
  description: The table is about parts. It contains details of various parts including
    their unique key, name, manufacturer, category, brand, color, type, size, and
    container. Each row represents a different part with its specific attributes.
    The table seems to be a catalog or inventory of parts with their characteristics
    and classifications.
  columns:
  - name: part_id
    description: Unique identifier for the part
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each part. For
        this table, each row represents a different part, and part_id appears to be
        unique across rows.
      data_type:
        current_data_type: INT
  - name: part_name
    description: Descriptive name of the part
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: ^[a-z]+ [a-z]+$
        summary: Two color names or descriptors separated by a space
      data_type:
        current_data_type: VARCHAR
  - name: manufacturer_code
    description: Manufacturer code of the part
    tests:
    - not_null
    - accepted_values:
        values:
        - MFGR#2
        - MFGR#4
        - MFGR#1
        - MFGR#5
        - MFGR#3
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: category_code
    description: Category code of the part
    tests:
    - not_null
    - accepted_values:
        values:
        - MFGR#23
        - MFGR#41
        - MFGR#14
        - MFGR#11
        - MFGR#24
        - MFGR#54
        - MFGR#53
        - MFGR#22
        - MFGR#45
        - MFGR#32
        - MFGR#42
        - MFGR#34
        - MFGR#51
        - MFGR#13
        - MFGR#31
        - MFGR#25
        - MFGR#21
        - MFGR#44
        - MFGR#43
        - MFGR#33
        - MFGR#52
        - MFGR#55
        - MFGR#35
        - MFGR#12
        - MFGR#15
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: brand_code
    description: Brand code of the part
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: MFGR#\d{3,4}
        summary: Manufacturer codes starting with "MFGR#" followed by 3 or 4 digits
      data_type:
        current_data_type: VARCHAR
  - name: color
    description: Color of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: type_and_material
    description: Type and material of the part
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: (ECONOMY|MEDIUM|SMALL|LARGE|PROMO|STANDARD) (ANODIZED|POLISHED|PLATED|BURNISHED|BRUSHED)
          (STEEL|NICKEL|BRASS|COPPER|TIN)
        summary: Product type, treatment, and material in all caps
      data_type:
        current_data_type: VARCHAR
  - name: size_
    description: Size of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: container_type
    description: Container type for the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_customer

Table Preview (first 5 rows)

customer_id customer_name city country region phone_number market_segment street_address
0 1 Customer#000000001 MOROCCO MOROCCO AFRICA 25-214-662-5265 BUILDING j5JsirBM9P
1 2 Customer#000000002 JORDAN 5 JORDAN MIDDLE EAST 23-159-880-1131 AUTOMOBILE 487LW1dovn6Q4dMVym
2 4 Customer#000000004 EGYPT EGYPT MIDDLE EAST 14-108-116-4678 MACHINERY 4u58h f
3 5 Customer#000000005 CANADA 2 CANADA AMERICA 13-360-101-6891 HOUSEHOLD hwBtxkoBF qSW4KrI
4 7 Customer#000000007 CHINA CHINA ASIA 28-777-614-8279 AUTOMOBILE 8OkMVLQ1dK6Mbu6WG9

stg_customer.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:47:19.528883+00:00
WITH 
"customer_renamed" AS (
    -- Rename: Renaming columns
    -- C_CUSTKEY -> customer_id
    -- C_NAME -> customer_name
    -- C_ADDRESS -> street_address
    -- C_CITY -> city
    -- C_NATION -> country
    -- C_REGION -> region
    -- C_PHONE -> phone_number
    -- C_MKTSEGMENT -> market_segment
    SELECT 
        "C_CUSTKEY" AS "customer_id",
        "C_NAME" AS "customer_name",
        "C_ADDRESS" AS "street_address",
        "C_CITY" AS "city",
        "C_NATION" AS "country",
        "C_REGION" AS "region",
        "C_PHONE" AS "phone_number",
        "C_MKTSEGMENT" AS "market_segment"
    FROM "memory"."main"."customer"
),

"customer_renamed_trimmed" AS (
    -- Trim Leading and Trailing Spaces
    SELECT
        "customer_id",
        "customer_name",
        "city",
        "country",
        "region",
        "phone_number",
        "market_segment",
        TRIM("street_address") AS "street_address"
    FROM "customer_renamed"
),

"customer_renamed_trimmed_cleaned" AS (
    -- Clean unusual string values: 
    -- city: The problem is that the city column contains country names followed by numbers, with inconsistent spacing and truncation. The country names should be standardized, and the numbers (which likely represent some kind of code or index) should be separated. "UNITED ST" is truncated and should be "UNITED STATES". "UNITED KI" should be "UNITED KINGDOM". "SAUDI ARA" should be "SAUDI ARABIA". The correct values should have consistent spacing between the country name and the number, with the country name in full.
    -- street_address: Values starting with commas are unusual. Random strings like '2z1 vZUo2qjaxg7qJouIBb' are not typical for street addresses.
    SELECT
        "customer_id",
        "customer_name",
        CASE
            WHEN "city" = 'UNITED ST4' THEN 'UNITED STATES 4'
            WHEN "city" = 'UNITED ST0' THEN 'UNITED STATES 0'
            WHEN "city" = 'UNITED ST1' THEN 'UNITED STATES 1'
            WHEN "city" = 'UNITED KI5' THEN 'UNITED KINGDOM 5'
            WHEN "city" = 'SAUDI ARA7' THEN 'SAUDI ARABIA 7'
            WHEN "city" = 'ARGENTINA4' THEN 'ARGENTINA 4'
            WHEN "city" = 'ARGENTINA5' THEN 'ARGENTINA 5'
            WHEN "city" = 'MOZAMBIQU7' THEN 'MOZAMBIQUE 7'
            WHEN "city" = 'MOZAMBIQU9' THEN 'MOZAMBIQUE 9'
            WHEN "city" = 'RUSSIA   5' THEN 'RUSSIA 5'
            WHEN "city" = 'RUSSIA   6' THEN 'RUSSIA 6'
            WHEN "city" = 'RUSSIA   3' THEN 'RUSSIA 3'
            WHEN "city" = 'MOROCCO  5' THEN 'MOROCCO 5'
            WHEN "city" = 'FRANCE   8' THEN 'FRANCE 8'
            WHEN "city" = 'JAPAN    3' THEN 'JAPAN 3'
            WHEN "city" = 'ALGERIA  8' THEN 'ALGERIA 8'
            WHEN "city" = 'ALGERIA  0' THEN 'ALGERIA 0'
            WHEN "city" = 'ALGERIA  5' THEN 'ALGERIA 5'
            WHEN "city" = 'CHINA    5' THEN 'CHINA 5'
            WHEN "city" = 'EGYPT    7' THEN 'EGYPT 7'
            WHEN "city" = 'EGYPT    1' THEN 'EGYPT 1'
            WHEN "city" = 'EGYPT    3' THEN 'EGYPT 3'
            WHEN "city" = 'JORDAN   5' THEN 'JORDAN 5'
            WHEN "city" = 'JORDAN   8' THEN 'JORDAN 8'
            WHEN "city" = 'KENYA    2' THEN 'KENYA 2'
            WHEN "city" = 'KENYA    9' THEN 'KENYA 9'
            WHEN "city" = 'PERU     8' THEN 'PERU 8'
            WHEN "city" = 'CANADA   2' THEN 'CANADA 2'
            WHEN "city" = 'CANADA   4' THEN 'CANADA 4'
            WHEN "city" = 'CANADA   9' THEN 'CANADA 9'
            WHEN "city" = 'CANADA   0' THEN 'CANADA 0'
            WHEN "city" = 'VIETNAM  0' THEN 'VIETNAM 0'
            WHEN "city" = 'GERMANY  0' THEN 'GERMANY 0'
            WHEN "city" = 'GERMANY  9' THEN 'GERMANY 9'
            WHEN "city" = 'GERMANY  4' THEN 'GERMANY 4'
            WHEN "city" = 'IRAN     4' THEN 'IRAN 4'
            WHEN "city" = 'IRAN     6' THEN 'IRAN 6'
            WHEN "city" = 'IRAN     7' THEN 'IRAN 7'
            WHEN "city" = 'IRAQ     4' THEN 'IRAQ 4'
            WHEN "city" = 'IRAQ     9' THEN 'IRAQ 9'
            WHEN "city" = 'IRAQ     5' THEN 'IRAQ 5'
            WHEN "city" = 'INDIA    8' THEN 'INDIA 8'
            WHEN "city" = 'INDONESIA1' THEN 'INDONESIA 1'
            WHEN "city" = 'INDONESIA3' THEN 'INDONESIA 3'
            WHEN "city" = 'BRAZIL   4' THEN 'BRAZIL 4'
            WHEN "city" = 'ROMANIA  4' THEN 'ROMANIA 4'
            WHEN "city" = 'ROMANIA  7' THEN 'ROMANIA 7'
            WHEN "city" = 'MOZAMBIQU2' THEN 'MOZAMBIQUE'
            WHEN "city" = 'ROMANIA  1' THEN 'ROMANIA'
            WHEN "city" = 'ROMANIA  3' THEN 'ROMANIA'
            WHEN "city" = 'VIETNAM  5' THEN 'VIETNAM'
            WHEN "city" = 'ARGENTINA2' THEN 'ARGENTINA'
            WHEN "city" = 'CHINA    2' THEN 'CHINA'
            WHEN "city" = 'CHINA    4' THEN 'CHINA'
            WHEN "city" = 'FRANCE   0' THEN 'FRANCE'
            WHEN "city" = 'JAPAN    9' THEN 'JAPAN'
            WHEN "city" = 'UNITED KI3' THEN 'UNITED KINGDOM'
            WHEN "city" = 'ALGERIA  2' THEN 'ALGERIA'
            WHEN "city" = 'ARGENTINA7' THEN 'ARGENTINA'
            WHEN "city" = 'BRAZIL   8' THEN 'BRAZIL'
            WHEN "city" = 'ETHIOPIA 8' THEN 'ETHIOPIA'
            WHEN "city" = 'FRANCE   6' THEN 'FRANCE'
            WHEN "city" = 'INDONESIA2' THEN 'INDONESIA'
            WHEN "city" = 'JORDAN   2' THEN 'JORDAN'
            WHEN "city" = 'KENYA    3' THEN 'KENYA'
            WHEN "city" = 'ROMANIA  5' THEN 'ROMANIA'
            WHEN "city" = 'UNITED KI0' THEN 'UNITED KINGDOM'
            WHEN "city" = 'ETHIOPIA 6' THEN 'ETHIOPIA'
            WHEN "city" = 'INDONESIA7' THEN 'INDONESIA'
            WHEN "city" = 'MOROCCO  4' THEN 'MOROCCO'
            WHEN "city" = 'MOROCCO  8' THEN 'MOROCCO'
            WHEN "city" = 'MOZAMBIQU1' THEN 'MOZAMBIQUE'
            WHEN "city" = 'MOZAMBIQU3' THEN 'MOZAMBIQUE'
            WHEN "city" = 'UNITED ST2' THEN 'UNITED STATES'
            WHEN "city" = 'UNITED ST5' THEN 'UNITED STATES'
            WHEN "city" = 'EGYPT    0' THEN 'EGYPT'
            WHEN "city" = 'EGYPT    8' THEN 'EGYPT'
            WHEN "city" = 'EGYPT    9' THEN 'EGYPT'
            WHEN "city" = 'INDIA    9' THEN 'INDIA'
            WHEN "city" = 'IRAN     1' THEN 'IRAN'
            WHEN "city" = 'IRAQ     3' THEN 'IRAQ'
            WHEN "city" = 'PERU     6' THEN 'PERU'
            WHEN "city" = 'UNITED ST6' THEN 'UNITED STATES'
            WHEN "city" = 'UNITED ST8' THEN 'UNITED STATES'
            WHEN "city" = 'CANADA   6' THEN 'CANADA'
            WHEN "city" = 'CHINA    0' THEN 'CHINA'
            WHEN "city" = 'CHINA    1' THEN 'CHINA'
            WHEN "city" = 'CHINA    3' THEN 'CHINA'
            WHEN "city" = 'GERMANY  6' THEN 'GERMANY'
            WHEN "city" = 'INDONESIA8' THEN 'INDONESIA'
            WHEN "city" = 'IRAQ     7' THEN 'IRAQ'
            WHEN "city" = 'JAPAN    1' THEN 'JAPAN'
            WHEN "city" = 'JAPAN    6' THEN 'JAPAN'
            WHEN "city" = 'ROMANIA  6' THEN 'ROMANIA'
            WHEN "city" = 'VIETNAM  1' THEN 'VIETNAM'
            WHEN "city" = 'ARGENTINA8' THEN 'ARGENTINA'
            WHEN "city" = 'GERMANY  7' THEN 'GERMANY'
            WHEN "city" = 'SAUDI ARA1' THEN 'SAUDI ARABIA 1'
            WHEN "city" = 'SAUDI ARA2' THEN 'SAUDI ARABIA 2'
            WHEN "city" = 'SAUDI ARA4' THEN 'SAUDI ARABIA 4'
            WHEN "city" = 'UNITED KI2' THEN 'UNITED KINGDOM 2'
            WHEN "city" = 'UNITED KI7' THEN 'UNITED KINGDOM 7'
            WHEN "city" = 'MOZAMBIQU5' THEN 'MOZAMBIQUE 5'
            WHEN "city" = 'INDONESIA6' THEN 'INDONESIA 6'
            WHEN "city" = 'IRAN     0' THEN 'IRAN 0'
            WHEN "city" = 'IRAQ     6' THEN 'IRAQ 6'
            WHEN "city" = 'KENYA    1' THEN 'KENYA 1'
            WHEN "city" = 'KENYA    7' THEN 'KENYA 7'
            WHEN "city" = 'MOROCCO  1' THEN 'MOROCCO 1'
            WHEN "city" = 'VIETNAM  2' THEN 'VIETNAM 2'
            WHEN "city" = 'BRAZIL   2' THEN 'BRAZIL 2'
            WHEN "city" = 'CANADA   3' THEN 'CANADA 3'
            WHEN "city" = 'CHINA    6' THEN 'CHINA 6'
            WHEN "city" = 'GERMANY  5' THEN 'GERMANY 5'
            WHEN "city" = 'INDIA    4' THEN 'INDIA 4'
            WHEN "city" = 'IRAN     3' THEN 'IRAN 3'
            WHEN "city" = 'JAPAN    0' THEN 'JAPAN 0'
            WHEN "city" = 'JORDAN   3' THEN 'JORDAN 3'
            WHEN "city" = 'MOROCCO  9' THEN 'MOROCCO 9'
            WHEN "city" = 'VIETNAM  8' THEN 'VIETNAM 8'
            WHEN "city" = 'ARGENTINA3' THEN 'ARGENTINA 3'
            WHEN "city" = 'CHINA    9' THEN 'CHINA 9'
            WHEN "city" = 'FRANCE   2' THEN 'FRANCE 2'
            WHEN "city" = 'GERMANY  8' THEN 'GERMANY 8'
            WHEN "city" = 'INDIA    3' THEN 'INDIA 3'
            WHEN "city" = 'INDIA    7' THEN 'INDIA 7'
            WHEN "city" = 'JORDAN   1' THEN 'JORDAN 1'
            WHEN "city" = 'JORDAN   7' THEN 'JORDAN 7'
            WHEN "city" = 'KENYA    0' THEN 'KENYA 0'
            WHEN "city" = 'PERU     0' THEN 'PERU 0'
            WHEN "city" = 'BRAZIL   0' THEN 'BRAZIL 0'
            WHEN "city" = 'BRAZIL   7' THEN 'BRAZIL 7'
            WHEN "city" = 'FRANCE   9' THEN 'FRANCE 9'
            WHEN "city" = 'IRAN     5' THEN 'IRAN 5'
            WHEN "city" = 'JAPAN    5' THEN 'JAPAN 5'
            WHEN "city" = 'KENYA    8' THEN 'KENYA 8'
            WHEN "city" = 'MOROCCO  6' THEN 'MOROCCO 6'
            WHEN "city" = 'RUSSIA   9' THEN 'RUSSIA 9'
            WHEN "city" = 'CANADA   1' THEN 'CANADA 1'
            WHEN "city" = 'EGYPT    5' THEN 'EGYPT 5'
            WHEN "city" = 'INDIA    0' THEN 'INDIA 0'
            WHEN "city" = 'INDONESIA5' THEN 'INDONESIA 5'
            WHEN "city" = 'IRAN     9' THEN 'IRAN 9'
            WHEN "city" = 'JORDAN   6' THEN 'JORDAN 6'
            WHEN "city" = 'PERU     4' THEN 'PERU 4'
            WHEN "city" = 'PERU     9' THEN 'PERU'
            WHEN "city" = 'RUSSIA   4' THEN 'RUSSIA'
            WHEN "city" = 'RUSSIA   7' THEN 'RUSSIA'
            WHEN "city" = 'SAUDI ARA9' THEN 'SAUDI ARABIA'
            WHEN "city" = 'UNITED KI1' THEN 'UNITED KINGDOM'
            WHEN "city" = 'ALGERIA  1' THEN 'ALGERIA'
            WHEN "city" = 'CHINA    8' THEN 'CHINA'
            WHEN "city" = 'FRANCE   5' THEN 'FRANCE'
            WHEN "city" = 'GERMANY  1' THEN 'GERMANY'
            WHEN "city" = 'INDIA    1' THEN 'INDIA'
            WHEN "city" = 'IRAN     8' THEN 'IRAN'
            WHEN "city" = 'KENYA    6' THEN 'KENYA'
            WHEN "city" = 'ROMANIA  2' THEN 'ROMANIA'
            WHEN "city" = 'RUSSIA   8' THEN 'RUSSIA'
            WHEN "city" = 'UNITED KI8' THEN 'UNITED KINGDOM'
            WHEN "city" = 'VIETNAM  6' THEN 'VIETNAM'
            WHEN "city" = 'VIETNAM  9' THEN 'VIETNAM'
            WHEN "city" = 'ARGENTINA0' THEN 'ARGENTINA'
            WHEN "city" = 'BRAZIL   6' THEN 'BRAZIL'
            WHEN "city" = 'ETHIOPIA 0' THEN 'ETHIOPIA'
            WHEN "city" = 'ETHIOPIA 1' THEN 'ETHIOPIA'
            WHEN "city" = 'FRANCE   3' THEN 'FRANCE'
            WHEN "city" = 'FRANCE   7' THEN 'FRANCE'
            WHEN "city" = 'GERMANY  2' THEN 'GERMANY'
            WHEN "city" = 'INDONESIA0' THEN 'INDONESIA'
            WHEN "city" = 'IRAN     2' THEN 'IRAN'
            WHEN "city" = 'IRAQ     1' THEN 'IRAQ'
            WHEN "city" = 'IRAQ     8' THEN 'IRAQ'
            WHEN "city" = 'KENYA    5' THEN 'KENYA'
            WHEN "city" = 'MOZAMBIQU8' THEN 'MOZAMBIQUE'
            WHEN "city" = 'PERU     3' THEN 'PERU'
            WHEN "city" = 'UNITED KI4' THEN 'UNITED KINGDOM'
            WHEN "city" = 'ALGERIA  7' THEN 'ALGERIA'
            WHEN "city" = 'CANADA   7' THEN 'CANADA'
            WHEN "city" = 'CANADA   8' THEN 'CANADA'
            WHEN "city" = 'IRAQ     2' THEN 'IRAQ'
            WHEN "city" = 'SAUDI ARA5' THEN 'SAUDI ARABIA'
            WHEN "city" = 'UNITED ST9' THEN 'UNITED STATES'
            WHEN "city" = 'VIETNAM  4' THEN 'VIETNAM'
            WHEN "city" = 'GERMANY  3' THEN 'GERMANY'
            WHEN "city" = 'IRAQ     0' THEN 'IRAQ'
            WHEN "city" = 'JORDAN   0' THEN 'JORDAN'
            WHEN "city" = 'MOROCCO  0' THEN 'MOROCCO'
            WHEN "city" = 'MOROCCO  3' THEN 'MOROCCO'
            WHEN "city" = 'ROMANIA  9' THEN 'ROMANIA'
            WHEN "city" = 'SAUDI ARA3' THEN 'SAUDI ARABIA'
            WHEN "city" = 'UNITED KI9' THEN 'UNITED KINGDOM'
            WHEN "city" = 'UNITED ST7' THEN 'UNITED STATES'
            WHEN "city" = 'BRAZIL   1' THEN 'BRAZIL'
            WHEN "city" = 'BRAZIL   9' THEN 'BRAZIL'
            WHEN "city" = 'EGYPT    6' THEN 'EGYPT'
            WHEN "city" = 'FRANCE   4' THEN 'FRANCE'
            WHEN "city" = 'JAPAN    7' THEN 'JAPAN'
            WHEN "city" = 'JAPAN    8' THEN 'JAPAN'
            WHEN "city" = 'MOZAMBIQU6' THEN 'MOZAMBIQUE'
            WHEN "city" = 'PERU     5' THEN 'PERU'
            WHEN "city" = 'PERU     7' THEN 'PERU'
            WHEN "city" = 'RUSSIA   2' THEN 'RUSSIA'
            WHEN "city" = 'UNITED KI6' THEN 'UNITED KINGDOM'
            WHEN "city" = 'ARGENTINA1' THEN 'ARGENTINA'
            WHEN "city" = 'BRAZIL   5' THEN 'BRAZIL'
            WHEN "city" = 'CHINA    7' THEN 'CHINA'
            WHEN "city" = 'FRANCE   1' THEN 'FRANCE'
            WHEN "city" = 'INDIA    5' THEN 'INDIA'
            WHEN "city" = 'JAPAN    2' THEN 'JAPAN'
            WHEN "city" = 'JORDAN   9' THEN 'JORDAN'
            WHEN "city" = 'UNITED ST3' THEN 'UNITED STATES'
            WHEN "city" = 'ALGERIA  4' THEN 'ALGERIA'
            WHEN "city" = 'ARGENTINA6' THEN 'ARGENTINA'
            WHEN "city" = 'INDONESIA9' THEN 'INDONESIA'
            WHEN "city" = 'MOROCCO  2' THEN 'MOROCCO'
            WHEN "city" = 'PERU     1' THEN 'PERU'
            WHEN "city" = 'ROMANIA  0' THEN 'ROMANIA'
            WHEN "city" = 'RUSSIA   1' THEN 'RUSSIA'
            WHEN "city" = 'INDIA    6' THEN 'INDIA'
            WHEN "city" = 'INDONESIA4' THEN 'INDONESIA'
            WHEN "city" = 'PERU     2' THEN 'PERU'
            WHEN "city" = 'SAUDI ARA6' THEN 'SAUDI ARABIA'
            WHEN "city" = 'VIETNAM  7' THEN 'VIETNAM'
            WHEN "city" = 'BRAZIL   3' THEN 'BRAZIL'
            WHEN "city" = 'EGYPT    2' THEN 'EGYPT'
            WHEN "city" = 'EGYPT    4' THEN 'EGYPT'
            WHEN "city" = 'ETHIOPIA 4' THEN 'ETHIOPIA'
            WHEN "city" = 'SAUDI ARA8' THEN 'SAUDI ARABIA'
            WHEN "city" = 'ETHIOPIA 2' THEN 'ETHIOPIA'
            WHEN "city" = 'MOZAMBIQU4' THEN 'MOZAMBIQUE'
            WHEN "city" = 'ROMANIA  8' THEN 'ROMANIA'
            WHEN "city" = 'JAPAN    4' THEN 'JAPAN'
            WHEN "city" = 'VIETNAM  3' THEN 'VIETNAM'
            WHEN "city" = 'INDIA    2' THEN 'INDIA'
            WHEN "city" = 'JORDAN   4' THEN 'JORDAN'
            WHEN "city" = 'ALGERIA  3' THEN 'ALGERIA'
            WHEN "city" = 'ALGERIA  9' THEN 'ALGERIA'
            WHEN "city" = 'KENYA    4' THEN 'KENYA'
            WHEN "city" = 'MOZAMBIQU0' THEN 'MOZAMBIQUE'
            WHEN "city" = 'RUSSIA   0' THEN 'RUSSIA'
            WHEN "city" = 'SAUDI ARA0' THEN 'SAUDI ARABIA'
            WHEN "city" = 'ALGERIA  6' THEN 'ALGERIA'
            WHEN "city" = 'MOROCCO  7' THEN 'MOROCCO'
            WHEN "city" = 'CANADA   5' THEN 'CANADA'
            ELSE "city"
        END AS "city",
        "country",
        "region",
        "phone_number",
        "market_segment",
        "street_address"
    FROM "customer_renamed_trimmed"
)

-- COCOON BLOCK END
SELECT *
FROM "customer_renamed_trimmed_cleaned"

stg_customer.yml (Document the table)

version: 2
models:
- name: stg_customer
  description: The table is about customers. It contains customer details including
    a unique customer key, name, address, city, nation, region, phone number, and
    market segment. Each row represents a single customer with their specific information.
    The table appears to be part of a larger database system, likely for managing
    customer information in a business context.
  columns:
  - name: customer_id
    description: Unique identifier for each customer
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each customer.
        For this table, each row is for a single customer. customer_id appears to
        be unique across rows, as it's formatted as "Customer#000000001", "Customer#000000002",
        etc.
      data_type:
        current_data_type: INT
  - name: customer_name
    description: Customer's full name
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: Customer#\d{9}
        summary: Customer IDs in the format of Customer# followed by a 9-digit number
      data_type:
        current_data_type: VARCHAR
  - name: city
    description: Customer's city of residence
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: '[A-Z ]+\s\d$'
        summary: Country name followed by a space and a single digit
      - regex: ^[A-Z ]+$
        summary: Country name only
      - regex: ARGENTINA\d
        summary: Text starting with "ARGENTINA" followed by a single digit
      data_type:
        current_data_type: VARCHAR
  - name: country
    description: Customer's country of residence
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: region
    description: Geographical region of the customer
    tests:
    - not_null
    - accepted_values:
        values:
        - EUROPE
        - AMERICA
        - MIDDLE EAST
        - ASIA
        - AFRICA
    cocoon_meta:
      future_accepted_values:
      - OCEANIA
      - ANTARCTICA
      - CARIBBEAN
      - CENTRAL AMERICA
      - SOUTH AMERICA
      data_type:
        current_data_type: VARCHAR
  - name: phone_number
    description: Customer's phone number
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: \d{2}-\d{3}-\d{3}-\d{4}
        summary: Phone numbers in the format of XX-XXX-XXX-XXXX, where X represents
          a digit
      data_type:
        current_data_type: VARCHAR
  - name: market_segment
    description: Market segment the customer belongs to
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: street_address
    description: Customer's street address
    tests:
    - not_null
    cocoon_meta:
      unusual_values: Values starting with commas are unusual. Random strings like
        '2z1 vZUo2qjaxg7qJouIBb' are not typical for street addresses.
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_supplier

Table Preview (first 5 rows)

supplier_id supplier_name city country region phone_number street_address
0 1 Supplier#000000001 PERU 7 PERU AMERICA 27-989-741-2988 sdrGnXCDRcfriBvY0KL,i
1 2 Supplier#000000002 ETHIOPIA 2 ETHIOPIA AFRICA 15-768-687-3665 TRMhVHz3XiFu
2 3 Supplier#000000003 ARGENTINA 8 ARGENTINA AMERICA 11-719-748-3364 BZ0kXcHUcHjx62L7CjZS
3 4 Supplier#000000004 MOROCCO 5 MOROCCO AFRICA 25-128-190-5944 qGTQJXogS83a7MB
4 5 Supplier#000000005 IRAQ 1 IRAQ MIDDLE EAST 21-750-942-6364 lONEYAh9sF

stg_supplier.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:52:51.226296+00:00
WITH 
"supplier_renamed" AS (
    -- Rename: Renaming columns
    -- S_SUPPKEY -> supplier_id
    -- S_NAME -> supplier_name
    -- S_ADDRESS -> street_address
    -- S_CITY -> city
    -- S_NATION -> country
    -- S_REGION -> region
    -- S_PHONE -> phone_number
    SELECT 
        "S_SUPPKEY" AS "supplier_id",
        "S_NAME" AS "supplier_name",
        "S_ADDRESS" AS "street_address",
        "S_CITY" AS "city",
        "S_NATION" AS "country",
        "S_REGION" AS "region",
        "S_PHONE" AS "phone_number"
    FROM "memory"."main"."supplier"
),

"supplier_renamed_trimmed" AS (
    -- Trim Leading and Trailing Spaces
    SELECT
        "supplier_id",
        "supplier_name",
        "city",
        "country",
        "region",
        "phone_number",
        TRIM("street_address") AS "street_address"
    FROM "supplier_renamed"
),

"supplier_renamed_trimmed_cleaned" AS (
    -- Clean unusual string values: 
    -- city: The problem is that some city names are truncated or have inconsistent spacing. The "SAUDI ARA" entries are truncated and should be "SAUDI ARABIA". The "INDONESIA" entries have inconsistent spacing, with "INDONESIA4" having no space and "INDONESIA1" having one space. All other entries have 2 spaces between the country name and number. The correct values should have consistent formatting with 2 spaces between the country name and number for all entries.
    -- street_address: All values are unusual because they start with commas and consist of random alphanumeric strings instead of typical address components.
    SELECT
        "supplier_id",
        "supplier_name",
        CASE
            WHEN "city" = 'INDONESIA4' THEN 'INDONESIA 4'
            WHEN "city" = 'INDONESIA1' THEN 'INDONESIA 1'
            WHEN "city" = 'SAUDI ARA0' THEN 'SAUDI ARABIA 0'
            WHEN "city" = 'SAUDI ARA7' THEN 'SAUDI ARABIA 7'
            WHEN "city" = 'SAUDI ARA5' THEN 'SAUDI ARABIA 5'
            WHEN "city" = 'SAUDI ARA8' THEN 'SAUDI ARABIA 8'
            WHEN "city" = 'SAUDI ARA1' THEN 'SAUDI ARA 1'
            WHEN "city" = 'SAUDI ARA6' THEN 'SAUDI ARA 6'
            WHEN "city" = 'INDONESIA8' THEN 'INDONESIA 8'
            WHEN "city" = 'INDONESIA3' THEN 'INDONESIA 3'
            WHEN "city" = 'INDONESIA7' THEN 'INDONESIA 7'
            WHEN "city" = 'INDONESIA9' THEN 'INDONESIA 9'
            WHEN "city" = 'INDONESIA0' THEN 'INDONESIA 0'
            WHEN "city" = 'UNITED KI9' THEN 'UNITED KI 9'
            WHEN "city" = 'UNITED ST0' THEN 'UNITED ST 0'
            WHEN "city" = 'UNITED ST8' THEN 'UNITED ST 8'
            WHEN "city" = 'UNITED ST9' THEN 'UNITED ST 9'
            WHEN "city" = 'ARGENTINA7' THEN 'ARGENTINA 7'
            WHEN "city" = 'ARGENTINA8' THEN 'ARGENTINA 8'
            WHEN "city" = 'SAUDI ARA2' THEN 'SAUDI AR 2'
            WHEN "city" = 'SAUDI ARA9' THEN 'SAUDI AR 9'
            WHEN "city" = 'INDONESIA2' THEN 'INDONESI 2'
            WHEN "city" = 'UNITED ST4' THEN 'UNITED S 4'
            WHEN "city" = 'UNITED ST5' THEN 'UNITED S 5'
            WHEN "city" = 'UNITED ST6' THEN 'UNITED S 6'
            WHEN "city" = 'UNITED ST7' THEN 'UNITED S 7'
            WHEN "city" = 'UNITED KI0' THEN 'UNITED K 0'
            WHEN "city" = 'MOZAMBIQU2' THEN 'MOZAMBIQ 2'
            WHEN "city" = 'MOZAMBIQU6' THEN 'MOZAMBIQ 6'
            WHEN "city" = 'INDONESIA5' THEN 'INDONESIA 5'
            WHEN "city" = 'INDONESIA6' THEN 'INDONESIA 6'
            ELSE "city"
        END AS "city",
        "country",
        "region",
        "phone_number",
        "street_address"
    FROM "supplier_renamed_trimmed"
)

-- COCOON BLOCK END
SELECT *
FROM "supplier_renamed_trimmed_cleaned"

stg_supplier.yml (Document the table)

version: 2
models:
- name: stg_supplier
  description: The table is about suppliers. It contains details of each supplier
    including their unique key, name, address, city, nation, region, and phone number.
    Each row represents a single supplier with their specific information. The table
    appears to be a part of a larger database system, likely used for supply chain
    management or procurement processes.
  columns:
  - name: supplier_id
    description: Unique identifier for the supplier
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each supplier.
        For this table, each row represents a single supplier, and supplier_id appears
        to be unique across rows.
      data_type:
        current_data_type: INT
  - name: supplier_name
    description: Name of the supplier
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: Supplier#\d{9}
        summary: Supplier names in the format of "Supplier#" followed by a 9-digit
          number
      data_type:
        current_data_type: VARCHAR
  - name: city
    description: City where the supplier is located
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: '[A-Z\s]{3,14}\s\d'
        summary: Country name (uppercase, potentially truncated) followed by space
          and single digit
      - regex: '[A-Z\s]{5,11}\d'
        summary: Abbreviated country names (in uppercase) followed by a single digit
      data_type:
        current_data_type: VARCHAR
  - name: country
    description: Country where the supplier is based
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: region
    description: Geographical region of the supplier
    tests:
    - not_null
    - accepted_values:
        values:
        - ASIA
        - MIDDLE EAST
        - AFRICA
        - EUROPE
        - AMERICA
    cocoon_meta:
      future_accepted_values:
      - NORTH AMERICA
      - SOUTH AMERICA
      - OCEANIA
      data_type:
        current_data_type: VARCHAR
  - name: phone_number
    description: Contact phone number for the supplier
    tests:
    - not_null
    cocoon_meta:
      patterns:
      - regex: 1[01]-\d{3}-\d{3}-\d{4}
        summary: Phone numbers in the format of XX-XXX-XXX-XXXX, where X is a digit
      - regex: \d{2}-\d{3}-\d{3}-\d{4}
        summary: Phone numbers in the format of 12-xxx-xxx-xxxx where x represents
          digits
      data_type:
        current_data_type: VARCHAR
  - name: street_address
    description: Street address of the supplier
    tests:
    - not_null
    cocoon_meta:
      unusual_values: All values are unusual because they start with commas and consist
        of random alphanumeric strings instead of typical address components.
      patterns:
      - regex: ^,[\w\s]+$
        summary: Values starting with a comma followed by alphanumeric characters
          and spaces
      - regex: ^\d+,[\w\s]+$
        summary: Values starting with digits followed by a comma and alphanumeric
          characters
      - regex: ^[\w\s,]+$
        summary: Alphanumeric strings with possible spaces and commas
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_lineorder

Table Preview (first 5 rows)

order_priority quantity shipping_mode line_number shipping_priority product_id order_id customer_id supplier_id commit_date discount_percentage line_item_revenue line_item_total order_date order_total supply_cost tax_percentage
0 None 13 REG AIR 7 0 99546 277316 19190 1170 1995-05-25 0.0 2009202.0 2009202.0 1995-02-28 23455557.0 92732.0 3.0
1 1-URGENT 14 SHIP 4 0 48214 5929219 28511 1427 1998-09-22 4.0 1562010.0 1627094.0 1998-07-02 20688659.0 69732.0 0.0
2 2-HIGH 4 FOB 4 0 100325 3693318 15811 1170 1992-10-13 9.0 482416.0 530128.0 1992-08-07 19688727.0 79519.0 0.0
3 None 15 MAIL 2 0 126666 2765669 2258 35 1992-07-03 8.0 2335870.0 2538990.0 1992-05-04 3523267.0 101559.0 8.0
4 1-URGENT 17 TRUCK 1 0 172615 686017 23836 804 1995-12-13 7.0 2668111.0 2868937.0 1995-10-14 4014677.0 101256.0 8.0

stg_lineorder.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-26 00:50:02.289320+00:00
WITH 
"lineorder_renamed" AS (
    -- Rename: Renaming columns
    -- LO_ORDERKEY -> order_id
    -- LO_LINENUMBER -> line_number
    -- LO_CUSTKEY -> customer_id
    -- LO_PARTKEY -> product_id
    -- LO_SUPPKEY -> supplier_id
    -- LO_ORDERDATE -> order_date
    -- LO_ORDERPRIORITY -> order_priority
    -- LO_SHIPPRIORITY -> shipping_priority
    -- LO_QUANTITY -> quantity
    -- LO_EXTENDEDPRICE -> line_item_total
    -- LO_ORDTOTALPRICE -> order_total
    -- LO_DISCOUNT -> discount_percentage
    -- LO_REVENUE -> line_item_revenue
    -- LO_SUPPLYCOST -> supply_cost
    -- LO_TAX -> tax_percentage
    -- LO_COMMITDATE -> commit_date
    -- LO_SHIPMODE -> shipping_mode
    SELECT 
        "LO_ORDERKEY" AS "order_id",
        "LO_LINENUMBER" AS "line_number",
        "LO_CUSTKEY" AS "customer_id",
        "LO_PARTKEY" AS "product_id",
        "LO_SUPPKEY" AS "supplier_id",
        "LO_ORDERDATE" AS "order_date",
        "LO_ORDERPRIORITY" AS "order_priority",
        "LO_SHIPPRIORITY" AS "shipping_priority",
        "LO_QUANTITY" AS "quantity",
        "LO_EXTENDEDPRICE" AS "line_item_total",
        "LO_ORDTOTALPRICE" AS "order_total",
        "LO_DISCOUNT" AS "discount_percentage",
        "LO_REVENUE" AS "line_item_revenue",
        "LO_SUPPLYCOST" AS "supply_cost",
        "LO_TAX" AS "tax_percentage",
        "LO_COMMITDATE" AS "commit_date",
        "LO_SHIPMODE" AS "shipping_mode"
    FROM "memory"."main"."lineorder"
),

"lineorder_renamed_null" AS (
    -- NULL Imputation: Impute Null to Disguised Missing Values
    -- order_priority: ['4-NOT SPECIFIED']
    SELECT 
        CASE
            WHEN "order_priority" = '4-NOT SPECIFIED' THEN NULL
            ELSE "order_priority"
        END AS "order_priority",
        "supply_cost",
        "quantity",
        "line_item_total",
        "tax_percentage",
        "commit_date",
        "shipping_mode",
        "line_number",
        "shipping_priority",
        "product_id",
        "discount_percentage",
        "order_id",
        "customer_id",
        "order_total",
        "line_item_revenue",
        "order_date",
        "supplier_id"
    FROM "lineorder_renamed"
),

"lineorder_renamed_null_casted" AS (
    -- Column Type Casting: 
    -- commit_date: from INT to DATE
    -- discount_percentage: from INT to DECIMAL
    -- line_item_revenue: from INT to DECIMAL
    -- line_item_total: from INT to DECIMAL
    -- order_date: from INT to DATE
    -- order_total: from INT to DECIMAL
    -- supply_cost: from INT to DECIMAL
    -- tax_percentage: from INT to DECIMAL
    SELECT
        "order_priority",
        "quantity",
        "shipping_mode",
        "line_number",
        "shipping_priority",
        "product_id",
        "order_id",
        "customer_id",
        "supplier_id",
        strptime(CAST("commit_date" AS VARCHAR), '%Y%m%d') 
        AS "commit_date",
        CAST("discount_percentage" AS DECIMAL) 
        AS "discount_percentage",
        CAST("line_item_revenue" AS DECIMAL) 
        AS "line_item_revenue",
        CAST("line_item_total" AS DECIMAL) 
        AS "line_item_total",
        strptime(CAST("order_date" AS VARCHAR), '%Y%m%d') 
        AS "order_date",
        CAST("order_total" AS DECIMAL) 
        AS "order_total",
        CAST("supply_cost" AS DECIMAL) 
        AS "supply_cost",
        CAST("tax_percentage" AS DECIMAL) 
        AS "tax_percentage"
    FROM "lineorder_renamed_null"
)

-- COCOON BLOCK END
SELECT *
FROM "lineorder_renamed_null_casted"

stg_lineorder.yml (Document the table)

version: 2
models:
- name: stg_lineorder
  description: The table is about line items in orders. It includes order details
    like order key, line number, customer key, part key, supplier key, and order date.
    It also contains shipping information, quantities, prices, discounts, and revenue.
    The table provides a comprehensive view of each line item in an order, including
    financial and logistical aspects.
  columns:
  - name: order_priority
    description: Priority level of the order
    tests:
    - accepted_values:
        values:
        - 5-LOW
        - 2-HIGH
        - 1-URGENT
        - 3-MEDIUM
    cocoon_meta:
      missing_reason: Not applicable for non-priority or standard orders.
      future_accepted_values:
      - 4-NORMAL
      data_type:
        current_data_type: VARCHAR
  - name: quantity
    description: Quantity of items ordered
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: shipping_mode
    description: Mode of shipping for the line item
    tests:
    - not_null
    - accepted_values:
        values:
        - REG AIR
        - FOB
        - SHIP
        - TRUCK
        - AIR
        - RAIL
        - MAIL
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: line_number
    description: Line item number within an order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: shipping_priority
    description: Priority level for shipping the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: product_id
    description: Unique identifier for the product or part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: order_id
    description: Unique identifier for each order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: customer_id
    description: Unique identifier for the customer
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: supplier_id
    description: Unique identifier for the supplier
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: commit_date
    description: Committed delivery date for the line item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: TIMESTAMP
        expected_data_type: DATE
  - name: discount_percentage
    description: Discount percentage applied to the line item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: line_item_revenue
    description: Revenue generated from the line item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: line_item_total
    description: Total price for the line item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: order_date
    description: Date when the order was placed
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: TIMESTAMP
        expected_data_type: DATE
  - name: order_total
    description: Total price for the entire order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: supply_cost
    description: Cost of supplying the item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: tax_percentage
    description: Tax percentage applied to the line item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  tests: []
We identify the primary key (PK) and foreign key (FK) from tables. We build a join graph that connects FK to PK.

Join Graph (FK to PK)

%3 stg_lineorder_0 stg_lineorder stg_supplier_1 stg_supplier stg_lineorder_0->stg_supplier_1 stg_part_2 stg_part stg_lineorder_0->stg_part_2 stg_customer_3 stg_customer stg_lineorder_0->stg_customer_3 stg_date_4 stg_date

cocoon_join.yml (Document the joins)

join_graph:
- table_name: stg_customer
  primary_key: customer_id
  foreign_keys: []
- table_name: stg_lineorder
  foreign_keys:
  - column: customer_id
    reference:
      table_name: stg_customer
      column: customer_id
  - column: product_id
    reference:
      table_name: stg_part
      column: part_id
  - column: supplier_id
    reference:
      table_name: stg_supplier
      column: supplier_id
  time_keys:
  - commit_date
  - order_date
- table_name: stg_part
  primary_key: part_id
  foreign_keys: []
- table_name: stg_supplier
  primary_key: supplier_id
  foreign_keys: []
- table_name: stg_date
  time_keys:
  - full_date
We illustrate the step-by-step process behind the data.

All

Process Story

cocoon_er.yml (Document the entity relationships)

groups:
- group_name: DateDimension
  group_summary: This group focuses on date-related information, providing a comprehensive
    breakdown of calendar dates and their various attributes for data analysis and
    reporting purposes.
  tables:
  - stg_date
  join_info: As there is only one table in this group, no join information is applicable.
    However, this table can potentially be used to join with other tables that have
    date-related fields using the date_key or full_date attributes.
entities:
- entity_name: Customers
  entity_description: Represents individual customers with their personal and contact
    information.
  table_name: stg_customer
  primary_key: customer_id
- entity_name: Parts
  entity_description: Represents individual parts or products with their specifications
    and classifications.
  table_name: stg_part
  primary_key: part_id
- entity_name: Suppliers
  entity_description: Represents individual suppliers with their company and contact
    information.
  table_name: stg_supplier
  primary_key: supplier_id
relations:
- relation_name: OrderLineItems
  relation_description: Customers place orders for Parts, which are supplied by Suppliers,
    creating line items in the order.
  table_name: stg_lineorder
  entities:
  - Customers
  - Parts
  - Suppliers
story:
- name: DateDimension
  description: Calendar dates track order timelines and supplier deliveries.
  type: group
- name: OrderLineItems
  description: Customers order parts from suppliers, creating detailed invoices.
  type: relation