We display the source tables from the data warehouses to model.

customer

Table Preview(first 5 rows)

C_CUSTKEY C_NAME C_ADDRESS C_NATIONKEY C_PHONE C_ACCTBAL C_MKTSEGMENT C_COMMENT
0 IVhzIApeRb ot c E 15 9-567-846-8931 711.56 HOUSEHOLD regular deposits detect blithely agai
1 XSTf4 NCwDVaWNe6tEgvwfmRchLXak 13 23-768-687-3665 121.65 AUTOMOBILE pending accounts haggle furiously deposit None
2 MG9kdTD2WBHm 1 11-719-748-3364 7498.12 AUTOMOBILE special packages hang ironic silent packages None
3 XxVSJsLAGtn 4 14-128-190-5944 2866.83 MACHINERY pending requests wake carefully express even accounts None
4 KvpyuHCplrB84WgAiGV6sYpZq7Tj 3 13-750-942-6364 794.47 HOUSEHOLD accounts wake furiously even instructions None None

lineitem

Table Preview(first 5 rows)

L_ORDERKEY L_PARTKEY L_SUPPKEY L_LINENUMBER L_QUANTITY L_EXTENDEDPRICE L_DISCOUNT L_TAX L_RETURNFLAG L_LINESTATUS L_SHIPDATE L_COMMITDATE L_RECEIPTDATE L_SHIPINSTRUCT L_SHIPMODE L_COMMENT
0 1 1552 93 1 17 24710.35 0.04 0.02 N O 1996-03-13 1996-02-12 1996-03-22 DELIVER IN PERSON TRUCK egular courts above the
1 1 674 75 2 36 56688.00 0.09 0.06 N O 1996-04-12 1996-02-28 1996-04-20 TAKE BACK RETURN MAIL ly final dependencies: slyly bold
2 2 1 99 1 38 37850.00 0.00 0.05 N O 1997-01-28 1997-01-14 1997-02-02 TAKE BACK RETURN RAIL ven requests. deposits breach a
3 3 355 56 1 8 8376.00 0.07 0.02 R F 1994-02-02 1994-01-04 1994-02-23 NONE AIR ongside of the furiously brave acco
4 3 278 79 2 42 47352.00 0.08 0.04 R F 1994-01-13 1994-02-09 1994-01-27 DELIVER IN PERSON TRUCK blithely bold excuses wake fluffily

nation

Table Preview(first 5 rows)

N_NATIONKEY N_NAME N_REGIONKEY N_COMMENT
0 0 ALGERIA 0 final accounts wake quickly. special reques
1 1 ARGENTINA 1 idly final instructions cajole stealthily. regular instructions wake carefully blithely express accounts. fluffily bold excuses
2 2 BRAZIL 1 always pending pinto beans sleep sil
3 3 CANADA 1 foxes among the bold requests ideally blithely final dependencies. furiously express accounts wake blithely
4 4 EGYPT 4 fluffily unusual packages wake blithe

order

Table Preview(first 5 rows)

O_ORDERKEY O_CUSTKEY O_ORDERSTATUS O_TOTALPRICE O_ORDERDATE O_ORDERPRIORITY O_CLERK O_SHIPPRIORITY O_COMMENT
0 1 370 O 172799.49 1996-01-02 5-LOW Clerk#000000951 0 nstructions sleep furiously among
1 2 781 O 38426.09 1996-12-01 1-URGENT Clerk#000000880 0 foxes. pending accounts at the pending silent asymptot
2 3 1234 F 205654.30 1993-10-14 5-LOW Clerk#000000955 0 sly final accounts boost. carefully regular ideas cajole carefully. depos
3 4 1369 O 56000.91 1995-10-11 5-LOW Clerk#000000124 0 sits. slyly regular warthogs cajole. regular regular theodolites acro
4 5 445 F 105367.67 1994-07-30 5-LOW Clerk#000000925 0 quickly. bold deposits sleep slyly. packages use slyly

part

Table Preview(first 5 rows)

P_PARTKEY P_NAME P_MFGR P_BRAND P_TYPE P_SIZE P_CONTAINER P_RETAILPRICE P_COMMENT
0 1 goldenrod lavender spring chocolate lace Manufacturer#1 Brand#13 PROMO BURNISHED COPPER 7 JUMBO PKG 901.0 ly. slyly ironi
1 2 blush thistle blue yellow saddle Manufacturer#1 Brand#13 LARGE BRUSHED BRASS 1 LG CASE 902.0 lar accounts amo
2 3 spring green yellow purple cornsilk Manufacturer#4 Brand#42 STANDARD POLISHED BRASS 21 WRAP CASE 903.0 egular deposits hag
3 4 cornflower chocolate smoke green pink Manufacturer#3 Brand#34 SMALL PLATED BRASS 14 MED DRUM 904.0 p furiously r
4 5 forest brown coral puff cream Manufacturer#3 Brand#32 STANDARD POLISHED TIN 15 SM PKG 905.0 wake carefully

partsupp

Table Preview(first 5 rows)

PS_PARTKEY PS_SUPPKEY PS_AVAILQTY PS_SUPPLYCOST PS_COMMENT
0 1 1 3325 771.64 final theodolites. special packages above the quickly ironic waters cajo
1 1 2 8076 993.49 ven ideas. quickly even packages print. pending multipliers
2 2 3 3956 337.09 fluffily ironic deposits are quickly pending accounts. furiously final foxes are slyly. silent requests sleep quickly ironic ideas
3 2 4 4069 357.84 carefully regular excuses. carefully silent pinto beans sleep. quickly even deposits sleep quickly furiously pending dolphins
4 3 5 4673 647.59 quickly special packages sleep carefully above the ironic deposits

region

Table Preview(first 5 rows)

R_REGIONKEY R_NAME R_COMMENT
0 0 AFRICA lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to
1 1 AMERICA hs use ironic even requests. s
2 2 ASIA ges. thinly even pinto beans ca
3 3 EUROPE ly final courts cajole furiously final excuse
4 4 MIDDLE EAST uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl

supplier

Table Preview(first 5 rows)

S_SUPPKEY S_NAME S_ADDRESS S_NATIONKEY S_PHONE S_ACCTBAL S_COMMENT
0 Supplier#000000001 N kD4on9OM Ipw3 gf0JBoQDd7tgrzrddZ 17 27-918-335-1736 5755.94 each slyly above the careful
1 Supplier#000000002 89eJ5ksX3ImxJQBvxObC 5 15-679-861-2259 4032.68 slyly bold instructions. idle deposi None
2 Supplier#000000003 q1 G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3 1 11-383-516-1199 4192.40 blithely silent requests after the express dependencies are sl
3 Supplier#000000004 Bk7ah4CK8SYQTepEmvMkkgMwg 15 25-843-787-7479 4641.08 riously even requests above the exp None
4 Supplier#000000005 Gcdm2rJRzl5qlTVzc 4 14-606-487-0570 9915.24 quickly above the quickly ironic deposits affix None
We display the stage tables, their SQL transformations, and their configurations.

stg_partsupp

Table Preview (first 5 rows)

part_key supplier_key available_quantity supply_cost comment
0 1 1 3325 771.64 final theodolites. special packages above the quickly ironic waters cajo
1 1 2 8076 993.49 ven ideas. quickly even packages print. pending multipliers
2 2 3 3956 337.09 fluffily ironic deposits are quickly pending accounts. furiously final foxes are slyly. silent requests sleep quickly ironic ideas
3 2 4 4069 357.84 carefully regular excuses. carefully silent pinto beans sleep. quickly even deposits sleep quickly furiously pending dolphins
4 3 5 4673 647.59 quickly special packages sleep carefully above the ironic deposits

stg_partsupp.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:26:38.079237+00:00
WITH 
"partsupp_renamed" AS (
    -- Rename: Renaming columns
    -- PS_PARTKEY -> part_key
    -- PS_SUPPKEY -> supplier_key
    -- PS_AVAILQTY -> available_quantity
    -- PS_SUPPLYCOST -> supply_cost
    -- PS_COMMENT -> comment
    SELECT 
        "PS_PARTKEY" AS "part_key",
        "PS_SUPPKEY" AS "supplier_key",
        "PS_AVAILQTY" AS "available_quantity",
        "PS_SUPPLYCOST" AS "supply_cost",
        "PS_COMMENT" AS "comment"
    FROM "memory"."main"."partsupp"
)

-- COCOON BLOCK END
SELECT *
FROM "partsupp_renamed"

stg_partsupp.yml (Document the table)

version: 2
models:
- name: stg_partsupp
  description: The table is the PartSupplier relation between parts and suppliers.
    It contains information about the availability and cost of parts from specific
    suppliers. Each row represents a part-supplier combination with details like available
    quantity, supply cost, and a comment. The primary keys are PS_PARTKEY and PS_SUPPKEY,
    linking to the Part and Supplier tables respectively.
  columns:
  - name: part_key
    description: Part key, links to Part table
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: supplier_key
    description: Supplier key, links to Supplier table
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: available_quantity
    description: Available quantity of the part from the supplier
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: supply_cost
    description: Cost of the part from the supplier
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: comment
    description: Additional comments or notes about the part-supplier relationship
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_part

Table Preview (first 5 rows)

part_id part_name manufacturer brand part_type size_ packaging retail_price comments
0 1 goldenrod lavender spring chocolate lace Manufacturer#1 Brand#13 PROMO BURNISHED COPPER 7 JUMBO PKG 901.0 None
1 2 blush thistle blue yellow saddle Manufacturer#1 Brand#13 LARGE BRUSHED BRASS 1 LG CASE 902.0 None
2 3 spring green yellow purple cornsilk Manufacturer#4 Brand#42 STANDARD POLISHED BRASS 21 WRAP CASE 903.0 None
3 4 cornflower chocolate smoke green pink Manufacturer#3 Brand#34 SMALL PLATED BRASS 14 MED DRUM 904.0 None
4 5 forest brown coral puff cream Manufacturer#3 Brand#32 STANDARD POLISHED TIN 15 SM PKG 905.0 None

stg_part.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:26:17.331735+00:00
WITH 
"part_renamed" AS (
    -- Rename: Renaming columns
    -- P_PARTKEY -> part_id
    -- P_NAME -> part_name
    -- P_MFGR -> manufacturer
    -- P_BRAND -> brand
    -- P_TYPE -> part_type
    -- P_SIZE -> size_
    -- P_CONTAINER -> packaging
    -- P_RETAILPRICE -> retail_price
    -- P_COMMENT -> comments
    SELECT 
        "P_PARTKEY" AS "part_id",
        "P_NAME" AS "part_name",
        "P_MFGR" AS "manufacturer",
        "P_BRAND" AS "brand",
        "P_TYPE" AS "part_type",
        "P_SIZE" AS "size_",
        "P_CONTAINER" AS "packaging",
        "P_RETAILPRICE" AS "retail_price",
        "P_COMMENT" AS "comments"
    FROM "memory"."main"."part"
),

"part_renamed_cleaned" AS (
    -- Clean unusual string values: 
    -- comments: The problem is that all values in the comments column are incomplete fragments of sentences, making their meaning unclear. These fragments appear to be cut-off portions of longer comments. Without the full context, it's impossible to determine the intended complete sentences. The correct approach is to treat these as corrupted data and map them to empty strings, as we cannot accurately reconstruct the original comments.
    SELECT
        "part_id",
        "part_name",
        "manufacturer",
        "brand",
        "part_type",
        "size_",
        "packaging",
        "retail_price",
        CASE
            WHEN "comments" = 'egular deposits hag' THEN NULL
            WHEN "comments" = 'lar accounts amo' THEN NULL
            WHEN "comments" = 'ly. slyly ironi' THEN NULL
            WHEN "comments" = 'p furiously r' THEN NULL
            WHEN "comments" = 'wake carefully' THEN NULL
            ELSE "comments"
        END AS "comments"
    FROM "part_renamed"
)

-- COCOON BLOCK END
SELECT *
FROM "part_renamed_cleaned"

stg_part.yml (Document the table)

version: 2
models:
- name: stg_part
  description: The table is about product parts. It includes details such as part
    key, name, manufacturer, brand, type, size, container, retail price, and a comment.
    Each row represents a unique part with its specific attributes. The parts vary
    in characteristics like color, material, and packaging type. Pricing seems to
    increase slightly with each part key.
  columns:
  - name: part_id
    description: Unique identifier for the part
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each part. For
        this table, each row represents a unique part. part_id is unique across rows
        and incrementing, suggesting it's designed to be a primary key.
      data_type:
        current_data_type: INT
  - name: part_name
    description: Descriptive name of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: manufacturer
    description: Manufacturer of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: brand
    description: Brand of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: part_type
    description: Type and material of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: size_
    description: Size of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: packaging
    description: Container or packaging type for the part
    tests:
    - not_null
    - accepted_values:
        values:
        - JUMBO PKG
        - LG CASE
        - MED DRUM
        - SM PKG
        - WRAP CASE
    cocoon_meta:
      future_accepted_values:
      - XL DRUM
      - PALLET
      - CRATE
      - BOX
      - CARTON
      - TUBE
      - BAG
      - ROLL
      - BUNDLE
      data_type:
        current_data_type: VARCHAR
  - name: retail_price
    description: Retail price of the part
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: comments
    description: Additional comments about the part
    cocoon_meta:
      missing_reason: Optional field for additional information, not always needed.
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_nation

Table Preview (first 5 rows)

nation_id country_name region_id country_description
0 0 ALGERIA 0 final accounts wake quickly. special reques
1 1 ARGENTINA 1 idly final instructions cajole stealthily. regular instructions wake carefully blithely express accounts. fluffily bold excuses
2 2 BRAZIL 1 always pending pinto beans sleep sil
3 3 CANADA 1 foxes among the bold requests ideally blithely final dependencies. furiously express accounts wake blithely
4 4 EGYPT 4 fluffily unusual packages wake blithe

stg_nation.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:24:16.330008+00:00
WITH 
"nation_renamed" AS (
    -- Rename: Renaming columns
    -- N_NATIONKEY -> nation_id
    -- N_NAME -> country_name
    -- N_REGIONKEY -> region_id
    -- N_COMMENT -> country_description
    SELECT 
        "N_NATIONKEY" AS "nation_id",
        "N_NAME" AS "country_name",
        "N_REGIONKEY" AS "region_id",
        "N_COMMENT" AS "country_description"
    FROM "memory"."main"."nation"
)

-- COCOON BLOCK END
SELECT *
FROM "nation_renamed"

stg_nation.yml (Document the table)

version: 2
models:
- name: stg_nation
  description: The table is about nations, with details of each country. It includes
    a unique nation key, the country's name, a region key (likely linking to a separate
    region table), and a brief comment about the country. The comment seems to be
    a random phrase or description, possibly for testing or placeholder purposes.
  columns:
  - name: nation_id
    description: Unique identifier for each nation
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each nation. For
        this table, each row represents a distinct country, and nation_id appears
        to be unique across rows.
      data_type:
        current_data_type: INT
  - name: country_name
    description: Name of the country
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column contains the name of each country. While country
        names are generally unique in the real world, there could potentially be duplicate
        names or variations. However, based on the given sample, it appears to be
        unique.
      data_type:
        current_data_type: VARCHAR
  - name: region_id
    description: Foreign key linking to a region table
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: country_description
    description: Brief, seemingly random description about the country
    tests:
    - not_null
    cocoon_meta:
      unusual_values: Values are unusual because they are incomplete sentences, likely
        truncated from longer descriptions.
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_customer

Table Preview (first 5 rows)

customer_id customer_name address phone_number comments market_segment account_balance nation_id
0 IVhzIApeRbot None None 9-567-846-8931 regular deposits detect blithely agai HOUSEHOLD 711.56 15.0
1 XSTf4000000000000000000000000 None None None None None NaN NaN
2 MG9kdTD2WBHm0000000000000000 None 11-719-748-3364 None None None NaN NaN
3 XxVSJsLAGtn00000000000000000 None 14-128-190-5944 None None None NaN NaN
4 KvpyuHCplrB84WgAiGV6sYpZq7Tj None 13-750-942-6364 None None None NaN NaN

stg_customer.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:22:15.794717+00:00
WITH 
"customer_renamed" AS (
    -- Rename: Renaming columns
    -- C_CUSTKEY -> customer_id
    -- C_NAME -> customer_name
    -- C_ADDRESS -> address
    -- C_NATIONKEY -> nation_id
    -- C_PHONE -> phone_number
    -- C_ACCTBAL -> account_balance
    -- C_MKTSEGMENT -> market_segment
    -- C_COMMENT -> comments
    SELECT 
        "C_CUSTKEY" AS "customer_id",
        "C_NAME" AS "customer_name",
        "C_ADDRESS" AS "address",
        "C_NATIONKEY" AS "nation_id",
        "C_PHONE" AS "phone_number",
        "C_ACCTBAL" AS "account_balance",
        "C_MKTSEGMENT" AS "market_segment",
        "C_COMMENT" AS "comments"
    FROM "memory"."main"."customer"
),

"customer_renamed_trimmed" AS (
    -- Trim Leading and Trailing Spaces
    SELECT
        "customer_id",
        "customer_name",
        "address",
        "nation_id",
        "phone_number",
        "account_balance",
        "comments",
        TRIM("market_segment") AS "market_segment"
    FROM "customer_renamed"
),

"customer_renamed_trimmed_cleaned" AS (
    -- Clean unusual string values: 
    -- customer_id: The problem is that the customer_id values have inconsistent patterns and lengths. The most unusual value is 'IVhzIApeRb ot' because it contains a space, which is atypical for customer IDs. The other values have varying lengths and character patterns. The correct values should follow a consistent format, ideally without spaces and with a standard length. Since 'KvpyuHCplrB84WgAiGV6sYpZq7Tj' is the longest and most complex ID, it appears to be the most likely format for a proper customer ID.
    -- customer_name: The problem is that the customer_name column contains values that are not typical names.  The values '1', '3', '4', and 'c' appear to be placeholders or data entry errors.  'NCwDVaWNe6tEgvwfmRchLXak' looks like a random string, possibly an ID that was mistakenly placed in the name column.  None of these are valid customer names. The correct approach would be to replace these  invalid entries with an empty string, indicating missing data, as we don't have enough  information to determine the actual customer names.
    -- address: The problem is that two values ('E' and '13') do not follow the phone number format of the other entries. The correct values should be in the format XX-XXX-XXX-XXXX, where X represents digits. The value 'E' is completely invalid as a phone number, while '13' is likely a partial or incorrect entry.
    -- nation_id: The nation_id column contains inconsistent data types and formats. '15' appears to be a valid numeric ID, while '23-768-687-3665' looks like a phone number, and '2866.83', '7498.12', '794.47' are decimal numbers. These inconsistencies suggest data entry errors or misplaced information from other columns. Since '15' is the only value that resembles a typical integer ID format and is the most frequent, we'll consider it as the correct format for nation_id. The other values should be treated as errors and mapped to an empty string, as they don't contain valid nation ID information.
    -- phone_number: The problem is that the phone_number column contains values that are not phone numbers. The value '121.65' appears to be a decimal number, while 'AUTOMOBILE', 'HOUSEHOLD', and 'MACHINERY' seem to be category names, possibly from another column. Only '9-567-846-8931' resembles a valid phone number format. The correct values should all be phone numbers or empty strings if the actual phone number is unknown.
    -- account_balance: The problem is that the account_balance column contains non-numeric values which are inappropriate for an account balance. The only correct value is '711.56', which is a numeric balance. The other values are either a vehicle type ('AUTOMOBILE') or nonsensical sentences, which have no place in an account balance field. The correct values should all be numeric account balances or empty strings if the balance is unknown.
    -- market_segment: The problem is that the market_segment column contains some unusual and grammatically incorrect values. 'HOUSEHOLD' appears to be a valid market segment descriptor, while 'even accounts', 'pending accounts haggle furiously deposit', and 'silent packages' are not typical market segment descriptors and contain grammatical issues or nonsensical phrases. The correct values should be standardized market segment descriptors. Since 'HOUSEHOLD' is the only clear and valid segment, we'll map the unusual values to empty strings, as we don't have enough information to accurately categorize them into other segments.
    SELECT
        CASE
            WHEN "customer_id" = 'IVhzIApeRb ot' THEN 'IVhzIApeRbot'
            WHEN "customer_id" = 'XSTf4' THEN 'XSTf4000000000000000000000000'
            WHEN "customer_id" = 'XxVSJsLAGtn' THEN 'XxVSJsLAGtn00000000000000000'
            WHEN "customer_id" = 'MG9kdTD2WBHm' THEN 'MG9kdTD2WBHm0000000000000000'
            ELSE "customer_id"
        END AS "customer_id",
        CASE
            WHEN "customer_name" = '1' THEN NULL
            WHEN "customer_name" = '3' THEN NULL
            WHEN "customer_name" = '4' THEN NULL
            WHEN "customer_name" = 'c' THEN NULL
            WHEN "customer_name" = 'NCwDVaWNe6tEgvwfmRchLXak' THEN NULL
            ELSE "customer_name"
        END AS "customer_name",
        CASE
            WHEN "address" = 'E' THEN NULL
            WHEN "address" = '13' THEN NULL
            ELSE "address"
        END AS "address",
        CASE
            WHEN "nation_id" = '23-768-687-3665' THEN NULL
            WHEN "nation_id" = '2866.83' THEN NULL
            WHEN "nation_id" = '7498.12' THEN NULL
            WHEN "nation_id" = '794.47' THEN NULL
            ELSE "nation_id"
        END AS "nation_id",
        CASE
            WHEN "phone_number" = '121.65' THEN NULL
            WHEN "phone_number" = 'AUTOMOBILE' THEN NULL
            WHEN "phone_number" = 'HOUSEHOLD' THEN NULL
            WHEN "phone_number" = 'MACHINERY' THEN NULL
            ELSE "phone_number"
        END AS "phone_number",
        CASE
            WHEN "account_balance" = 'AUTOMOBILE' THEN NULL
            WHEN "account_balance" = 'accounts wake furiously even instructions' THEN NULL
            WHEN "account_balance" = 'pending requests wake carefully express' THEN NULL
            WHEN "account_balance" = 'special packages hang ironic' THEN NULL
            ELSE "account_balance"
        END AS "account_balance",
        "comments",
        CASE
            WHEN "market_segment" = 'even accounts' THEN NULL
            WHEN "market_segment" = 'pending accounts haggle furiously deposit' THEN NULL
            WHEN "market_segment" = 'silent packages' THEN NULL
            ELSE "market_segment"
        END AS "market_segment"
    FROM "customer_renamed_trimmed"
),

"customer_renamed_trimmed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- account_balance: from VARCHAR to DECIMAL
    -- nation_id: from VARCHAR to INT
    SELECT
        "customer_id",
        "customer_name",
        "address",
        "phone_number",
        "comments",
        "market_segment",
        CAST("account_balance" AS DECIMAL) 
        AS "account_balance",
        CAST("nation_id" AS INT) 
        AS "nation_id"
    FROM "customer_renamed_trimmed_cleaned"
)

-- COCOON BLOCK END
SELECT *
FROM "customer_renamed_trimmed_cleaned_casted"

stg_customer.yml (Document the table)

version: 2
models:
- name: stg_customer
  description: The table is about customers. It contains customer details like customer
    key, name, address, nation key, phone number, account balance, market segment,
    and comments. Each row represents a unique customer with their associated information.
    The table appears to be part of a larger database, possibly for a business tracking
    customer information across different market segments and nations.
  columns:
  - name: customer_id
    description: Unique identifier for each customer
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each customer.
        For this table, each row is for a specific customer, and customer_id is unique
        across rows.
      data_type:
        current_data_type: VARCHAR
  - name: customer_name
    description: Name of the customer
    cocoon_meta:
      missing_reason: Customer ID is used instead of name for identification.
      data_type:
        current_data_type: VARCHAR
  - name: address
    description: Physical address of the customer
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: VARCHAR
  - name: phone_number
    description: Customer's phone number
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: VARCHAR
  - name: comments
    description: Additional comments or notes about the customer
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: VARCHAR
  - name: market_segment
    description: Market segment the customer belongs to
    tests:
    - accepted_values:
        values:
        - HOUSEHOLD
    cocoon_meta:
      missing_reason: Unknown
      future_accepted_values:
      - CORPORATE
      - SMALL BUSINESS
      - GOVERNMENT
      - EDUCATION
      - NON-PROFIT
      - HEALTHCARE
      - RETAIL
      - MANUFACTURING
      - FINANCIAL SERVICES
      - TECHNOLOGY
      - HOSPITALITY
      - TRANSPORTATION
      - ENERGY
      - AGRICULTURE
      - CONSTRUCTION
      data_type:
        current_data_type: VARCHAR
  - name: account_balance
    description: Customer's account balance
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: DECIMAL
  - name: nation_id
    description: Identifier for the customer's nation
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: INT
  tests: []

stg_order

Table Preview (first 5 rows)

order_id customer_id order_status total_price order_priority clerk_id shipping_priority order_comments order_date
0 1 370 O 172799.49 5-LOW Clerk#000000951 0 nstructions sleep furiously among 1996-01-02
1 2 781 O 38426.09 1-URGENT Clerk#000000880 0 foxes. pending accounts at the pending silent asymptot 1996-12-01
2 3 1234 F 205654.30 5-LOW Clerk#000000955 0 sly final accounts boost. carefully regular ideas cajole carefully. depos 1993-10-14
3 4 1369 O 56000.91 5-LOW Clerk#000000124 0 sits. slyly regular warthogs cajole. regular regular theodolites acro 1995-10-11
4 5 445 F 105367.67 5-LOW Clerk#000000925 0 quickly. bold deposits sleep slyly. packages use slyly 1994-07-30

stg_order.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:25:13.064323+00:00
WITH 
"order_renamed" AS (
    -- Rename: Renaming columns
    -- O_ORDERKEY -> order_id
    -- O_CUSTKEY -> customer_id
    -- O_ORDERSTATUS -> order_status
    -- O_TOTALPRICE -> total_price
    -- O_ORDERDATE -> order_date
    -- O_ORDERPRIORITY -> order_priority
    -- O_CLERK -> clerk_id
    -- O_SHIPPRIORITY -> shipping_priority
    -- O_COMMENT -> order_comments
    SELECT 
        "O_ORDERKEY" AS "order_id",
        "O_CUSTKEY" AS "customer_id",
        "O_ORDERSTATUS" AS "order_status",
        "O_TOTALPRICE" AS "total_price",
        "O_ORDERDATE" AS "order_date",
        "O_ORDERPRIORITY" AS "order_priority",
        "O_CLERK" AS "clerk_id",
        "O_SHIPPRIORITY" AS "shipping_priority",
        "O_COMMENT" AS "order_comments"
    FROM "memory"."main"."order"
),

"order_renamed_casted" AS (
    -- Column Type Casting: 
    -- order_date: from VARCHAR to DATE
    SELECT
        "order_id",
        "customer_id",
        "order_status",
        "total_price",
        "order_priority",
        "clerk_id",
        "shipping_priority",
        "order_comments",
        CAST("order_date" AS DATE) 
        AS "order_date"
    FROM "order_renamed"
)

-- COCOON BLOCK END
SELECT *
FROM "order_renamed_casted"

stg_order.yml (Document the table)

version: 2
models:
- name: stg_order
  description: The table is about orders. It includes details such as order key, customer
    key, order status, total price, order date, priority, clerk, shipping priority,
    and comments. Each row represents a unique order with its associated information.
    The table captures various aspects of orders placed by customers, including financial
    details, timing, and processing information.
  columns:
  - name: order_id
    description: Unique identifier for each order
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each order. For
        this table, each row represents a distinct order, and order_id is unique across
        rows.
      data_type:
        current_data_type: INT
  - name: customer_id
    description: Customer identifier associated with the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: order_status
    description: Current status of the order
    tests:
    - not_null
    - accepted_values:
        values:
        - O
        - F
    cocoon_meta:
      future_accepted_values:
      - P
      - S
      - D
      - C
      - R
      data_type:
        current_data_type: VARCHAR
  - name: total_price
    description: Total price of the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: order_priority
    description: Priority level of the order
    tests:
    - not_null
    - accepted_values:
        values:
        - 5-LOW
        - 1-URGENT
    cocoon_meta:
      future_accepted_values:
      - 2-HIGH
      - 3-MEDIUM
      - 4-NORMAL
      data_type:
        current_data_type: VARCHAR
  - name: clerk_id
    description: Identifier of the clerk processing the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: shipping_priority
    description: Shipping priority of the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: order_comments
    description: Additional comments or notes about the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  - name: order_date
    description: Date when the order was placed
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DATE
  tests: []

stg_supplier

Table Preview (first 5 rows)

supplier_id supplier_name supplier_address supplier_phone supplier_comments account_balance nation_id
0 Supplier#000000001 None None 27-918-335-1736 None 5755.94 17.0
1 Supplier#000000002 None 5 None None NaN NaN
2 Supplier#000000003 None None 11-383-516-1199 None 4192.40 1.0
3 Supplier#000000004 None 15 None None NaN NaN
4 Supplier#000000005 None 4 None None NaN NaN

stg_supplier.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:28:27.866538+00:00
WITH 
"supplier_renamed" AS (
    -- Rename: Renaming columns
    -- S_SUPPKEY -> supplier_id
    -- S_NAME -> supplier_name
    -- S_ADDRESS -> supplier_address
    -- S_NATIONKEY -> nation_id
    -- S_PHONE -> supplier_phone
    -- S_ACCTBAL -> account_balance
    -- S_COMMENT -> supplier_comments
    SELECT 
        "S_SUPPKEY" AS "supplier_id",
        "S_NAME" AS "supplier_name",
        "S_ADDRESS" AS "supplier_address",
        "S_NATIONKEY" AS "nation_id",
        "S_PHONE" AS "supplier_phone",
        "S_ACCTBAL" AS "account_balance",
        "S_COMMENT" AS "supplier_comments"
    FROM "memory"."main"."supplier"
),

"supplier_renamed_trimmed" AS (
    -- Trim Leading and Trailing Spaces
    SELECT
        "supplier_id",
        "supplier_name",
        "supplier_address",
        "nation_id",
        "supplier_phone",
        "supplier_comments",
        TRIM("account_balance") AS "account_balance"
    FROM "supplier_renamed"
),

"supplier_renamed_trimmed_cleaned" AS (
    -- Clean unusual string values: 
    -- supplier_name: The problem is that all values in the supplier_name column appear to be encrypted, hashed, or randomly generated strings. These do not resemble typical supplier names, which would usually be recognizable company or individual names. Without additional information to decode these strings, it's impossible to determine the correct supplier names. Therefore, the best approach is to map all these unusual values to an empty string, indicating that the actual supplier name is unknown or unavailable.
    -- supplier_address: The problem is that the supplier_address column contains unusual values. The values '15', '4', and '5' are unusually short for addresses and likely represent incomplete data. The values 'G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3' and 'gf0JBoQDd7tgrzrddZ' appear to be random strings rather than valid addresses. These random strings are meaningless and should be replaced with empty strings to indicate missing data. The short numeric values should be kept as is, as they might represent valid partial address information (like building numbers).
    -- nation_id: The problem is that some values in the nation_id column resemble phone numbers (14-606-487-0570, 15-679-861-2259, 25-843-787-7479) rather than typical nation IDs. The correct values should be short numeric codes. The most frequent valid value is '1', and there's also a '17', which follows the same pattern of being a short numeric code. We'll map the phone number-like values to empty strings as they don't seem to represent valid nation IDs.
    -- supplier_phone: The problem is that the supplier_phone column contains three decimal numbers (4032.68, 4641.08, and 9915.24) which are not valid phone numbers. The correct values should be in the format of xx-xxx-xxx-xxxx, as seen in the two most frequent entries. Since these decimal numbers are meaningless in the context of phone numbers, they should be mapped to an empty string.
    -- supplier_comments: The problem is that the values in the supplier_comments column are incomplete sentences that seem to be truncated. They lack proper context and meaning. The correct values should be complete sentences or meaningful phrases related to supplier comments. However, without more context or the full dataset, it's difficult to determine what the complete sentences should be.
    -- account_balance: The problem is that some entries in the account_balance column contain text descriptions instead of numeric values. Account balances should typically be numeric values representing monetary amounts. The correct values are the numeric entries like '4192.40' and '5755.94'. The text entries are meaningless in this context and should be removed or replaced with empty strings.
    SELECT
        "supplier_id",
        CASE
            WHEN "supplier_name" = '89eJ5ksX3ImxJQBvxObC' THEN NULL
            WHEN "supplier_name" = 'Bk7ah4CK8SYQTepEmvMkkgMwg' THEN NULL
            WHEN "supplier_name" = 'Gcdm2rJRzl5qlTVzc' THEN NULL
            WHEN "supplier_name" = 'N kD4on9OM Ipw3' THEN NULL
            WHEN "supplier_name" = 'q1' THEN NULL
            ELSE "supplier_name"
        END AS "supplier_name",
        CASE
            WHEN "supplier_address" = 'G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3' THEN NULL
            WHEN "supplier_address" = 'gf0JBoQDd7tgrzrddZ' THEN NULL
            ELSE "supplier_address"
        END AS "supplier_address",
        CASE
            WHEN "nation_id" = '14-606-487-0570' THEN NULL
            WHEN "nation_id" = '15-679-861-2259' THEN NULL
            WHEN "nation_id" = '25-843-787-7479' THEN NULL
            ELSE "nation_id"
        END AS "nation_id",
        CASE
            WHEN "supplier_phone" = '4032.68' THEN NULL
            WHEN "supplier_phone" = '4641.08' THEN NULL
            WHEN "supplier_phone" = '9915.24' THEN NULL
            ELSE "supplier_phone"
        END AS "supplier_phone",
        CASE
            WHEN "supplier_comments" = 'blithely silent requests after the express dependencies are sl' THEN NULL
            WHEN "supplier_comments" = 'each slyly above the careful' THEN NULL
            ELSE "supplier_comments"
        END AS "supplier_comments",
        CASE
            WHEN "account_balance" = 'quickly above the quickly ironic deposits affix' THEN NULL
            WHEN "account_balance" = 'riously even requests above the exp' THEN NULL
            WHEN "account_balance" = 'slyly bold instructions. idle deposi' THEN NULL
            ELSE "account_balance"
        END AS "account_balance"
    FROM "supplier_renamed_trimmed"
),

"supplier_renamed_trimmed_cleaned_casted" AS (
    -- Column Type Casting: 
    -- account_balance: from VARCHAR to DECIMAL
    -- nation_id: from VARCHAR to INT
    SELECT
        "supplier_id",
        "supplier_name",
        "supplier_address",
        "supplier_phone",
        "supplier_comments",
        CAST("account_balance" AS DECIMAL) 
        AS "account_balance",
        CAST("nation_id" AS INT) 
        AS "nation_id"
    FROM "supplier_renamed_trimmed_cleaned"
)

-- COCOON BLOCK END
SELECT *
FROM "supplier_renamed_trimmed_cleaned_casted"

stg_supplier.yml (Document the table)

version: 2
models:
- name: stg_supplier
  description: The table is about suppliers. It contains details of each supplier,
    including their unique key, name, address, nation key, phone number, account balance,
    and a comment. Each row represents a different supplier with their specific information.
    The table appears to be part of a larger database system, likely for managing
    supplier information in a business context.
  columns:
  - name: supplier_id
    description: Unique identifier for the supplier
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each supplier.
        For this table, each row represents a different supplier, and supplier_id
        appears to be unique across rows (e.g., Supplier#000000001, Supplier#000000002).
      data_type:
        current_data_type: VARCHAR
  - name: supplier_name
    description: Name of the supplier
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: VARCHAR
  - name: supplier_address
    description: Address of the supplier
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: VARCHAR
  - name: supplier_phone
    description: Phone number of the supplier
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: VARCHAR
  - name: supplier_comments
    description: Additional comments or notes about the supplier
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: VARCHAR
  - name: account_balance
    description: Account balance of the supplier
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: DECIMAL
  - name: nation_id
    description: Key representing the supplier's nation
    cocoon_meta:
      missing_reason: Unknown
      data_type:
        current_data_type: INT
  tests: []

stg_region

Table Preview (first 5 rows)

region_id region_name random_comment
0 0 AFRICA lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to
1 1 AMERICA hs use ironic even requests. s
2 2 ASIA ges. thinly even pinto beans ca
3 3 EUROPE ly final courts cajole furiously final excuse
4 4 MIDDLE EAST uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl

stg_region.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:27:04.921327+00:00
WITH 
"region_renamed" AS (
    -- Rename: Renaming columns
    -- R_REGIONKEY -> region_id
    -- R_NAME -> region_name
    -- R_COMMENT -> random_comment
    SELECT 
        "R_REGIONKEY" AS "region_id",
        "R_NAME" AS "region_name",
        "R_COMMENT" AS "random_comment"
    FROM "memory"."main"."region"
)

-- COCOON BLOCK END
SELECT *
FROM "region_renamed"

stg_region.yml (Document the table)

version: 2
models:
- name: stg_region
  description: The table is about geographical regions. It contains details of different
    world regions. Each region has a unique key, a name, and a comment. The regions
    included are Africa, America, Asia, Europe, and Middle East. The comments seem
    to be randomly generated text without specific meaning related to the regions.
  columns:
  - name: region_id
    description: Unique identifier for each region
    tests:
    - not_null
    - unique
    cocoon_meta:
      unique_reason: This column represents a unique identifier for each region. For
        this table, each row represents a distinct geographical region. The region_id
        appears to be unique across rows, as it's a standard practice to use unique
        identifiers for such data.
      data_type:
        current_data_type: INT
  - name: region_name
    description: Name of the geographical region
    tests:
    - not_null
    - unique
    - accepted_values:
        values:
        - AFRICA
        - AMERICA
        - ASIA
        - EUROPE
        - MIDDLE EAST
    cocoon_meta:
      unique_reason: This column contains the name of each geographical region. For
        this table, each row represents a distinct region, and the names provided
        (Africa, America, Asia, Europe, Middle East) are unique. However, in a larger
        dataset, there could potentially be regions with the same name (e.g., North
        America and South America both containing "America"), so it's not guaranteed
        to be unique.
      future_accepted_values:
      - OCEANIA
      - ANTARCTICA
      data_type:
        current_data_type: VARCHAR
  - name: random_comment
    description: Random text associated with each region
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: VARCHAR
  tests: []

stg_lineitem

Table Preview (first 5 rows)

shipping_instructions supplier_id extended_price part_id comments order_id quantity return_flag line_status line_number tax_rate discount_rate shipping_mode commit_date receipt_date ship_date
0 DELIVER IN PERSON 93 24710.35 1552 None 1 17 N O 1 0.02 0.04 TRUCK 1996-02-12 1996-03-22 1996-03-13
1 TAKE BACK RETURN 75 56688.00 674 None 1 36 N O 2 0.06 0.09 MAIL 1996-02-28 1996-04-20 1996-04-12
2 TAKE BACK RETURN 99 37850.00 1 None 2 38 N O 1 0.05 0.00 RAIL 1997-01-14 1997-02-02 1997-01-28
3 None 56 8376.00 355 None 3 8 R F 1 0.02 0.07 AIR 1994-01-04 1994-02-23 1994-02-02
4 DELIVER IN PERSON 79 47352.00 278 None 3 42 R F 2 0.04 0.08 TRUCK 1994-02-09 1994-01-27 1994-01-13

stg_lineitem.sql (clean the table)

-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:23:44.499443+00:00
WITH 
"lineitem_renamed" AS (
    -- Rename: Renaming columns
    -- L_ORDERKEY -> order_id
    -- L_PARTKEY -> part_id
    -- L_SUPPKEY -> supplier_id
    -- L_LINENUMBER -> line_number
    -- L_QUANTITY -> quantity
    -- L_EXTENDEDPRICE -> extended_price
    -- L_DISCOUNT -> discount_rate
    -- L_TAX -> tax_rate
    -- L_RETURNFLAG -> return_flag
    -- L_LINESTATUS -> line_status
    -- L_SHIPDATE -> ship_date
    -- L_COMMITDATE -> commit_date
    -- L_RECEIPTDATE -> receipt_date
    -- L_SHIPINSTRUCT -> shipping_instructions
    -- L_SHIPMODE -> shipping_mode
    -- L_COMMENT -> comments
    SELECT 
        "L_ORDERKEY" AS "order_id",
        "L_PARTKEY" AS "part_id",
        "L_SUPPKEY" AS "supplier_id",
        "L_LINENUMBER" AS "line_number",
        "L_QUANTITY" AS "quantity",
        "L_EXTENDEDPRICE" AS "extended_price",
        "L_DISCOUNT" AS "discount_rate",
        "L_TAX" AS "tax_rate",
        "L_RETURNFLAG" AS "return_flag",
        "L_LINESTATUS" AS "line_status",
        "L_SHIPDATE" AS "ship_date",
        "L_COMMITDATE" AS "commit_date",
        "L_RECEIPTDATE" AS "receipt_date",
        "L_SHIPINSTRUCT" AS "shipping_instructions",
        "L_SHIPMODE" AS "shipping_mode",
        "L_COMMENT" AS "comments"
    FROM "memory"."main"."lineitem"
),

"lineitem_renamed_trimmed" AS (
    -- Trim Leading and Trailing Spaces
    SELECT
        "order_id",
        "part_id",
        "supplier_id",
        "line_number",
        "quantity",
        "extended_price",
        "discount_rate",
        "tax_rate",
        "return_flag",
        "line_status",
        "ship_date",
        "commit_date",
        "receipt_date",
        "shipping_instructions",
        "shipping_mode",
        TRIM("comments") AS "comments"
    FROM "lineitem_renamed"
),

"lineitem_renamed_trimmed_cleaned" AS (
    -- Clean unusual string values: 
    -- comments: The problem is that all the values in the comments column are incomplete phrases or sentence fragments without clear meaning or context. They appear to be truncated or partial sentences, making them difficult to interpret or use meaningfully. Since there's no clear pattern or way to reconstruct the full sentences, and the fragments don't carry any coherent meaning on their own, the best approach is to map all these unusual values to an empty string.
    SELECT
        "order_id",
        "part_id",
        "supplier_id",
        "line_number",
        "quantity",
        "extended_price",
        "discount_rate",
        "tax_rate",
        "return_flag",
        "line_status",
        "ship_date",
        "commit_date",
        "receipt_date",
        "shipping_instructions",
        "shipping_mode",
        CASE
            WHEN "comments" = 'blithely bold excuses wake fluffily' THEN NULL
            WHEN "comments" = 'egular courts above the' THEN NULL
            WHEN "comments" = 'ly final dependencies: slyly bold' THEN NULL
            WHEN "comments" = 'ongside of the furiously brave acco' THEN NULL
            WHEN "comments" = 'ven requests. deposits breach a' THEN NULL
            ELSE "comments"
        END AS "comments"
    FROM "lineitem_renamed_trimmed"
),

"lineitem_renamed_trimmed_cleaned_null" AS (
    -- NULL Imputation: Impute Null to Disguised Missing Values
    -- shipping_instructions: ['NONE']
    SELECT 
        CASE
            WHEN "shipping_instructions" = 'NONE' THEN NULL
            ELSE "shipping_instructions"
        END AS "shipping_instructions",
        "receipt_date",
        "supplier_id",
        "extended_price",
        "part_id",
        "comments",
        "order_id",
        "ship_date",
        "quantity",
        "return_flag",
        "line_status",
        "line_number",
        "tax_rate",
        "commit_date",
        "discount_rate",
        "shipping_mode"
    FROM "lineitem_renamed_trimmed_cleaned"
),

"lineitem_renamed_trimmed_cleaned_null_casted" AS (
    -- Column Type Casting: 
    -- commit_date: from VARCHAR to DATE
    -- receipt_date: from VARCHAR to DATE
    -- ship_date: from VARCHAR to DATE
    SELECT
        "shipping_instructions",
        "supplier_id",
        "extended_price",
        "part_id",
        "comments",
        "order_id",
        "quantity",
        "return_flag",
        "line_status",
        "line_number",
        "tax_rate",
        "discount_rate",
        "shipping_mode",
        CAST("commit_date" AS DATE) 
        AS "commit_date",
        CAST("receipt_date" AS DATE) 
        AS "receipt_date",
        CAST("ship_date" AS DATE) 
        AS "ship_date"
    FROM "lineitem_renamed_trimmed_cleaned_null"
)

-- COCOON BLOCK END
SELECT *
FROM "lineitem_renamed_trimmed_cleaned_null_casted"

stg_lineitem.yml (Document the table)

version: 2
models:
- name: stg_lineitem
  description: The table is about order line items. It contains details such as order
    key, part key, supplier key, quantity, price, discount, tax, shipping dates, and
    status. Each row represents a specific item within an order, including its shipping
    instructions, mode, and comments. The table provides a comprehensive view of individual
    items ordered, their pricing, and logistics information.
  columns:
  - name: shipping_instructions
    description: Special shipping instructions
    tests:
    - accepted_values:
        values:
        - DELIVER IN PERSON
        - TAKE BACK RETURN
    cocoon_meta:
      missing_reason: Unknown
      future_accepted_values:
      - SIGNATURE REQUIRED
      - LEAVE AT DOOR
      - CALL BEFORE DELIVERY
      - DO NOT LEAVE UNATTENDED
      - FRAGILE - HANDLE WITH CARE
      - REFRIGERATE UPON ARRIVAL
      - KEEP DRY
      - THIS SIDE UP
      - PERISHABLE
      - HAZARDOUS MATERIALS
      - INSURANCE REQUIRED
      - APPOINTMENT NECESSARY
      - NO WEEKEND DELIVERY
      data_type:
        current_data_type: VARCHAR
  - name: supplier_id
    description: Identifier for the supplier
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: extended_price
    description: Total price for the line item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: part_id
    description: Identifier for the part/product
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: comments
    description: Additional comments about the line item
    cocoon_meta:
      missing_reason: Optional field for additional information, not always needed.
      data_type:
        current_data_type: VARCHAR
  - name: order_id
    description: Unique identifier for the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: quantity
    description: Quantity of the item ordered
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: return_flag
    description: Return status flag
    tests:
    - not_null
    - accepted_values:
        values:
        - N
        - R
    cocoon_meta:
      future_accepted_values:
      - P
      - A
      - C
      data_type:
        current_data_type: VARCHAR
  - name: line_status
    description: Status of the line item
    tests:
    - not_null
    - accepted_values:
        values:
        - O
        - F
    cocoon_meta:
      future_accepted_values:
      - C
      - P
      - B
      - R
      - S
      - H
      data_type:
        current_data_type: VARCHAR
  - name: line_number
    description: Line number of the item within the order
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: INT
  - name: tax_rate
    description: Tax rate applied to the item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: discount_rate
    description: Discount rate applied to the item
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DECIMAL
  - name: shipping_mode
    description: Mode of shipping
    tests:
    - not_null
    - accepted_values:
        values:
        - TRUCK
        - AIR
        - MAIL
        - RAIL
    cocoon_meta:
      future_accepted_values:
      - SHIP
      - SEA
      - EXPRESS
      - FREIGHT
      - COURIER
      data_type:
        current_data_type: VARCHAR
  - name: commit_date
    description: Date the item was committed for shipping
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DATE
  - name: receipt_date
    description: Date the item was received
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DATE
  - name: ship_date
    description: Date the item was shipped
    tests:
    - not_null
    cocoon_meta:
      data_type:
        current_data_type: DATE
  tests: []
We identify the primary key (PK) and foreign key (FK) from tables. We build a join graph that connects FK to PK.

Join Graph (FK to PK)

%3 stg_customer_0 stg_customer stg_nation_1 stg_nation stg_customer_0->stg_nation_1 stg_region_3 stg_region stg_nation_1->stg_region_3 stg_lineitem_2 stg_lineitem stg_order_4 stg_order stg_lineitem_2->stg_order_4 stg_supplier_5 stg_supplier stg_lineitem_2->stg_supplier_5 stg_part_6 stg_part stg_lineitem_2->stg_part_6 stg_order_4->stg_customer_0 stg_supplier_5->stg_nation_1 stg_partsupp_7 stg_partsupp stg_partsupp_7->stg_supplier_5 stg_partsupp_7->stg_part_6

cocoon_join.yml (Document the joins)

join_graph:
- table_name: stg_nation
  primary_key: nation_id
  foreign_keys:
  - column: region_id
    reference:
      table_name: stg_region
      column: region_id
- table_name: stg_customer
  foreign_keys:
  - column: nation_id
    reference:
      table_name: stg_nation
      column: nation_id
    referential_integrity:
      Orphan: 100.00%
      explanation: ''
  primary_key: customer_id
- table_name: stg_supplier
  foreign_keys:
  - column: nation_id
    reference:
      table_name: stg_nation
      column: nation_id
    referential_integrity:
      Orphan: 50.00%
      explanation: ''
  primary_key: supplier_id
- table_name: stg_order
  primary_key: order_id
  foreign_keys:
  - column: customer_id
    reference:
      table_name: stg_customer
      column: customer_id
  time_keys:
  - order_date
- table_name: stg_lineitem
  foreign_keys:
  - column: order_id
    reference:
      table_name: stg_order
      column: order_id
  - column: part_id
    reference:
      table_name: stg_part
      column: part_id
    referential_integrity:
      Orphan: 80.00%
      explanation: ''
  - column: supplier_id
    reference:
      table_name: stg_supplier
      column: supplier_id
  time_keys:
  - commit_date
  - receipt_date
  - ship_date
- table_name: stg_part
  primary_key: part_id
  foreign_keys: []
- table_name: stg_partsupp
  foreign_keys:
  - column: part_key
    reference:
      table_name: stg_part
      column: part_id
  - column: supplier_key
    reference:
      table_name: stg_supplier
      column: supplier_id
- table_name: stg_region
  primary_key: region_id
  foreign_keys: []
We illustrate the step-by-step process behind the data.

All

Process Story

cocoon_er.yml (Document the entity relationships)

groups: []
entities:
- entity_name: Nations
  entity_description: Represents countries with their unique identifiers, names, region
    associations, and brief descriptions.
  table_name: stg_nation
  primary_key: nation_id
- entity_name: Orders
  entity_description: Represents customer orders with details such as order ID, customer
    information, financial data, dates, and processing information.
  table_name: stg_order
  primary_key: order_id
- entity_name: Parts
  entity_description: Represents product parts with their unique identifiers, names,
    specifications, pricing, and other attributes.
  table_name: stg_part
  primary_key: part_id
- entity_name: Suppliers
  entity_description: Represents suppliers with their unique identifiers, contact
    information, location details, and financial data.
  table_name: stg_supplier
  primary_key: supplier_id
- entity_name: Regions
  entity_description: Represents geographical regions with their unique identifiers,
    names, and associated comments.
  table_name: stg_region
  primary_key: region_id
- entity_name: Customers
  entity_description: Represents customers with their unique identifiers, contact
    information, financial data, market segments, and other relevant details.
  table_name: stg_customer
  primary_key: customer_id
relations:
- relation_name: NationRegionAssociation
  relation_description: Nations belong to specific Regions, with each Nation associated
    with one Region through a region_id.
  table_name: stg_nation
  entities:
  - Nations
  - Regions
- relation_name: CustomerOrders
  relation_description: This captures the Orders placed by Customers, including order
    details, status, and pricing information.
  table_name: stg_order
  entities:
  - Orders
  - Customers
- relation_name: SupplierNationAffiliation
  relation_description: Suppliers are associated with Nations, indicating the country
    where each supplier is based or operates from.
  table_name: stg_supplier
  entities:
  - Suppliers
  - Nations
- relation_name: CustomerNationAssociation
  relation_description: This stores the Customers and their associated Nations, where
    each Customer belongs to a specific Nation.
  table_name: stg_customer
  entities:
  - Customers
  - Nations
- relation_name: OrderLineItems
  relation_description: Suppliers provide Parts that are included in Orders as line
    items with specific quantities, prices, and shipping details.
  table_name: stg_lineitem
  entities:
  - Orders
  - Parts
  - Suppliers
- relation_name: PartSupplierInventory
  relation_description: This represents the availability and pricing of Parts from
    specific Suppliers, including quantity and supply cost details.
  table_name: stg_partsupp
  entities:
  - Parts
  - Suppliers
story:
- name: NationRegionAssociation
  description: Nations group into regions for economic cooperation.
  type: relation
- name: SupplierNationAffiliation
  description: Suppliers register their operations in specific nations.
  type: relation
- name: CustomerNationAssociation
  description: Customers provide their national location for business.
  type: relation
- name: PartSupplierInventory
  description: Suppliers list available parts with prices and quantities.
  type: relation
- name: CustomerOrders
  description: Customers place orders for desired products.
  type: relation
- name: OrderLineItems
  description: Orders detail parts, quantities, and chosen suppliers.
  type: relation