customer
Table Preview(first 5 rows)
C_CUSTKEY | C_NAME | C_ADDRESS | C_NATIONKEY | C_PHONE | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT | |
---|---|---|---|---|---|---|---|---|
0 | IVhzIApeRb ot | c | E | 15 | 9-567-846-8931 | 711.56 | HOUSEHOLD | regular deposits detect blithely agai |
1 | XSTf4 | NCwDVaWNe6tEgvwfmRchLXak | 13 | 23-768-687-3665 | 121.65 | AUTOMOBILE | pending accounts haggle furiously deposit | None |
2 | MG9kdTD2WBHm | 1 | 11-719-748-3364 | 7498.12 | AUTOMOBILE | special packages hang ironic | silent packages | None |
3 | XxVSJsLAGtn | 4 | 14-128-190-5944 | 2866.83 | MACHINERY | pending requests wake carefully express | even accounts | None |
4 | KvpyuHCplrB84WgAiGV6sYpZq7Tj | 3 | 13-750-942-6364 | 794.47 | HOUSEHOLD | accounts wake furiously even instructions | None | None |
lineitem
Table Preview(first 5 rows)
L_ORDERKEY | L_PARTKEY | L_SUPPKEY | L_LINENUMBER | L_QUANTITY | L_EXTENDEDPRICE | L_DISCOUNT | L_TAX | L_RETURNFLAG | L_LINESTATUS | L_SHIPDATE | L_COMMITDATE | L_RECEIPTDATE | L_SHIPINSTRUCT | L_SHIPMODE | L_COMMENT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1552 | 93 | 1 | 17 | 24710.35 | 0.04 | 0.02 | N | O | 1996-03-13 | 1996-02-12 | 1996-03-22 | DELIVER IN PERSON | TRUCK | egular courts above the |
1 | 1 | 674 | 75 | 2 | 36 | 56688.00 | 0.09 | 0.06 | N | O | 1996-04-12 | 1996-02-28 | 1996-04-20 | TAKE BACK RETURN | ly final dependencies: slyly bold | |
2 | 2 | 1 | 99 | 1 | 38 | 37850.00 | 0.00 | 0.05 | N | O | 1997-01-28 | 1997-01-14 | 1997-02-02 | TAKE BACK RETURN | RAIL | ven requests. deposits breach a |
3 | 3 | 355 | 56 | 1 | 8 | 8376.00 | 0.07 | 0.02 | R | F | 1994-02-02 | 1994-01-04 | 1994-02-23 | NONE | AIR | ongside of the furiously brave acco |
4 | 3 | 278 | 79 | 2 | 42 | 47352.00 | 0.08 | 0.04 | R | F | 1994-01-13 | 1994-02-09 | 1994-01-27 | DELIVER IN PERSON | TRUCK | blithely bold excuses wake fluffily |
nation
Table Preview(first 5 rows)
N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT | |
---|---|---|---|---|
0 | 0 | ALGERIA | 0 | final accounts wake quickly. special reques |
1 | 1 | ARGENTINA | 1 | idly final instructions cajole stealthily. regular instructions wake carefully blithely express accounts. fluffily bold excuses |
2 | 2 | BRAZIL | 1 | always pending pinto beans sleep sil |
3 | 3 | CANADA | 1 | foxes among the bold requests ideally blithely final dependencies. furiously express accounts wake blithely |
4 | 4 | EGYPT | 4 | fluffily unusual packages wake blithe |
order
Table Preview(first 5 rows)
O_ORDERKEY | O_CUSTKEY | O_ORDERSTATUS | O_TOTALPRICE | O_ORDERDATE | O_ORDERPRIORITY | O_CLERK | O_SHIPPRIORITY | O_COMMENT | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 370 | O | 172799.49 | 1996-01-02 | 5-LOW | Clerk#000000951 | 0 | nstructions sleep furiously among |
1 | 2 | 781 | O | 38426.09 | 1996-12-01 | 1-URGENT | Clerk#000000880 | 0 | foxes. pending accounts at the pending silent asymptot |
2 | 3 | 1234 | F | 205654.30 | 1993-10-14 | 5-LOW | Clerk#000000955 | 0 | sly final accounts boost. carefully regular ideas cajole carefully. depos |
3 | 4 | 1369 | O | 56000.91 | 1995-10-11 | 5-LOW | Clerk#000000124 | 0 | sits. slyly regular warthogs cajole. regular regular theodolites acro |
4 | 5 | 445 | F | 105367.67 | 1994-07-30 | 5-LOW | Clerk#000000925 | 0 | quickly. bold deposits sleep slyly. packages use slyly |
part
Table Preview(first 5 rows)
P_PARTKEY | P_NAME | P_MFGR | P_BRAND | P_TYPE | P_SIZE | P_CONTAINER | P_RETAILPRICE | P_COMMENT | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901.0 | ly. slyly ironi |
1 | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902.0 | lar accounts amo |
2 | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903.0 | egular deposits hag |
3 | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904.0 | p furiously r |
4 | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905.0 | wake carefully |
partsupp
Table Preview(first 5 rows)
PS_PARTKEY | PS_SUPPKEY | PS_AVAILQTY | PS_SUPPLYCOST | PS_COMMENT | |
---|---|---|---|---|---|
0 | 1 | 1 | 3325 | 771.64 | final theodolites. special packages above the quickly ironic waters cajo |
1 | 1 | 2 | 8076 | 993.49 | ven ideas. quickly even packages print. pending multipliers |
2 | 2 | 3 | 3956 | 337.09 | fluffily ironic deposits are quickly pending accounts. furiously final foxes are slyly. silent requests sleep quickly ironic ideas |
3 | 2 | 4 | 4069 | 357.84 | carefully regular excuses. carefully silent pinto beans sleep. quickly even deposits sleep quickly furiously pending dolphins |
4 | 3 | 5 | 4673 | 647.59 | quickly special packages sleep carefully above the ironic deposits |
region
Table Preview(first 5 rows)
R_REGIONKEY | R_NAME | R_COMMENT | |
---|---|---|---|
0 | 0 | AFRICA | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1 | 1 | AMERICA | hs use ironic even requests. s |
2 | 2 | ASIA | ges. thinly even pinto beans ca |
3 | 3 | EUROPE | ly final courts cajole furiously final excuse |
4 | 4 | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl |
supplier
Table Preview(first 5 rows)
S_SUPPKEY | S_NAME | S_ADDRESS | S_NATIONKEY | S_PHONE | S_ACCTBAL | S_COMMENT | |
---|---|---|---|---|---|---|---|
0 | Supplier#000000001 | N kD4on9OM Ipw3 | gf0JBoQDd7tgrzrddZ | 17 | 27-918-335-1736 | 5755.94 | each slyly above the careful |
1 | Supplier#000000002 | 89eJ5ksX3ImxJQBvxObC | 5 | 15-679-861-2259 | 4032.68 | slyly bold instructions. idle deposi | None |
2 | Supplier#000000003 | q1 | G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3 | 1 | 11-383-516-1199 | 4192.40 | blithely silent requests after the express dependencies are sl |
3 | Supplier#000000004 | Bk7ah4CK8SYQTepEmvMkkgMwg | 15 | 25-843-787-7479 | 4641.08 | riously even requests above the exp | None |
4 | Supplier#000000005 | Gcdm2rJRzl5qlTVzc | 4 | 14-606-487-0570 | 9915.24 | quickly above the quickly ironic deposits affix | None |
stg_partsupp
Table Preview (first 5 rows)
part_key | supplier_key | available_quantity | supply_cost | comment | |
---|---|---|---|---|---|
0 | 1 | 1 | 3325 | 771.64 | final theodolites. special packages above the quickly ironic waters cajo |
1 | 1 | 2 | 8076 | 993.49 | ven ideas. quickly even packages print. pending multipliers |
2 | 2 | 3 | 3956 | 337.09 | fluffily ironic deposits are quickly pending accounts. furiously final foxes are slyly. silent requests sleep quickly ironic ideas |
3 | 2 | 4 | 4069 | 357.84 | carefully regular excuses. carefully silent pinto beans sleep. quickly even deposits sleep quickly furiously pending dolphins |
4 | 3 | 5 | 4673 | 647.59 | quickly special packages sleep carefully above the ironic deposits |
stg_partsupp.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:26:38.079237+00:00
WITH
"partsupp_renamed" AS (
-- Rename: Renaming columns
-- PS_PARTKEY -> part_key
-- PS_SUPPKEY -> supplier_key
-- PS_AVAILQTY -> available_quantity
-- PS_SUPPLYCOST -> supply_cost
-- PS_COMMENT -> comment
SELECT
"PS_PARTKEY" AS "part_key",
"PS_SUPPKEY" AS "supplier_key",
"PS_AVAILQTY" AS "available_quantity",
"PS_SUPPLYCOST" AS "supply_cost",
"PS_COMMENT" AS "comment"
FROM "memory"."main"."partsupp"
)
-- COCOON BLOCK END
SELECT *
FROM "partsupp_renamed"
stg_partsupp.yml (Document the table)
version: 2
models:
- name: stg_partsupp
description: The table is the PartSupplier relation between parts and suppliers.
It contains information about the availability and cost of parts from specific
suppliers. Each row represents a part-supplier combination with details like available
quantity, supply cost, and a comment. The primary keys are PS_PARTKEY and PS_SUPPKEY,
linking to the Part and Supplier tables respectively.
columns:
- name: part_key
description: Part key, links to Part table
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: supplier_key
description: Supplier key, links to Supplier table
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: available_quantity
description: Available quantity of the part from the supplier
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: supply_cost
description: Cost of the part from the supplier
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: comment
description: Additional comments or notes about the part-supplier relationship
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
tests: []
stg_part
Table Preview (first 5 rows)
part_id | part_name | manufacturer | brand | part_type | size_ | packaging | retail_price | comments | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901.0 | None |
1 | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902.0 | None |
2 | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903.0 | None |
3 | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904.0 | None |
4 | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905.0 | None |
stg_part.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:26:17.331735+00:00
WITH
"part_renamed" AS (
-- Rename: Renaming columns
-- P_PARTKEY -> part_id
-- P_NAME -> part_name
-- P_MFGR -> manufacturer
-- P_BRAND -> brand
-- P_TYPE -> part_type
-- P_SIZE -> size_
-- P_CONTAINER -> packaging
-- P_RETAILPRICE -> retail_price
-- P_COMMENT -> comments
SELECT
"P_PARTKEY" AS "part_id",
"P_NAME" AS "part_name",
"P_MFGR" AS "manufacturer",
"P_BRAND" AS "brand",
"P_TYPE" AS "part_type",
"P_SIZE" AS "size_",
"P_CONTAINER" AS "packaging",
"P_RETAILPRICE" AS "retail_price",
"P_COMMENT" AS "comments"
FROM "memory"."main"."part"
),
"part_renamed_cleaned" AS (
-- Clean unusual string values:
-- comments: The problem is that all values in the comments column are incomplete fragments of sentences, making their meaning unclear. These fragments appear to be cut-off portions of longer comments. Without the full context, it's impossible to determine the intended complete sentences. The correct approach is to treat these as corrupted data and map them to empty strings, as we cannot accurately reconstruct the original comments.
SELECT
"part_id",
"part_name",
"manufacturer",
"brand",
"part_type",
"size_",
"packaging",
"retail_price",
CASE
WHEN "comments" = 'egular deposits hag' THEN NULL
WHEN "comments" = 'lar accounts amo' THEN NULL
WHEN "comments" = 'ly. slyly ironi' THEN NULL
WHEN "comments" = 'p furiously r' THEN NULL
WHEN "comments" = 'wake carefully' THEN NULL
ELSE "comments"
END AS "comments"
FROM "part_renamed"
)
-- COCOON BLOCK END
SELECT *
FROM "part_renamed_cleaned"
stg_part.yml (Document the table)
version: 2
models:
- name: stg_part
description: The table is about product parts. It includes details such as part
key, name, manufacturer, brand, type, size, container, retail price, and a comment.
Each row represents a unique part with its specific attributes. The parts vary
in characteristics like color, material, and packaging type. Pricing seems to
increase slightly with each part key.
columns:
- name: part_id
description: Unique identifier for the part
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each part. For
this table, each row represents a unique part. part_id is unique across rows
and incrementing, suggesting it's designed to be a primary key.
data_type:
current_data_type: INT
- name: part_name
description: Descriptive name of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: manufacturer
description: Manufacturer of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: brand
description: Brand of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: part_type
description: Type and material of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: size_
description: Size of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: packaging
description: Container or packaging type for the part
tests:
- not_null
- accepted_values:
values:
- JUMBO PKG
- LG CASE
- MED DRUM
- SM PKG
- WRAP CASE
cocoon_meta:
future_accepted_values:
- XL DRUM
- PALLET
- CRATE
- BOX
- CARTON
- TUBE
- BAG
- ROLL
- BUNDLE
data_type:
current_data_type: VARCHAR
- name: retail_price
description: Retail price of the part
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: comments
description: Additional comments about the part
cocoon_meta:
missing_reason: Optional field for additional information, not always needed.
data_type:
current_data_type: VARCHAR
tests: []
stg_nation
Table Preview (first 5 rows)
nation_id | country_name | region_id | country_description | |
---|---|---|---|---|
0 | 0 | ALGERIA | 0 | final accounts wake quickly. special reques |
1 | 1 | ARGENTINA | 1 | idly final instructions cajole stealthily. regular instructions wake carefully blithely express accounts. fluffily bold excuses |
2 | 2 | BRAZIL | 1 | always pending pinto beans sleep sil |
3 | 3 | CANADA | 1 | foxes among the bold requests ideally blithely final dependencies. furiously express accounts wake blithely |
4 | 4 | EGYPT | 4 | fluffily unusual packages wake blithe |
stg_nation.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:24:16.330008+00:00
WITH
"nation_renamed" AS (
-- Rename: Renaming columns
-- N_NATIONKEY -> nation_id
-- N_NAME -> country_name
-- N_REGIONKEY -> region_id
-- N_COMMENT -> country_description
SELECT
"N_NATIONKEY" AS "nation_id",
"N_NAME" AS "country_name",
"N_REGIONKEY" AS "region_id",
"N_COMMENT" AS "country_description"
FROM "memory"."main"."nation"
)
-- COCOON BLOCK END
SELECT *
FROM "nation_renamed"
stg_nation.yml (Document the table)
version: 2
models:
- name: stg_nation
description: The table is about nations, with details of each country. It includes
a unique nation key, the country's name, a region key (likely linking to a separate
region table), and a brief comment about the country. The comment seems to be
a random phrase or description, possibly for testing or placeholder purposes.
columns:
- name: nation_id
description: Unique identifier for each nation
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each nation. For
this table, each row represents a distinct country, and nation_id appears
to be unique across rows.
data_type:
current_data_type: INT
- name: country_name
description: Name of the country
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column contains the name of each country. While country
names are generally unique in the real world, there could potentially be duplicate
names or variations. However, based on the given sample, it appears to be
unique.
data_type:
current_data_type: VARCHAR
- name: region_id
description: Foreign key linking to a region table
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: country_description
description: Brief, seemingly random description about the country
tests:
- not_null
cocoon_meta:
unusual_values: Values are unusual because they are incomplete sentences, likely
truncated from longer descriptions.
data_type:
current_data_type: VARCHAR
tests: []
stg_customer
Table Preview (first 5 rows)
customer_id | customer_name | address | phone_number | comments | market_segment | account_balance | nation_id | |
---|---|---|---|---|---|---|---|---|
0 | IVhzIApeRbot | None | None | 9-567-846-8931 | regular deposits detect blithely agai | HOUSEHOLD | 711.56 | 15.0 |
1 | XSTf4000000000000000000000000 | None | None | None | None | None | NaN | NaN |
2 | MG9kdTD2WBHm0000000000000000 | None | 11-719-748-3364 | None | None | None | NaN | NaN |
3 | XxVSJsLAGtn00000000000000000 | None | 14-128-190-5944 | None | None | None | NaN | NaN |
4 | KvpyuHCplrB84WgAiGV6sYpZq7Tj | None | 13-750-942-6364 | None | None | None | NaN | NaN |
stg_customer.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:22:15.794717+00:00
WITH
"customer_renamed" AS (
-- Rename: Renaming columns
-- C_CUSTKEY -> customer_id
-- C_NAME -> customer_name
-- C_ADDRESS -> address
-- C_NATIONKEY -> nation_id
-- C_PHONE -> phone_number
-- C_ACCTBAL -> account_balance
-- C_MKTSEGMENT -> market_segment
-- C_COMMENT -> comments
SELECT
"C_CUSTKEY" AS "customer_id",
"C_NAME" AS "customer_name",
"C_ADDRESS" AS "address",
"C_NATIONKEY" AS "nation_id",
"C_PHONE" AS "phone_number",
"C_ACCTBAL" AS "account_balance",
"C_MKTSEGMENT" AS "market_segment",
"C_COMMENT" AS "comments"
FROM "memory"."main"."customer"
),
"customer_renamed_trimmed" AS (
-- Trim Leading and Trailing Spaces
SELECT
"customer_id",
"customer_name",
"address",
"nation_id",
"phone_number",
"account_balance",
"comments",
TRIM("market_segment") AS "market_segment"
FROM "customer_renamed"
),
"customer_renamed_trimmed_cleaned" AS (
-- Clean unusual string values:
-- customer_id: The problem is that the customer_id values have inconsistent patterns and lengths. The most unusual value is 'IVhzIApeRb ot' because it contains a space, which is atypical for customer IDs. The other values have varying lengths and character patterns. The correct values should follow a consistent format, ideally without spaces and with a standard length. Since 'KvpyuHCplrB84WgAiGV6sYpZq7Tj' is the longest and most complex ID, it appears to be the most likely format for a proper customer ID.
-- customer_name: The problem is that the customer_name column contains values that are not typical names. The values '1', '3', '4', and 'c' appear to be placeholders or data entry errors. 'NCwDVaWNe6tEgvwfmRchLXak' looks like a random string, possibly an ID that was mistakenly placed in the name column. None of these are valid customer names. The correct approach would be to replace these invalid entries with an empty string, indicating missing data, as we don't have enough information to determine the actual customer names.
-- address: The problem is that two values ('E' and '13') do not follow the phone number format of the other entries. The correct values should be in the format XX-XXX-XXX-XXXX, where X represents digits. The value 'E' is completely invalid as a phone number, while '13' is likely a partial or incorrect entry.
-- nation_id: The nation_id column contains inconsistent data types and formats. '15' appears to be a valid numeric ID, while '23-768-687-3665' looks like a phone number, and '2866.83', '7498.12', '794.47' are decimal numbers. These inconsistencies suggest data entry errors or misplaced information from other columns. Since '15' is the only value that resembles a typical integer ID format and is the most frequent, we'll consider it as the correct format for nation_id. The other values should be treated as errors and mapped to an empty string, as they don't contain valid nation ID information.
-- phone_number: The problem is that the phone_number column contains values that are not phone numbers. The value '121.65' appears to be a decimal number, while 'AUTOMOBILE', 'HOUSEHOLD', and 'MACHINERY' seem to be category names, possibly from another column. Only '9-567-846-8931' resembles a valid phone number format. The correct values should all be phone numbers or empty strings if the actual phone number is unknown.
-- account_balance: The problem is that the account_balance column contains non-numeric values which are inappropriate for an account balance. The only correct value is '711.56', which is a numeric balance. The other values are either a vehicle type ('AUTOMOBILE') or nonsensical sentences, which have no place in an account balance field. The correct values should all be numeric account balances or empty strings if the balance is unknown.
-- market_segment: The problem is that the market_segment column contains some unusual and grammatically incorrect values. 'HOUSEHOLD' appears to be a valid market segment descriptor, while 'even accounts', 'pending accounts haggle furiously deposit', and 'silent packages' are not typical market segment descriptors and contain grammatical issues or nonsensical phrases. The correct values should be standardized market segment descriptors. Since 'HOUSEHOLD' is the only clear and valid segment, we'll map the unusual values to empty strings, as we don't have enough information to accurately categorize them into other segments.
SELECT
CASE
WHEN "customer_id" = 'IVhzIApeRb ot' THEN 'IVhzIApeRbot'
WHEN "customer_id" = 'XSTf4' THEN 'XSTf4000000000000000000000000'
WHEN "customer_id" = 'XxVSJsLAGtn' THEN 'XxVSJsLAGtn00000000000000000'
WHEN "customer_id" = 'MG9kdTD2WBHm' THEN 'MG9kdTD2WBHm0000000000000000'
ELSE "customer_id"
END AS "customer_id",
CASE
WHEN "customer_name" = '1' THEN NULL
WHEN "customer_name" = '3' THEN NULL
WHEN "customer_name" = '4' THEN NULL
WHEN "customer_name" = 'c' THEN NULL
WHEN "customer_name" = 'NCwDVaWNe6tEgvwfmRchLXak' THEN NULL
ELSE "customer_name"
END AS "customer_name",
CASE
WHEN "address" = 'E' THEN NULL
WHEN "address" = '13' THEN NULL
ELSE "address"
END AS "address",
CASE
WHEN "nation_id" = '23-768-687-3665' THEN NULL
WHEN "nation_id" = '2866.83' THEN NULL
WHEN "nation_id" = '7498.12' THEN NULL
WHEN "nation_id" = '794.47' THEN NULL
ELSE "nation_id"
END AS "nation_id",
CASE
WHEN "phone_number" = '121.65' THEN NULL
WHEN "phone_number" = 'AUTOMOBILE' THEN NULL
WHEN "phone_number" = 'HOUSEHOLD' THEN NULL
WHEN "phone_number" = 'MACHINERY' THEN NULL
ELSE "phone_number"
END AS "phone_number",
CASE
WHEN "account_balance" = 'AUTOMOBILE' THEN NULL
WHEN "account_balance" = 'accounts wake furiously even instructions' THEN NULL
WHEN "account_balance" = 'pending requests wake carefully express' THEN NULL
WHEN "account_balance" = 'special packages hang ironic' THEN NULL
ELSE "account_balance"
END AS "account_balance",
"comments",
CASE
WHEN "market_segment" = 'even accounts' THEN NULL
WHEN "market_segment" = 'pending accounts haggle furiously deposit' THEN NULL
WHEN "market_segment" = 'silent packages' THEN NULL
ELSE "market_segment"
END AS "market_segment"
FROM "customer_renamed_trimmed"
),
"customer_renamed_trimmed_cleaned_casted" AS (
-- Column Type Casting:
-- account_balance: from VARCHAR to DECIMAL
-- nation_id: from VARCHAR to INT
SELECT
"customer_id",
"customer_name",
"address",
"phone_number",
"comments",
"market_segment",
CAST("account_balance" AS DECIMAL)
AS "account_balance",
CAST("nation_id" AS INT)
AS "nation_id"
FROM "customer_renamed_trimmed_cleaned"
)
-- COCOON BLOCK END
SELECT *
FROM "customer_renamed_trimmed_cleaned_casted"
stg_customer.yml (Document the table)
version: 2
models:
- name: stg_customer
description: The table is about customers. It contains customer details like customer
key, name, address, nation key, phone number, account balance, market segment,
and comments. Each row represents a unique customer with their associated information.
The table appears to be part of a larger database, possibly for a business tracking
customer information across different market segments and nations.
columns:
- name: customer_id
description: Unique identifier for each customer
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each customer.
For this table, each row is for a specific customer, and customer_id is unique
across rows.
data_type:
current_data_type: VARCHAR
- name: customer_name
description: Name of the customer
cocoon_meta:
missing_reason: Customer ID is used instead of name for identification.
data_type:
current_data_type: VARCHAR
- name: address
description: Physical address of the customer
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: VARCHAR
- name: phone_number
description: Customer's phone number
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: VARCHAR
- name: comments
description: Additional comments or notes about the customer
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: VARCHAR
- name: market_segment
description: Market segment the customer belongs to
tests:
- accepted_values:
values:
- HOUSEHOLD
cocoon_meta:
missing_reason: Unknown
future_accepted_values:
- CORPORATE
- SMALL BUSINESS
- GOVERNMENT
- EDUCATION
- NON-PROFIT
- HEALTHCARE
- RETAIL
- MANUFACTURING
- FINANCIAL SERVICES
- TECHNOLOGY
- HOSPITALITY
- TRANSPORTATION
- ENERGY
- AGRICULTURE
- CONSTRUCTION
data_type:
current_data_type: VARCHAR
- name: account_balance
description: Customer's account balance
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: DECIMAL
- name: nation_id
description: Identifier for the customer's nation
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: INT
tests: []
stg_order
Table Preview (first 5 rows)
order_id | customer_id | order_status | total_price | order_priority | clerk_id | shipping_priority | order_comments | order_date | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 370 | O | 172799.49 | 5-LOW | Clerk#000000951 | 0 | nstructions sleep furiously among | 1996-01-02 |
1 | 2 | 781 | O | 38426.09 | 1-URGENT | Clerk#000000880 | 0 | foxes. pending accounts at the pending silent asymptot | 1996-12-01 |
2 | 3 | 1234 | F | 205654.30 | 5-LOW | Clerk#000000955 | 0 | sly final accounts boost. carefully regular ideas cajole carefully. depos | 1993-10-14 |
3 | 4 | 1369 | O | 56000.91 | 5-LOW | Clerk#000000124 | 0 | sits. slyly regular warthogs cajole. regular regular theodolites acro | 1995-10-11 |
4 | 5 | 445 | F | 105367.67 | 5-LOW | Clerk#000000925 | 0 | quickly. bold deposits sleep slyly. packages use slyly | 1994-07-30 |
stg_order.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:25:13.064323+00:00
WITH
"order_renamed" AS (
-- Rename: Renaming columns
-- O_ORDERKEY -> order_id
-- O_CUSTKEY -> customer_id
-- O_ORDERSTATUS -> order_status
-- O_TOTALPRICE -> total_price
-- O_ORDERDATE -> order_date
-- O_ORDERPRIORITY -> order_priority
-- O_CLERK -> clerk_id
-- O_SHIPPRIORITY -> shipping_priority
-- O_COMMENT -> order_comments
SELECT
"O_ORDERKEY" AS "order_id",
"O_CUSTKEY" AS "customer_id",
"O_ORDERSTATUS" AS "order_status",
"O_TOTALPRICE" AS "total_price",
"O_ORDERDATE" AS "order_date",
"O_ORDERPRIORITY" AS "order_priority",
"O_CLERK" AS "clerk_id",
"O_SHIPPRIORITY" AS "shipping_priority",
"O_COMMENT" AS "order_comments"
FROM "memory"."main"."order"
),
"order_renamed_casted" AS (
-- Column Type Casting:
-- order_date: from VARCHAR to DATE
SELECT
"order_id",
"customer_id",
"order_status",
"total_price",
"order_priority",
"clerk_id",
"shipping_priority",
"order_comments",
CAST("order_date" AS DATE)
AS "order_date"
FROM "order_renamed"
)
-- COCOON BLOCK END
SELECT *
FROM "order_renamed_casted"
stg_order.yml (Document the table)
version: 2
models:
- name: stg_order
description: The table is about orders. It includes details such as order key, customer
key, order status, total price, order date, priority, clerk, shipping priority,
and comments. Each row represents a unique order with its associated information.
The table captures various aspects of orders placed by customers, including financial
details, timing, and processing information.
columns:
- name: order_id
description: Unique identifier for each order
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each order. For
this table, each row represents a distinct order, and order_id is unique across
rows.
data_type:
current_data_type: INT
- name: customer_id
description: Customer identifier associated with the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: order_status
description: Current status of the order
tests:
- not_null
- accepted_values:
values:
- O
- F
cocoon_meta:
future_accepted_values:
- P
- S
- D
- C
- R
data_type:
current_data_type: VARCHAR
- name: total_price
description: Total price of the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: order_priority
description: Priority level of the order
tests:
- not_null
- accepted_values:
values:
- 5-LOW
- 1-URGENT
cocoon_meta:
future_accepted_values:
- 2-HIGH
- 3-MEDIUM
- 4-NORMAL
data_type:
current_data_type: VARCHAR
- name: clerk_id
description: Identifier of the clerk processing the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: shipping_priority
description: Shipping priority of the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: order_comments
description: Additional comments or notes about the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
- name: order_date
description: Date when the order was placed
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DATE
tests: []
stg_supplier
Table Preview (first 5 rows)
supplier_id | supplier_name | supplier_address | supplier_phone | supplier_comments | account_balance | nation_id | |
---|---|---|---|---|---|---|---|
0 | Supplier#000000001 | None | None | 27-918-335-1736 | None | 5755.94 | 17.0 |
1 | Supplier#000000002 | None | 5 | None | None | NaN | NaN |
2 | Supplier#000000003 | None | None | 11-383-516-1199 | None | 4192.40 | 1.0 |
3 | Supplier#000000004 | None | 15 | None | None | NaN | NaN |
4 | Supplier#000000005 | None | 4 | None | None | NaN | NaN |
stg_supplier.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:28:27.866538+00:00
WITH
"supplier_renamed" AS (
-- Rename: Renaming columns
-- S_SUPPKEY -> supplier_id
-- S_NAME -> supplier_name
-- S_ADDRESS -> supplier_address
-- S_NATIONKEY -> nation_id
-- S_PHONE -> supplier_phone
-- S_ACCTBAL -> account_balance
-- S_COMMENT -> supplier_comments
SELECT
"S_SUPPKEY" AS "supplier_id",
"S_NAME" AS "supplier_name",
"S_ADDRESS" AS "supplier_address",
"S_NATIONKEY" AS "nation_id",
"S_PHONE" AS "supplier_phone",
"S_ACCTBAL" AS "account_balance",
"S_COMMENT" AS "supplier_comments"
FROM "memory"."main"."supplier"
),
"supplier_renamed_trimmed" AS (
-- Trim Leading and Trailing Spaces
SELECT
"supplier_id",
"supplier_name",
"supplier_address",
"nation_id",
"supplier_phone",
"supplier_comments",
TRIM("account_balance") AS "account_balance"
FROM "supplier_renamed"
),
"supplier_renamed_trimmed_cleaned" AS (
-- Clean unusual string values:
-- supplier_name: The problem is that all values in the supplier_name column appear to be encrypted, hashed, or randomly generated strings. These do not resemble typical supplier names, which would usually be recognizable company or individual names. Without additional information to decode these strings, it's impossible to determine the correct supplier names. Therefore, the best approach is to map all these unusual values to an empty string, indicating that the actual supplier name is unknown or unavailable.
-- supplier_address: The problem is that the supplier_address column contains unusual values. The values '15', '4', and '5' are unusually short for addresses and likely represent incomplete data. The values 'G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3' and 'gf0JBoQDd7tgrzrddZ' appear to be random strings rather than valid addresses. These random strings are meaningless and should be replaced with empty strings to indicate missing data. The short numeric values should be kept as is, as they might represent valid partial address information (like building numbers).
-- nation_id: The problem is that some values in the nation_id column resemble phone numbers (14-606-487-0570, 15-679-861-2259, 25-843-787-7479) rather than typical nation IDs. The correct values should be short numeric codes. The most frequent valid value is '1', and there's also a '17', which follows the same pattern of being a short numeric code. We'll map the phone number-like values to empty strings as they don't seem to represent valid nation IDs.
-- supplier_phone: The problem is that the supplier_phone column contains three decimal numbers (4032.68, 4641.08, and 9915.24) which are not valid phone numbers. The correct values should be in the format of xx-xxx-xxx-xxxx, as seen in the two most frequent entries. Since these decimal numbers are meaningless in the context of phone numbers, they should be mapped to an empty string.
-- supplier_comments: The problem is that the values in the supplier_comments column are incomplete sentences that seem to be truncated. They lack proper context and meaning. The correct values should be complete sentences or meaningful phrases related to supplier comments. However, without more context or the full dataset, it's difficult to determine what the complete sentences should be.
-- account_balance: The problem is that some entries in the account_balance column contain text descriptions instead of numeric values. Account balances should typically be numeric values representing monetary amounts. The correct values are the numeric entries like '4192.40' and '5755.94'. The text entries are meaningless in this context and should be removed or replaced with empty strings.
SELECT
"supplier_id",
CASE
WHEN "supplier_name" = '89eJ5ksX3ImxJQBvxObC' THEN NULL
WHEN "supplier_name" = 'Bk7ah4CK8SYQTepEmvMkkgMwg' THEN NULL
WHEN "supplier_name" = 'Gcdm2rJRzl5qlTVzc' THEN NULL
WHEN "supplier_name" = 'N kD4on9OM Ipw3' THEN NULL
WHEN "supplier_name" = 'q1' THEN NULL
ELSE "supplier_name"
END AS "supplier_name",
CASE
WHEN "supplier_address" = 'G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3' THEN NULL
WHEN "supplier_address" = 'gf0JBoQDd7tgrzrddZ' THEN NULL
ELSE "supplier_address"
END AS "supplier_address",
CASE
WHEN "nation_id" = '14-606-487-0570' THEN NULL
WHEN "nation_id" = '15-679-861-2259' THEN NULL
WHEN "nation_id" = '25-843-787-7479' THEN NULL
ELSE "nation_id"
END AS "nation_id",
CASE
WHEN "supplier_phone" = '4032.68' THEN NULL
WHEN "supplier_phone" = '4641.08' THEN NULL
WHEN "supplier_phone" = '9915.24' THEN NULL
ELSE "supplier_phone"
END AS "supplier_phone",
CASE
WHEN "supplier_comments" = 'blithely silent requests after the express dependencies are sl' THEN NULL
WHEN "supplier_comments" = 'each slyly above the careful' THEN NULL
ELSE "supplier_comments"
END AS "supplier_comments",
CASE
WHEN "account_balance" = 'quickly above the quickly ironic deposits affix' THEN NULL
WHEN "account_balance" = 'riously even requests above the exp' THEN NULL
WHEN "account_balance" = 'slyly bold instructions. idle deposi' THEN NULL
ELSE "account_balance"
END AS "account_balance"
FROM "supplier_renamed_trimmed"
),
"supplier_renamed_trimmed_cleaned_casted" AS (
-- Column Type Casting:
-- account_balance: from VARCHAR to DECIMAL
-- nation_id: from VARCHAR to INT
SELECT
"supplier_id",
"supplier_name",
"supplier_address",
"supplier_phone",
"supplier_comments",
CAST("account_balance" AS DECIMAL)
AS "account_balance",
CAST("nation_id" AS INT)
AS "nation_id"
FROM "supplier_renamed_trimmed_cleaned"
)
-- COCOON BLOCK END
SELECT *
FROM "supplier_renamed_trimmed_cleaned_casted"
stg_supplier.yml (Document the table)
version: 2
models:
- name: stg_supplier
description: The table is about suppliers. It contains details of each supplier,
including their unique key, name, address, nation key, phone number, account balance,
and a comment. Each row represents a different supplier with their specific information.
The table appears to be part of a larger database system, likely for managing
supplier information in a business context.
columns:
- name: supplier_id
description: Unique identifier for the supplier
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each supplier.
For this table, each row represents a different supplier, and supplier_id
appears to be unique across rows (e.g., Supplier#000000001, Supplier#000000002).
data_type:
current_data_type: VARCHAR
- name: supplier_name
description: Name of the supplier
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: VARCHAR
- name: supplier_address
description: Address of the supplier
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: VARCHAR
- name: supplier_phone
description: Phone number of the supplier
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: VARCHAR
- name: supplier_comments
description: Additional comments or notes about the supplier
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: VARCHAR
- name: account_balance
description: Account balance of the supplier
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: DECIMAL
- name: nation_id
description: Key representing the supplier's nation
cocoon_meta:
missing_reason: Unknown
data_type:
current_data_type: INT
tests: []
stg_region
Table Preview (first 5 rows)
region_id | region_name | random_comment | |
---|---|---|---|
0 | 0 | AFRICA | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1 | 1 | AMERICA | hs use ironic even requests. s |
2 | 2 | ASIA | ges. thinly even pinto beans ca |
3 | 3 | EUROPE | ly final courts cajole furiously final excuse |
4 | 4 | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl |
stg_region.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:27:04.921327+00:00
WITH
"region_renamed" AS (
-- Rename: Renaming columns
-- R_REGIONKEY -> region_id
-- R_NAME -> region_name
-- R_COMMENT -> random_comment
SELECT
"R_REGIONKEY" AS "region_id",
"R_NAME" AS "region_name",
"R_COMMENT" AS "random_comment"
FROM "memory"."main"."region"
)
-- COCOON BLOCK END
SELECT *
FROM "region_renamed"
stg_region.yml (Document the table)
version: 2
models:
- name: stg_region
description: The table is about geographical regions. It contains details of different
world regions. Each region has a unique key, a name, and a comment. The regions
included are Africa, America, Asia, Europe, and Middle East. The comments seem
to be randomly generated text without specific meaning related to the regions.
columns:
- name: region_id
description: Unique identifier for each region
tests:
- not_null
- unique
cocoon_meta:
unique_reason: This column represents a unique identifier for each region. For
this table, each row represents a distinct geographical region. The region_id
appears to be unique across rows, as it's a standard practice to use unique
identifiers for such data.
data_type:
current_data_type: INT
- name: region_name
description: Name of the geographical region
tests:
- not_null
- unique
- accepted_values:
values:
- AFRICA
- AMERICA
- ASIA
- EUROPE
- MIDDLE EAST
cocoon_meta:
unique_reason: This column contains the name of each geographical region. For
this table, each row represents a distinct region, and the names provided
(Africa, America, Asia, Europe, Middle East) are unique. However, in a larger
dataset, there could potentially be regions with the same name (e.g., North
America and South America both containing "America"), so it's not guaranteed
to be unique.
future_accepted_values:
- OCEANIA
- ANTARCTICA
data_type:
current_data_type: VARCHAR
- name: random_comment
description: Random text associated with each region
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: VARCHAR
tests: []
stg_lineitem
Table Preview (first 5 rows)
shipping_instructions | supplier_id | extended_price | part_id | comments | order_id | quantity | return_flag | line_status | line_number | tax_rate | discount_rate | shipping_mode | commit_date | receipt_date | ship_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | DELIVER IN PERSON | 93 | 24710.35 | 1552 | None | 1 | 17 | N | O | 1 | 0.02 | 0.04 | TRUCK | 1996-02-12 | 1996-03-22 | 1996-03-13 |
1 | TAKE BACK RETURN | 75 | 56688.00 | 674 | None | 1 | 36 | N | O | 2 | 0.06 | 0.09 | 1996-02-28 | 1996-04-20 | 1996-04-12 | |
2 | TAKE BACK RETURN | 99 | 37850.00 | 1 | None | 2 | 38 | N | O | 1 | 0.05 | 0.00 | RAIL | 1997-01-14 | 1997-02-02 | 1997-01-28 |
3 | None | 56 | 8376.00 | 355 | None | 3 | 8 | R | F | 1 | 0.02 | 0.07 | AIR | 1994-01-04 | 1994-02-23 | 1994-02-02 |
4 | DELIVER IN PERSON | 79 | 47352.00 | 278 | None | 3 | 42 | R | F | 2 | 0.04 | 0.08 | TRUCK | 1994-02-09 | 1994-01-27 | 1994-01-13 |
stg_lineitem.sql (clean the table)
-- COCOON BLOCK START: PLEASE DO NOT MODIFY THIS BLOCK FOR SELF-MAINTENANCE
-- Generated at 2024-08-25 23:23:44.499443+00:00
WITH
"lineitem_renamed" AS (
-- Rename: Renaming columns
-- L_ORDERKEY -> order_id
-- L_PARTKEY -> part_id
-- L_SUPPKEY -> supplier_id
-- L_LINENUMBER -> line_number
-- L_QUANTITY -> quantity
-- L_EXTENDEDPRICE -> extended_price
-- L_DISCOUNT -> discount_rate
-- L_TAX -> tax_rate
-- L_RETURNFLAG -> return_flag
-- L_LINESTATUS -> line_status
-- L_SHIPDATE -> ship_date
-- L_COMMITDATE -> commit_date
-- L_RECEIPTDATE -> receipt_date
-- L_SHIPINSTRUCT -> shipping_instructions
-- L_SHIPMODE -> shipping_mode
-- L_COMMENT -> comments
SELECT
"L_ORDERKEY" AS "order_id",
"L_PARTKEY" AS "part_id",
"L_SUPPKEY" AS "supplier_id",
"L_LINENUMBER" AS "line_number",
"L_QUANTITY" AS "quantity",
"L_EXTENDEDPRICE" AS "extended_price",
"L_DISCOUNT" AS "discount_rate",
"L_TAX" AS "tax_rate",
"L_RETURNFLAG" AS "return_flag",
"L_LINESTATUS" AS "line_status",
"L_SHIPDATE" AS "ship_date",
"L_COMMITDATE" AS "commit_date",
"L_RECEIPTDATE" AS "receipt_date",
"L_SHIPINSTRUCT" AS "shipping_instructions",
"L_SHIPMODE" AS "shipping_mode",
"L_COMMENT" AS "comments"
FROM "memory"."main"."lineitem"
),
"lineitem_renamed_trimmed" AS (
-- Trim Leading and Trailing Spaces
SELECT
"order_id",
"part_id",
"supplier_id",
"line_number",
"quantity",
"extended_price",
"discount_rate",
"tax_rate",
"return_flag",
"line_status",
"ship_date",
"commit_date",
"receipt_date",
"shipping_instructions",
"shipping_mode",
TRIM("comments") AS "comments"
FROM "lineitem_renamed"
),
"lineitem_renamed_trimmed_cleaned" AS (
-- Clean unusual string values:
-- comments: The problem is that all the values in the comments column are incomplete phrases or sentence fragments without clear meaning or context. They appear to be truncated or partial sentences, making them difficult to interpret or use meaningfully. Since there's no clear pattern or way to reconstruct the full sentences, and the fragments don't carry any coherent meaning on their own, the best approach is to map all these unusual values to an empty string.
SELECT
"order_id",
"part_id",
"supplier_id",
"line_number",
"quantity",
"extended_price",
"discount_rate",
"tax_rate",
"return_flag",
"line_status",
"ship_date",
"commit_date",
"receipt_date",
"shipping_instructions",
"shipping_mode",
CASE
WHEN "comments" = 'blithely bold excuses wake fluffily' THEN NULL
WHEN "comments" = 'egular courts above the' THEN NULL
WHEN "comments" = 'ly final dependencies: slyly bold' THEN NULL
WHEN "comments" = 'ongside of the furiously brave acco' THEN NULL
WHEN "comments" = 'ven requests. deposits breach a' THEN NULL
ELSE "comments"
END AS "comments"
FROM "lineitem_renamed_trimmed"
),
"lineitem_renamed_trimmed_cleaned_null" AS (
-- NULL Imputation: Impute Null to Disguised Missing Values
-- shipping_instructions: ['NONE']
SELECT
CASE
WHEN "shipping_instructions" = 'NONE' THEN NULL
ELSE "shipping_instructions"
END AS "shipping_instructions",
"receipt_date",
"supplier_id",
"extended_price",
"part_id",
"comments",
"order_id",
"ship_date",
"quantity",
"return_flag",
"line_status",
"line_number",
"tax_rate",
"commit_date",
"discount_rate",
"shipping_mode"
FROM "lineitem_renamed_trimmed_cleaned"
),
"lineitem_renamed_trimmed_cleaned_null_casted" AS (
-- Column Type Casting:
-- commit_date: from VARCHAR to DATE
-- receipt_date: from VARCHAR to DATE
-- ship_date: from VARCHAR to DATE
SELECT
"shipping_instructions",
"supplier_id",
"extended_price",
"part_id",
"comments",
"order_id",
"quantity",
"return_flag",
"line_status",
"line_number",
"tax_rate",
"discount_rate",
"shipping_mode",
CAST("commit_date" AS DATE)
AS "commit_date",
CAST("receipt_date" AS DATE)
AS "receipt_date",
CAST("ship_date" AS DATE)
AS "ship_date"
FROM "lineitem_renamed_trimmed_cleaned_null"
)
-- COCOON BLOCK END
SELECT *
FROM "lineitem_renamed_trimmed_cleaned_null_casted"
stg_lineitem.yml (Document the table)
version: 2
models:
- name: stg_lineitem
description: The table is about order line items. It contains details such as order
key, part key, supplier key, quantity, price, discount, tax, shipping dates, and
status. Each row represents a specific item within an order, including its shipping
instructions, mode, and comments. The table provides a comprehensive view of individual
items ordered, their pricing, and logistics information.
columns:
- name: shipping_instructions
description: Special shipping instructions
tests:
- accepted_values:
values:
- DELIVER IN PERSON
- TAKE BACK RETURN
cocoon_meta:
missing_reason: Unknown
future_accepted_values:
- SIGNATURE REQUIRED
- LEAVE AT DOOR
- CALL BEFORE DELIVERY
- DO NOT LEAVE UNATTENDED
- FRAGILE - HANDLE WITH CARE
- REFRIGERATE UPON ARRIVAL
- KEEP DRY
- THIS SIDE UP
- PERISHABLE
- HAZARDOUS MATERIALS
- INSURANCE REQUIRED
- APPOINTMENT NECESSARY
- NO WEEKEND DELIVERY
data_type:
current_data_type: VARCHAR
- name: supplier_id
description: Identifier for the supplier
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: extended_price
description: Total price for the line item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: part_id
description: Identifier for the part/product
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: comments
description: Additional comments about the line item
cocoon_meta:
missing_reason: Optional field for additional information, not always needed.
data_type:
current_data_type: VARCHAR
- name: order_id
description: Unique identifier for the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: quantity
description: Quantity of the item ordered
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: return_flag
description: Return status flag
tests:
- not_null
- accepted_values:
values:
- N
- R
cocoon_meta:
future_accepted_values:
- P
- A
- C
data_type:
current_data_type: VARCHAR
- name: line_status
description: Status of the line item
tests:
- not_null
- accepted_values:
values:
- O
- F
cocoon_meta:
future_accepted_values:
- C
- P
- B
- R
- S
- H
data_type:
current_data_type: VARCHAR
- name: line_number
description: Line number of the item within the order
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: INT
- name: tax_rate
description: Tax rate applied to the item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: discount_rate
description: Discount rate applied to the item
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DECIMAL
- name: shipping_mode
description: Mode of shipping
tests:
- not_null
- accepted_values:
values:
- TRUCK
- AIR
- MAIL
- RAIL
cocoon_meta:
future_accepted_values:
- SHIP
- SEA
- EXPRESS
- FREIGHT
- COURIER
data_type:
current_data_type: VARCHAR
- name: commit_date
description: Date the item was committed for shipping
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DATE
- name: receipt_date
description: Date the item was received
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DATE
- name: ship_date
description: Date the item was shipped
tests:
- not_null
cocoon_meta:
data_type:
current_data_type: DATE
tests: []
Join Graph (FK to PK)
cocoon_join.yml (Document the joins)
join_graph:
- table_name: stg_nation
primary_key: nation_id
foreign_keys:
- column: region_id
reference:
table_name: stg_region
column: region_id
- table_name: stg_customer
foreign_keys:
- column: nation_id
reference:
table_name: stg_nation
column: nation_id
referential_integrity:
Orphan: 100.00%
explanation: ''
primary_key: customer_id
- table_name: stg_supplier
foreign_keys:
- column: nation_id
reference:
table_name: stg_nation
column: nation_id
referential_integrity:
Orphan: 50.00%
explanation: ''
primary_key: supplier_id
- table_name: stg_order
primary_key: order_id
foreign_keys:
- column: customer_id
reference:
table_name: stg_customer
column: customer_id
time_keys:
- order_date
- table_name: stg_lineitem
foreign_keys:
- column: order_id
reference:
table_name: stg_order
column: order_id
- column: part_id
reference:
table_name: stg_part
column: part_id
referential_integrity:
Orphan: 80.00%
explanation: ''
- column: supplier_id
reference:
table_name: stg_supplier
column: supplier_id
time_keys:
- commit_date
- receipt_date
- ship_date
- table_name: stg_part
primary_key: part_id
foreign_keys: []
- table_name: stg_partsupp
foreign_keys:
- column: part_key
reference:
table_name: stg_part
column: part_id
- column: supplier_key
reference:
table_name: stg_supplier
column: supplier_id
- table_name: stg_region
primary_key: region_id
foreign_keys: []
All
Process Story
cocoon_er.yml (Document the entity relationships)
groups: []
entities:
- entity_name: Nations
entity_description: Represents countries with their unique identifiers, names, region
associations, and brief descriptions.
table_name: stg_nation
primary_key: nation_id
- entity_name: Orders
entity_description: Represents customer orders with details such as order ID, customer
information, financial data, dates, and processing information.
table_name: stg_order
primary_key: order_id
- entity_name: Parts
entity_description: Represents product parts with their unique identifiers, names,
specifications, pricing, and other attributes.
table_name: stg_part
primary_key: part_id
- entity_name: Suppliers
entity_description: Represents suppliers with their unique identifiers, contact
information, location details, and financial data.
table_name: stg_supplier
primary_key: supplier_id
- entity_name: Regions
entity_description: Represents geographical regions with their unique identifiers,
names, and associated comments.
table_name: stg_region
primary_key: region_id
- entity_name: Customers
entity_description: Represents customers with their unique identifiers, contact
information, financial data, market segments, and other relevant details.
table_name: stg_customer
primary_key: customer_id
relations:
- relation_name: NationRegionAssociation
relation_description: Nations belong to specific Regions, with each Nation associated
with one Region through a region_id.
table_name: stg_nation
entities:
- Nations
- Regions
- relation_name: CustomerOrders
relation_description: This captures the Orders placed by Customers, including order
details, status, and pricing information.
table_name: stg_order
entities:
- Orders
- Customers
- relation_name: SupplierNationAffiliation
relation_description: Suppliers are associated with Nations, indicating the country
where each supplier is based or operates from.
table_name: stg_supplier
entities:
- Suppliers
- Nations
- relation_name: CustomerNationAssociation
relation_description: This stores the Customers and their associated Nations, where
each Customer belongs to a specific Nation.
table_name: stg_customer
entities:
- Customers
- Nations
- relation_name: OrderLineItems
relation_description: Suppliers provide Parts that are included in Orders as line
items with specific quantities, prices, and shipping details.
table_name: stg_lineitem
entities:
- Orders
- Parts
- Suppliers
- relation_name: PartSupplierInventory
relation_description: This represents the availability and pricing of Parts from
specific Suppliers, including quantity and supply cost details.
table_name: stg_partsupp
entities:
- Parts
- Suppliers
story:
- name: NationRegionAssociation
description: Nations group into regions for economic cooperation.
type: relation
- name: SupplierNationAffiliation
description: Suppliers register their operations in specific nations.
type: relation
- name: CustomerNationAssociation
description: Customers provide their national location for business.
type: relation
- name: PartSupplierInventory
description: Suppliers list available parts with prices and quantities.
type: relation
- name: CustomerOrders
description: Customers place orders for desired products.
type: relation
- name: OrderLineItems
description: Orders detail parts, quantities, and chosen suppliers.
type: relation