Meta-Data Driven Data Ingestion Using Mapreduce Framework

Published February 3, 2015

Assignee: not available in the USPTO data we have

Patent Claims

27 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for automatically ingesting data into a data warehouse, comprising: providing a datahub server for executing data loading tasks; providing a generic pipelined data loading framework that leverages a MapReduce environment for ingestion of a plurality of heterogeneous data sources; and providing a processor implemented meta-data model comprised of a plurality of configuration files and a catalog; wherein a configuration file is setup per ingestion task; wherein said catalog manages data warehouse schema; wherein when a scheduled data loading task is executed by said datahub server; and wherein said configuration files and said catalog collaboratively drive the datahub server to load the heterogeneous data to their destination schemas automatically and independently of data source heterogeneities and data warehouse schema evolvement.

2. The method of claim 1, wherein said data sources comprise marketing-related data from a plurality of different media channels.

3. An apparatus for automatically ingesting data into a data warehouse, comprising: a datahub server for executing data loading tasks comprising heterogeneous data received from a plurality of different servers; a generic pipelined data loading framework that leverages a MapReduce environment for ingestion of a plurality of heterogeneous data sources; and a processor implemented meta-data model comprised of a plurality of configuration files and a catalog; wherein a configuration file is setup per ingestion task; wherein said catalog manages data warehouse schema; wherein when a scheduled data loading task is executed by said datahub server; and wherein said configuration files and said catalog collaboratively drive the datahub server to load the heterogeneous data to their destination schemas automatically and independently of data source heterogeneities and data warehouse schema evolvement into a Hadoop cluster.

4. The apparatus of claim 3, wherein said datahub server monitors and coordinates pipeline jobs by communicating with the Hadoop cluster and a ZooKeeper server.

5. The apparatus of claim 4, wherein each ingestion task receives source files and extracts, transforms, and loads data comprising said source files through said pipelined data loading framework to a destination location; wherein progress of said pipelined data loading framework is monitored by a pipeline status file; wherein synchronization of access to said pipeline status file is performed via communication between said datahub server and said ZooKeeper server.

6. The apparatus of claim 3, wherein said pipelined data loading framework is run sequentially; and wherein for most stages, said pipelined data loading framework invokes MapReduce jobs on said Hadoop cluster to accomplish a corresponding task.

7. The apparatus of claim 3, wherein said data sources comprise marketing-related data from a plurality of different media channels.

8. A method for automatically ingesting data into a data warehouse, comprising: providing a datahub server for executing data loading tasks; providing a generic pipelined data loading framework that leverages a MapReduce environment for ingestion of a plurality of heterogeneous data sources; and providing a processor implemented meta-data model comprised of a plurality of configuration files and a catalog; wherein a configuration file is setup per ingestion task; wherein said catalog manages data warehouse schema; wherein when a scheduled data loading task is executed by said datahub server; wherein said configuration files and said catalog collaboratively drive the datahub server to load the heterogeneous data to their destination schemas automatically and independently of data source heterogeneities and data warehouse schema evolvement; said datahub server executing said data loading task by: downloading and transforming a job running on said datahub server by referring to a configuration file and pipeline status files to determine where, what, and how to download the data source files to a local working directory, and then transforming the files.

9. The method of claim 8, wherein said job comprises any of a sanity check job, MapReduce (MR) join job, ingestion job, download and transform job, and commit job, and wherein said sanity check job, MR join job, and ingestion job each comprise a MapReduce job driven by the datahub server and running on a Hadoop cluster.

10. The method of claim 8, said datahub server executing said data loading task by performing the further step of: sanity checking said job; said datahub server driving a MapReduce job to parse input files produced by said downloading and transforming once, determining whether the input file is a valid data source, and then passing valid input files to a next job in the pipelined data loading framework.

11. The method of claim 10, said datahub server executing said data loading task by performing the further step of: MapReduce (MR) joining said job by using a MapReduce framework, first reading both newly arrived clients' files and existing destination data warehouse files, and then performing a join of the newly arrived clients' files and existing destination data warehouse files to produce a result for a next job to consume.

12. The method of claim 11, said datahub server executing said data loading task by performing the further step of: committing the job by renaming previous job output folders to an output folder whose contents are to be consumed by an ingestion job.

13. The method of claim 12, said datahub server executing said data loading task by performing the further step of: ingesting said MapReduce job, by consuming all join output from previous stages of the pipeline and ingesting all join results into destination data files.
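
Claims 8 through 13 name the pipeline stages in order: download and transform, sanity check, MR join, commit, and ingestion. The following is a minimal sketch of that sequencing only, not the patented implementation. The stage bodies, the file names, the in-memory `ctx` dictionary, and the `status` dictionary (standing in for the pipeline status file) are all assumptions made for illustration; in the claimed system the sanity check, join, and ingestion stages would run as MapReduce jobs on a Hadoop cluster.

```python
from typing import Callable, Dict, List, Tuple


def download_and_transform(ctx: dict) -> None:
    # Hypothetical file names standing in for downloaded source files.
    ctx["files"] = ["a.csv", "bad.bin", "b.csv"]


def sanity_check(ctx: dict) -> None:
    # Toy validity rule (assumption): only .csv files are valid sources.
    ctx["files"] = [f for f in ctx["files"] if f.endswith(".csv")]


def mr_join(ctx: dict) -> None:
    # Placeholder for the join of new files with existing warehouse data.
    ctx["join_output"] = [f"joined:{f}" for f in ctx["files"]]


def commit(ctx: dict) -> None:
    # The claim describes renaming output folders; here we only relabel.
    ctx["committed"] = list(ctx["join_output"])


def ingest(ctx: dict) -> None:
    # Placeholder for ingesting join results into destination files.
    ctx["ingested"] = len(ctx["committed"])


STAGES: List[Tuple[str, Callable[[dict], None]]] = [
    ("download_and_transform", download_and_transform),
    ("sanity_check", sanity_check),
    ("mr_join", mr_join),
    ("commit", commit),
    ("ingest", ingest),
]


def run_pipeline(ctx: dict) -> Dict[str, str]:
    """Run every stage in order and record its completion in a status dict."""
    status: Dict[str, str] = {}
    for name, stage in STAGES:
        stage(ctx)
        status[name] = "done"
    return status
```

Calling `run_pipeline({})` runs the five stages in sequence and returns a status entry per stage, which mirrors how the claims describe progress being tracked stage by stage.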

14. A datahub server, comprising: a processor implemented framework for leveraging a MapReduce environment to route source data to a destination; said framework consulting meta-data to carry out different instances of a pipeline to perform ingestion tasks; wherein meta-data modeling during data ingestion comprises destination schema modeling via a catalog, and client configuration modeling per ingestion task via a configuration file.

15. The datahub server of claim 14, said catalog supporting schema evolution without changing framework code by modeling destination schema using any of the following schema properties: a property that maintains an integer array representing all available schemas, wherein for each table schema, a unique integer is assigned as its identity (ID); a property that stores a descriptive name of a table identified by ID; a property that stores a latest version of a table identified by ID; a property that stores an absolute Hadoop file system (HDFS) path where a table identified by ID is stored; a property that stores a versioned schema of a table identified by ID; and a property that stores default values of a versioned schema of a table identified by ID; wherein said properties record an evolvement history of each schema; and wherein a record can be dynamically evolved from an earlier version of a given schema to a later version of the same schema by consulting said catalog.
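
The six catalog properties listed in claim 15 can be pictured as a flat key-value store. The sketch below is illustrative only: every key name, table name, path, and column is hypothetical, and it assumes versions are numbered contiguously from 1, which the claim does not state.

```python
from typing import Dict, List, Tuple

# Hypothetical catalog contents; none of these keys or values come from the patent.
CATALOG: Dict[str, object] = {
    "schema.ids": [1, 2],                      # integer array of all schema IDs
    "1.name": "impressions",                   # descriptive table name
    "1.latest_version": 2,                     # latest version of table 1
    "1.hdfs_path": "/warehouse/impressions",   # absolute HDFS path
    "1.v1.schema": [("user_id", "int"), ("ts", "str")],
    "1.v1.defaults": {"user_id": 0, "ts": ""},
    "1.v2.schema": [("user_id", "int"), ("ts", "str"), ("channel", "str")],
    "1.v2.defaults": {"user_id": 0, "ts": "", "channel": "unknown"},
    "2.name": "clicks",
    "2.latest_version": 1,
    "2.hdfs_path": "/warehouse/clicks",
    "2.v1.schema": [("user_id", "int"), ("url", "str")],
    "2.v1.defaults": {"user_id": 0, "url": ""},
}


def latest_schema(catalog: Dict[str, object], table_id: int) -> Tuple[int, List[Tuple[str, str]]]:
    """Return (latest version, its column list) for a table ID."""
    version = catalog[f"{table_id}.latest_version"]
    return version, catalog[f"{table_id}.v{version}.schema"]


def schema_history(catalog: Dict[str, object], table_id: int) -> Dict[int, List[Tuple[str, str]]]:
    """Return every recorded version of a table's schema, oldest first.

    Assumes versions run 1..latest with no gaps.
    """
    latest = catalog[f"{table_id}.latest_version"]
    return {v: catalog[f"{table_id}.v{v}.schema"] for v in range(1, latest + 1)}
```

Because old versions stay in the catalog alongside the latest one, adding a column means adding new entries rather than changing framework code, which is the point the claim makes about schema evolution.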

16. The datahub server of claim 15, wherein record evolvement to a same schema from a first version to a second version comprises: creating a default record of a same schema using said second different version, wherein said default record is instantiated with default values of said second version; and looking up said catalog to find differences between said first version's and said second version's schemas, and automatically using said first version's data to replace data in said default record; wherein if there is a correspondence between said first version's column and said second version's column, a direct copy is performed, with type casting if it is necessary; wherein if there is no such correspondence for said first version's column, that column is dropped; and wherein a new record is created containing said second version's schema with either said first version's data or the default value of said second version.
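
The evolvement steps in claim 16 (start from a default record, copy corresponding columns with casting, drop the rest) can be sketched in a few lines. This is an illustration under stated assumptions, not the patented code: the `evolve` function name is hypothetical, columns are assumed to correspond when their names match, and Python types stand in for the schema's column types.

```python
from typing import Any, Dict, List, Tuple

Schema = List[Tuple[str, type]]  # (column name, Python type); an assumed representation


def evolve(old_values: Dict[str, Any], new_schema: Schema,
           new_defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Evolve one record from an earlier schema version to a later one."""
    # Step 1: create a default record of the target version.
    record = dict(new_defaults)
    # Step 2: for each target column with a counterpart in the old record,
    # copy the old value, casting it when the types differ.
    for name, col_type in new_schema:
        if name in old_values:
            value = old_values[name]
            record[name] = value if isinstance(value, col_type) else col_type(value)
    # Old columns with no counterpart are never copied, so they are dropped.
    return record


old = {"user_id": "42", "ts": "2015-02-03", "legacy_flag": True}
v2_schema: Schema = [("user_id", int), ("ts", str), ("channel", str)]
v2_defaults = {"user_id": 0, "ts": "", "channel": "unknown"}
evolved = evolve(old, v2_schema, v2_defaults)
# evolved == {"user_id": 42, "ts": "2015-02-03", "channel": "unknown"}
```

Here `user_id` is cast from string to integer, `legacy_flag` is dropped because version 2 has no such column, and `channel` keeps the version 2 default.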

17. The datahub server of claim 14, wherein said configuration file is setup per data ingestion task to address schema mapping and other heterogeneity issues using any of the following properties: at least two properties that identify to which schema and version source files go; a property that identifies a date format used by the source data; a property that defines mappings between the source data's schema and a destination schema; a property that identifies to which physical table partition the source data goes; a property that identifies a file transfer protocol; a property that identifies a user name that is used to login to a data source server; and a property that identifies a password that is used to login to the data source server.
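
A per-task configuration file carrying the properties listed in claim 17 might look like the key-value text below. All key names, values, and the `src:dest` mapping syntax are hypothetical; the claim lists the kinds of properties but not their format. The sketch parses the file and applies the column mapping and date format to one source row.

```python
from datetime import datetime
from typing import Dict

# Hypothetical configuration for one ingestion task.
CONFIG_TEXT = """\
schema.id = 1
schema.version = 2
date.format = %m/%d/%Y
column.mapping = uid:user_id, when:ts
partition = 2015-02
protocol = sftp
username = example_user
password = example_password
"""


def parse_config(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def column_mapping(props: Dict[str, str]) -> Dict[str, str]:
    """Turn 'src:dest, src:dest' into a source-to-destination dict."""
    pairs = (p.split(":") for p in props["column.mapping"].split(","))
    return {src.strip(): dst.strip() for src, dst in pairs}


def to_destination(row: Dict[str, str], props: Dict[str, str],
                   date_column: str) -> Dict[str, str]:
    """Rename mapped columns, drop unmapped ones, and normalize one date column."""
    mapping = column_mapping(props)
    out = {mapping[k]: v for k, v in row.items() if k in mapping}
    parsed = datetime.strptime(out[date_column], props["date.format"])
    out[date_column] = parsed.date().isoformat()
    return out


props = parse_config(CONFIG_TEXT)
row = to_destination({"uid": "42", "when": "02/03/2015", "extra": "x"}, props, "ts")
# row == {"user_id": "42", "ts": "2015-02-03"}
```

Keeping these source-specific details in a file per task is what lets one generic framework handle many differently shaped sources. A real deployment would not store a plaintext password this way; it appears here only because the claim lists it as a property.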

18. The datahub server of claim 14, said datahub server further comprising: a record abstraction facility for version reconciliation; wherein a record is a class, data structure, storing one tuple of a given schema and a given version combination, said record comprising a value array and a versioned schema; wherein the value array holds a binary of the data and the schema keeps meta-data for the data; and wherein the schema is an in-memory representation of a versioned schema specified in the catalog.
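
Claim 18 describes a record as a value array holding binary data plus a versioned schema holding the meta-data needed to interpret it. The sketch below shows one way such a structure could look. The class names, the two type tags, and the encoding (8-byte big-endian integers, UTF-8 strings) are assumptions chosen for illustration and are not specified by the claim.

```python
import struct
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class VersionedSchema:
    table_id: int
    version: int
    columns: Tuple[Tuple[str, str], ...]  # (name, type tag: "int" or "str")


def _encode(type_tag: str, value: Any) -> bytes:
    if type_tag == "int":
        return struct.pack(">q", value)   # 8-byte big-endian signed integer
    return str(value).encode("utf-8")


def _decode(type_tag: str, raw: bytes) -> Any:
    if type_tag == "int":
        return struct.unpack(">q", raw)[0]
    return raw.decode("utf-8")


@dataclass
class Record:
    schema: VersionedSchema   # meta-data describing the values
    values: List[bytes]       # one binary value per schema column

    @classmethod
    def from_python(cls, schema: VersionedSchema, row: Dict[str, Any]) -> "Record":
        """Encode a plain dict into the binary value array, in schema order."""
        return cls(schema, [_encode(tag, row[name]) for name, tag in schema.columns])

    def get(self, name: str) -> Any:
        """Decode one column by name, using the schema to find its slot and type."""
        for index, (col_name, tag) in enumerate(self.schema.columns):
            if col_name == name:
                return _decode(tag, self.values[index])
        raise KeyError(name)


schema = VersionedSchema(1, 2, (("user_id", "int"), ("channel", "str")))
record = Record.from_python(schema, {"user_id": 42, "channel": "tv"})
```

The value array on its own is opaque bytes; `record.get("user_id")` can return `42` only because the schema says which slot to read and how to decode it.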

19. The datahub server of claim 18, said record abstraction facility further comprising: a function which, when invoked, converts a current record to a latest version of the current record's schema by consulting the catalog.

20. A process, comprising: providing a processor implemented framework for leveraging a MapReduce environment to route source data to a destination; said framework consulting meta-data to carry out different instances of a pipeline to perform ingestion tasks; wherein meta-data modeling during data ingestion comprises destination schema modeling via a catalog, and client configuration modeling per ingestion task via a configuration file.

21. The process of claim 20, said catalog supporting schema evolution without changing framework code by modeling destination schema using any of the following schema properties: a property that maintains an integer array representing all available schemas, wherein for each table schema, a unique integer is assigned as its identity (ID); a property that stores a descriptive name of a table identified by ID; a property that stores a latest version of a table identified by ID; a property that stores an absolute Hadoop file system (HDFS) path where a table identified by ID is stored; a property that stores a versioned schema of a table identified by ID; and a property that stores default values of a versioned schema of a table identified by ID; wherein said properties record an evolvement history of each schema; and wherein a record can be dynamically evolved from an earlier version of a given schema to a later version of the same schema by consulting said catalog.

22. The process of claim 21, wherein record evolvement to a same schema from a first version to a second version comprises: creating a default record of a same schema using said second different version, wherein said default record is instantiated with default values of said second version; and looking up said catalog to find differences between said first version's and said second version's schemas, and automatically using said first version's data to replace data in said default record; wherein if there is a correspondence between said first version's column and said second version's column, a direct copy is performed, with type casting if it is necessary; wherein if there is no such correspondence for said first version's column, that column is dropped; and wherein a new record is created containing said second version's schema with either said first version's data or the default value of said second version.

23. The process of claim 20, wherein said configuration file is setup per data ingestion task to address schema mapping and other heterogeneity issues using any of the following properties: one or more properties that identify to which schema and version source files go; a property that identifies a date format used by the source data; a property that defines mappings between the source data's schema and a destination schema; a property that identifies to which physical table partition the source data goes; a property that identifies a file transfer protocol; a property that identifies a user name that is used to login to a data source server; and a property that identifies a password that is used to login to the data source server.

24. The process of claim 20, further comprising: providing a record abstraction facility for version reconciliation; wherein a record is a class, data structure, storing one tuple of a given schema and a given version combination, said record comprising a value array and a versioned schema; wherein the value array holds a binary of the data and the schema keeps meta-data for the data; and wherein the schema is an in-memory representation of a versioned schema specified in the catalog.

25. The process of claim 24, further comprising: providing a function which, when invoked, converts a current record to a latest version of the current record's schema.

26. A method for automatically ingesting data into a data warehouse, comprising: providing a datahub server for following one or more configuration files to load a plurality of heterogeneous sources of data into a Hadoop file system (HDFS); said datahub server launching a MapReduce job to join all of said heterogeneous data sources with existing data in a common destination schema; and said datahub server performing a join task to join client data with existing data of a same schema by launching a MapReduce job which reads all newly arrived data and the existing data of the destination schema, and which performs the join in a reducer of a MapReduce framework.
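
Claim 26 places the join in the reducer: mappers read both the newly arrived data and the existing destination data, and rows sharing a key meet in the reduce step. The sketch below simulates that flow in plain Python without Hadoop. The function names, the source tags, and the merge rule (a newly arrived row overrides the existing row for the same key) are assumptions; the claim does not specify the join semantics.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]
Tagged = Tuple[str, Row]  # (source tag, row)


def map_phase(rows: Iterable[Row], source_tag: str,
              key_column: str) -> Iterator[Tuple[Any, Tagged]]:
    """Emit (join key, tagged row) so the reducer can tell the sources apart."""
    for row in rows:
        yield row[key_column], (source_tag, row)


def shuffle(pairs: Iterable[Tuple[Any, Tagged]]) -> Dict[Any, List[Tagged]]:
    """Group mapper output by key, as the MapReduce shuffle would."""
    groups: Dict[Any, List[Tagged]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(tagged_rows: List[Tagged]) -> Row:
    """Merge all rows for one key; 'new' rows are applied after 'existing' ones."""
    merged: Row = {}
    for wanted_tag in ("existing", "new"):
        for tag, row in tagged_rows:
            if tag == wanted_tag:
                merged.update(row)
    return merged


def reduce_side_join(existing: List[Row], new: List[Row], key_column: str) -> List[Row]:
    pairs = list(map_phase(existing, "existing", key_column))
    pairs += list(map_phase(new, "new", key_column))
    groups = shuffle(pairs)
    return [reduce_join(groups[key]) for key in sorted(groups)]


existing_rows = [{"id": 1, "clicks": 5}, {"id": 2, "clicks": 7}]
new_rows = [{"id": 2, "clicks": 9}, {"id": 3, "clicks": 1}]
joined = reduce_side_join(existing_rows, new_rows, "id")
# joined == [{"id": 1, "clicks": 5}, {"id": 2, "clicks": 9}, {"id": 3, "clicks": 1}]
```

Tagging each row with its source in the map step is what allows the reducer to apply a rule such as "new overrides existing" once rows for the same key have been grouped together.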

27. An apparatus for automatically ingesting data into a data warehouse, comprising: a datahub server following one or more configuration files to load a plurality of heterogeneous sources of data into a Hadoop file system (HDFS); said datahub server launching a MapReduce job to join all of said heterogeneous data sources with existing data in a common destination schema; and said datahub server performing a join task to join client data with existing data of a same schema by launching a MapReduce job which reads all newly arrived data and the existing data of the destination schema, and which performs the join in a reducer of a MapReduce framework.

Patent Metadata

Filing Date: Unknown

Publication Date: February 3, 2015

Inventors: Mingxi Wu, Songting Chen
